RNDLP: A Distributed Framework for Supporting Continuous k-Similarity Trajectories Search over Road Network

Jiang, Hong; Tong, Sainan; Zhu, Rui; Wei, Baoze

doi:10.3390/math12020270

¹

School of Management, Shenyang University of Technology, Shenyang 110870, China

²

School of Computer Science, Shenyang Aerospace University, Shenyang 110136, China

³

Department of Energy Technology, Aalborg University, 9220 Aalborg, Denmark

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Mathematics 2024, 12(2), 270; https://doi.org/10.3390/math12020270

Submission received: 7 December 2023 / Revised: 8 January 2024 / Accepted: 9 January 2024 / Published: 14 January 2024

(This article belongs to the Special Issue Mathematical Modeling for Parallel and Distributed Processing)

Abstract

Continuous k-similarity trajectories search over a data stream is an important problem in the domain of spatio-temporal databases. Given a set of trajectories

T

and a query trajectory

T_{q}

over road network

G

, the system monitors trajectories within

T

, reporting k trajectories that are the most similar to

T_{q}

whenever one time unit passes. Some existing works study k-similarity trajectories search over trajectory data, but they do not work in a road network environment, especially when the trajectory set is large. In this paper, we propose a novel framework named RNDLP (Road Network-based Distance Lower-bound-based Prediction) to support continuous k-similarity trajectories search over road networks (CKTRN). It is a distributed framework based on the following observation: given a trajectory

T_{i}

and the query trajectory

T_{q}

, when we have knowledge of D(

T_{i}

), we can compute the lower-bound and upper-bound distances between

T_{q}

and

T_{i}

, which enables us to predict the scores of trajectories in

T

and employ these predictions to assess the significance of trajectories within

T

. Accordingly, we can form a mathematical model to evaluate the expected running cost of each trajectory. Based on this model, we propose a partition algorithm that distributes trajectories across a group of servers so that the workload of each server is as balanced as possible. On each server, we propose a pair-based algorithm to predict the earliest time at which

T_{i}

could become a query result, and use the predicted results to organize these trajectories. Our proposed algorithm supports query processing by accessing only a few points of a small number of trajectories whenever trajectories are updated. Finally, we conduct extensive performance studies on large real and synthetic datasets, which demonstrate that our new framework can efficiently support CKTRN over a data stream.

Keywords:

trajectory stream; k-similarity trajectories search; distributed; continuous query

MSC:

68P20

1. Introduction

This paper addresses the challenge of continuous k-similarity trajectories search over road networks (abbreviated as CKTRN), a problem with various applications [1,2,3,4,5] across diverse domains [6,7,8,9]. For instance, it helps identify compressed representations of trajectories that preserve their essential characteristics, which reduces storage requirements and transmission costs. Moreover, CKTRN plays a pivotal role in traffic analysis by uncovering trajectories with consistent patterns and behaviors. This information is valuable for predicting congestion, understanding traffic flow, and optimizing road networks. Lastly, it facilitates the clustering and grouping of moving objects that exhibit similar movement patterns, providing insights for urban planning and related tasks.

Let

G 〈 V, E 〉

represent the road network with V being the vertex set and E the edge set [1,10]. Each edge e is represented by the tuple

〈 v_{s}, v_{e}, w 〉

, where

v_{s}

and

v_{e}

denote the starting and ending points of the edge e, respectively, and w refers to the weight of e, which equals the distance between

v_{s}

and

v_{e}

within G. Consequently, the trajectory of a moving object o across G is defined as the tuple

T 〈 o, P, n 〉

. This includes a collection of n points

p_{1}, p_{2}, \dots, p_{n}

generated by o over the last n time units. Each point

p \in P

within G is represented as

p 〈 e, v, d 〉

, referring to the fact that p is positioned on edge e, arriving at vertex v after covering a distance of d. In this paper, points in P are modeled by a time-based window [11]. Under this setting, P holds the points generated during the last n time units. Whenever one time unit passes, the first point

p_{1} \in P

could be regarded as an expired point, and we remove it from P. A newly generated point is inserted into P.
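The time-based window described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Point`, `Trajectory`, and `update` are hypothetical and do not come from the paper, and one point is assumed to arrive per time unit.

```python
from collections import deque
from typing import NamedTuple, Optional


class Point(NamedTuple):
    """A point p<e, v, d> on the road network (field names are illustrative)."""
    edge: tuple    # the edge e = (v_s, v_e, w) on which the point lies
    vertex: int    # the vertex v associated with the point
    dist: float    # the distance d associated with v


class Trajectory:
    """Keeps only the points generated during the last n time units."""

    def __init__(self, obj_id: str, n: int) -> None:
        self.obj_id = obj_id
        # A bounded deque discards its oldest element when a new one is
        # appended to a full window, which mirrors point expiry.
        self.points: deque = deque(maxlen=n)

    def update(self, new_point: Point) -> Optional[Point]:
        """Advance one time unit; return the expired point, if any."""
        full = len(self.points) == self.points.maxlen
        expired = self.points[0] if full else None
        self.points.append(new_point)
        return expired
```

With n = 4, the fifth call to `update` returns the first point and leaves the four most recent points in the window.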

Let

T_{q}

represent a query trajectory [1,2,12], which monitors a set of trajectories denoted as

T

. Whenever one time unit is passed, it performs a search among the trajectories in

T

and returns the k trajectories with the lowest scores to the system. In this context, the score of a trajectory

T_{i} \in T

, denoted as D(

T_{i}

), is determined by the distance between

T_{i}

and the query trajectory

T_{q}

. Given a trajectory

T \in T

and a query trajectory

T_{q}

, the distance between the corresponding point

T (p_{i})

and

T_{q} (p_{i})

is defined as the shortest distance between these points within the road network G. The distance between trajectory T and query trajectory

T_{q}

is calculated as the sum of distances among their corresponding points.

Take an example in Figure 1a. There are three trajectories

{T_{1}, T_{2}, T_{3}}

contained in

T

. Each trajectory

T_{i}

contains four GPS points generated by the moving object

o_{i}

during the last four time units, i.e.,

T_{2}

contains four GPS points generated by the moving object

o_{2}

, which are

{p_{2}^{1}, p_{2}^{2}, p_{2}^{3}, p_{2}^{4}}

.

T_{q}

is the query trajectory,

p_{q}^{1} 〈 e 〈 v_{0}, v_{1}, 500 〉, v_{1}, 100 〉

refers to the fact that

p_{q}^{1}

is the first GPS point of trajectory

T_{q}

and is positioned on edge

e 〈 v_{0}, v_{1}, 500 〉

, arriving at vertex

v_{1}

after covering a distance of 100 m. Assume that the moving objects can travel up to 100 m per time unit. The distance between

T_{2}

and

T_{q}

equals 1600 (=100 + 300 + 500 + 700). The distances between

T_{q}

and these three trajectories are

{2000, 1600, 2000}

, respectively. As

k = 1

, the query result is

{T_{2}}

. As shown in Figure 1b, after one time unit is passed, points in

T_{2}

are updated to

{p_{2}^{2}, p_{2}^{3}, p_{2}^{4}, p_{2}^{5}}

, the distances between

T_{q}

and these three trajectories are updated to

{1200, 2300, 2000}

, respectively, and the query results are updated to

{T_{1}}

.

This approach also has significant practical applications. For instance, it can play a crucial role in a real-time anti-tracking system. Specifically, a user u might submit a request to the system to check whether any existing trajectories are tracking u. The system can fulfill this request by identifying the k trajectories that are most similar to the real-time trajectory generated by u. If certain trajectories closely resemble this trajectory, it suggests potential tracking of u, and the system can promptly relay this information to u. Additionally, CKTRN has the potential for extensions into other real-time systems such as online car sharing [13], popular route identification, and more.

Numerous researchers have studied the problem of k-similarity trajectories search [14,15,16,17,18]. However, a major portion of these endeavors have primarily concentrated on addressing k-similarity trajectories search over static or historical trajectory datasets. Sacharidis et al. [13] explored the CKTRN problem within the context of data streams, enabling the retrieval of similar trajectories in real time. Yet, their approach measures similarity by considering the distance between two representative points corresponding to these trajectories (as discussed in Section 2). This method often fails to accurately evaluate trajectory similarity in various scenarios. Another effort proposed by Zhu et al. [19] investigated CKTRN over GPS point-based data streams. However, the similarity between two trajectories is measured based on the Euclidean distance among the corresponding GPS points, which cannot effectively work under a road network. Therefore, an efficient algorithm that could both accurately evaluate similarity among trajectories and support CKTRN in real time is desired.

However, efficiently supporting CKTRN over a road network poses several challenges. Firstly, the scale of each trajectory is typically large. In the context of data streams, where trajectory points are frequently updated, efficiently updating trajectory scores in real time becomes a challenging task. Secondly, the trajectory set scale is also extensive. As the window slides (with each time unit passing), updating scores for all trajectories and efficiently identifying new query result trajectories from a large set of trajectories involve a substantial computational cost. Moreover, maintaining all trajectories in a single server is challenging. Thirdly, in the context of road network environments, the computational overhead involved in calculating the distance between two points introduces its own set of challenges, further escalating the overall computational cost.

In this paper, we introduce a novel framework called RNDLP (Road Network-based Distance Lower-bound-based Prediction) designed to support CKTRN over road networks. This framework is distributed and relies on two key observations. Firstly, considering a road network

G 〈 V, E 〉

, both edges and vertices in E and V do not frequently update. Consequently, we can pre-calculate the shortest distances among all vertices in G and employ these pre-computed values to streamline distance calculations among trajectory points.

Secondly, considering a non-query result trajectory

T_{i}

and the query trajectory

T_{q}

, when we have knowledge of D(

T_{i}

), we can compute the lower-bound and upper-bound distances between

T_{q}

and

T_{i}

. This computation enables us to predict the scores of trajectories in

T

and use these predictions to assess the significance of trajectories within

T

. Essentially, if the lower-bound score of

T_{i}

remains consistently high over numerous time units, it indicates that this trajectory is not likely to become a query result for an extended period. As a result, such a trajectory holds lower importance, and there is no need to closely monitor it over the long term. Conversely, trajectories with fluctuating scores require more frequent monitoring. In essence, only a small subset of trajectories needs score tracking with each passing time unit. In summary, our contributions can be outlined as follows.

Hash-based Distance Calculation. We introduce a hash-based index to manage distances between points within the road network G. Specifically, we pre-compute the distances between vertices using the Floyd algorithm and use a hash table to maintain the distance between any two vertices. In this way, we avoid the high running cost of computing shortest paths on the fly. Instead, the distance between two GPS points over the road network can be computed at O(1) cost.
Pair-based Dynamic Prediction Algorithm. We introduce a novel algorithm called PAIRDP (short for PAIR-based Dynamic Prediction) as an enhancement of the PDSP algorithm discussed in [20]. PAIRDP brings improvements in two main aspects. Firstly, it harnesses the inherent spatiotemporal correlation in GPS points to improve the accuracy of predicting the optimal moment for trajectories to potentially become query results. This correlation contributes to refining the prediction process and achieving more precise results. Secondly, PAIRDP incorporates a dynamic adjustment mechanism for predicting the moments when trajectories could potentially become query results. This adjustment relies on the scores of the query result trajectories. By integrating this dynamic adjustment, the algorithm can significantly reduce the frequency of trajectory access, resulting in improved operational efficiency.
Model-based Partition Algorithm. Utilizing the prediction results, we can establish a cost model to assess the anticipated running cost of each trajectory. Consequently, we introduce a greedy algorithm to partition trajectories into different servers, ensuring that the workload of each server is as evenly distributed as possible. Additionally, we propose an incremental maintenance algorithm to adapt the partition under a data stream.

The remainder of this paper is structured as follows. Section 2 provides an overview of the existing literature in the field and outlines the problem definition. Section 3 introduces our proposed framework. In Section 4, we present the outcomes of our comprehensive experimental evaluation. Lastly, Section 5 offers concluding remarks by summarizing our key findings.

2. Preliminary

In this section, we will first review some important existing results related to k-similarity trajectory search. We will then introduce the problem definition.

2.1. Related Works

In recent years, researchers have focused on addressing the challenge of trajectory similarity search [8,10,21]. The endeavors in this domain can be categorized into two parts: ad hoc k-similarity trajectories queries and continuous k-similarity trajectories queries. Ad hoc k-similarity trajectories queries concentrate on enhancing query result accuracy through the design of similarity functions. For example, Lei Chen et al. [22] devised the EDR (Edit Distance on Real sequence) similarity measure function. Gajanan Gawde [23] leveraged trajectory polygon shapes for similarity comparisons.

The continuous k-similarity trajectories query can be further categorized into historical trajectory data-based and streaming trajectory data-based methods. Historical trajectory data-based approaches, exemplified by Güting et al. [20], utilize spatio-temporal indexes like R-trees to support query processing. They consider each trajectory as a sequence of units, constructing an index structure to facilitate a k-nearest neighbor search. On the other hand, streaming trajectory data-based methodologies, as studied by Sacharidis et al. [13], handle the continuous updating of trajectory data. These methods evaluate the distance between a query object and other mobile objects within a time window based on their maximum or minimum distance over timestamps. However, this approach utilizes only a subset of GPS points to represent trajectory distances, potentially affecting query precision.

Zhu et al. [19] proposed a sketch-based prediction algorithm to support a GPS point-based k-similarity trajectories search. The algorithm is based on the following observation. That is, if the distance between the last points of the query trajectory, denoted as

p_{q}^{n}

, and the corresponding point in a trajectory

T_{i}

is large, the overall distance of

T_{i}

to

T_{q}

, denoted as D(

T_{i}

), will also be large. The authors provide formal bounds for the upper bound (

\bar{b}

(

T_{i}

)) and lower bound (

\underset{̲}{b}

(

T_{i}

)) of

T_{i}

based on this observation. In addition to the score bounds calculation, Zhu et al. propose a structure called Partition-based Distance Sketch (PDS). The PDS is designed to summarize the distance distribution among GPS points in each trajectory

T_{i}

and the query trajectory

T_{q}

. It provides a compact representation of the distances between points within each trajectory, enabling efficient prediction and query result determination. Using the PDS, the algorithm can avoid continuously monitoring every trajectory until their predicted moments (

T_{i} . t

). Consequently, it can save lots of running costs, and the corresponding GPS points generated within the time interval

[T_{i} . s, T_{i} . t - 1]

can be safely deleted, further reducing the storage requirements.

Discussion. In summary, most algorithms are based on historical trajectory databases and cannot support query processing in real time. The algorithm proposed by Sacharidis et al. [13] cannot accurately evaluate similarity among trajectories in many cases. The effort proposed by Zhu et al. [19] measures similarity by the Euclidean distance between GPS points, which does not work effectively under a road network. Therefore, an efficient algorithm that could both accurately evaluate similarity among trajectories and support the k-similarity trajectories search in real time is desired.

2.2. Problem Definition

A time-based [24] window can be formally defined as a data organization and management technique that partitions a continuous stream of data into fixed or variable time intervals. Formally, a time-based window can be represented as a tuple

W (T, Δ)

, where T represents the current timestamp or time reference point.

Δ

denotes the duration or length of the time interval, which defines the size of the window. In this paper, we apply time-based sliding windows to model trajectory data over road network as stated in Definition 1.

Definition 1 (Trajectory under Sliding Window).

A trajectory T of an object o, represented by the tuple

T 〈 o, Δ_{t}, P, G 〉

, contains a group of GPS points

T . P

traversing the road network G, generated within the last

Δ_{t}

time units.

To be more specific, let

T_{i}

be a trajectory within the trajectory set

T

.

T_{i} . P

can be denoted as

p_{i}^{1}, p_{i}^{2}, \dots, p_{i}^{n}

, respectively. With the passage of each time unit, a point is removed from

T . P

and another point is inserted into it. Each GPS point p in

T_{i} . P

is defined as

p 〈 e, v, d 〉

, indicating that p resides on edge e and arrives at vertex v with a distance of d.

The distance between two trajectories, T and

T^{'}

, is calculated using Equation (1). Here,

D (p_{i}, p_{i}^{'})

represents the distance between two GPS points

p_{i}

and

p_{i}^{'}

contained in T and

T^{'}

, respectively. This distance corresponds to the shortest distance between

p_{i}

and

p_{i}^{'}

across the road network G.

p_{i}

(or

p_{i}^{'}

) signifies the i-th generated GPS point in T (or

T^{'}

).

Take an example in Figure 1. Let

T_{2}

be the trajectory of an object

o_{2}

at the time unit

t_{0}

. It contains the set of the last four GPS points

{p_{2}^{1}, p_{2}^{2}, p_{2}^{3}, p_{2}^{4}}

generated by

o_{2}

. After one time unit is passed,

p_{2}^{1}

is removed from

T_{2} . P

, and

p_{2}^{5}

is inserted into

T_{2} . P

. In other words,

T_{2} . P

is updated to

{p_{2}^{2}, p_{2}^{3}, p_{2}^{4}, p_{2}^{5}}

at

t_{1}

. The distance between

p_{2}^{1}

and

p_{q}^{1}

equals 100, and D(

T_{q}, T_{2}

) is 1600 m at

t_{0}

. After one time unit,

p_{2}^{1}

is removed from

T_{2} . P

, and

p_{2}^{5}

is inserted into

T_{2} . P

. D(

T_{2}, T_{q}

) is updated to 2300 m. In the following, we will formally introduce the problem definition.

D (T . P, T^{'} . P) = \sum_{i = s}^{e} D (p_{i}, p_{i}^{'})

(1)
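Equation (1) can be read as a plain sum over aligned point pairs. The following sketch assumes the two point sequences are already aligned by time unit and takes the point-to-point distance as a parameter. The function name `trajectory_distance` and the toy one-dimensional distance in the usage note are illustrative assumptions, not part of the paper.

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")


def trajectory_distance(
    points_a: Sequence[P],
    points_b: Sequence[P],
    point_distance: Callable[[P, P], float],
) -> float:
    """Sum of distances between the i-th points of two trajectories."""
    if len(points_a) != len(points_b):
        raise ValueError("both windows must hold the same number of points")
    # Pair the i-th point of one trajectory with the i-th point of the other.
    return sum(point_distance(a, b) for a, b in zip(points_a, points_b))
```

For example, if the four point-wise distances are 100, 300, 500, and 700, as for the second trajectory in Figure 1a, the sum is 1600. With a toy one-dimensional distance, `trajectory_distance([0, 0, 0, 0], [100, 300, 500, 700], lambda a, b: abs(a - b))` gives that value.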

Definition 2 (CKTRN).

Let

T = {T_{1}, T_{2}, \dots, T_{N}}

represent a set of N trajectories over road network G. We consider a query

q 〈 k, T_{q} 〉

with

T_{q}

being the query trajectory and k being a query parameter. The objective of a CKTRN is to monitor these N trajectories, and return k trajectories with the smallest scores to the system whenever one time unit is passed.

Back to the example in Figure 1. At

t_{0}

, the distances between

T_{q}

and these 3 trajectories are

{2000, 1600, 2000}

, respectively. As

k = 1

, the query result is

{T_{2}}

. After one time unit is passed, the distances between

T_{q}

and these three trajectories are updated to

{1200, 2300, 2000}

. The query result is updated to

{T_{1}}

.
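The result selection in this example amounts to picking the k smallest scores at each time unit. A minimal sketch, assuming the scores have already been computed and are held in a dictionary (the name `top_k` and the string identifiers are illustrative):

```python
import heapq
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the ids of the k trajectories with the smallest scores."""
    # Iterating a dict yields its keys; each key is ranked by its score.
    return heapq.nsmallest(k, scores, key=scores.get)


scores_t0 = {"T1": 2000, "T2": 1600, "T3": 2000}
scores_t1 = {"T1": 1200, "T2": 2300, "T3": 2000}
result_t0 = top_k(scores_t0, 1)  # ["T2"]
result_t1 = top_k(scores_t1, 1)  # ["T1"]
```

Recomputing every score at every time unit is exactly the cost that the framework in Section 3 aims to avoid, so this sketch only fixes the semantics of the query.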

In essence, a CKTRN query identifies the k trajectories with the smallest scores and presents them as the top-k most similar trajectories to the query trajectory

T_{q}

. This process is performed periodically to ensure continuous monitoring and retrieval of the most similar trajectories over time. In this context, the score of each trajectory

T_{i}

, denoted as D(

T_{i}

), is determined by the distance between

T_{i}

and the query trajectory

T_{q}

. A CKTRN query constantly computes and updates these scores based on the evolving positions of the trajectories in real time.

3. The Framework RNDLP

In this section, we introduce a novel framework called RNDLP (Road Network-based Distance Lower-bound-based Prediction) to address CKTRN over road networks. We first provide an overview of the framework. We then detail how distances among GPS points are calculated over road networks. Next, we explain the partition-based initialization algorithm. Finally, we present the incremental maintenance algorithm.

3.1. The Framework Overview

Let

G = 〈 V, E 〉

represent the road network. The RNDLP framework initiates by pre-processing the road network, calculating the shortest paths between any pair of vertices in the vertex set V (Algorithm 1). This pre-processing involves constructing a hash table H to store the shortest path lengths between each vertex pair in V (lines 1–4). The construction process of this hash table will be explained in Section 3.2.

Algorithm 1: The Framework Overview

After construction, the RNDLP framework scans each trajectory and calculates the lower-bound and upper-bound distances between each trajectory

T_{i}

in the trajectory set

T

and the query trajectory

T_{q}

(lines 5–15). More precisely, an initial set R is formed, containing all trajectories. Then, for each trajectory

T_{i} \in T

, if the lower bound

\underset{̲}{b} (T_{i})

is greater than a threshold

θ_{k}

, it signifies that

T_{i}

cannot become a temporary query result trajectory at the current moment. In this scenario,

T_{i}

is moved from the set R to another set

R^{'}

. Here, the threshold

θ_{k}

corresponds to the k-th lowest score among all trajectories in

T

, and the manner of calculating

\underset{̲}{b} (T_{i})

and

\bar{b} (T_{i})

will be explained in the later section.
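The filtering step described above can be sketched as follows. This is a simplified illustration and not Algorithm 1 itself: it assumes that a list of scores is available from which the threshold can be taken, and that a lower bound has been computed for every trajectory. The function names and the numbers in the usage note are hypothetical.

```python
import heapq
from typing import Dict, Iterable, Set, Tuple


def kth_lowest(scores: Iterable[float], k: int) -> float:
    """Threshold theta_k: the k-th lowest score among the given scores."""
    return heapq.nsmallest(k, scores)[-1]


def split_candidates(
    lower_bounds: Dict[str, float], theta_k: float
) -> Tuple[Set[str], Set[str]]:
    """Split trajectories into R (still candidates) and R' (pruned for now)."""
    keep: Set[str] = set()
    pruned: Set[str] = set()
    for tid, lb in lower_bounds.items():
        # A lower bound above theta_k rules the trajectory out at this moment.
        (pruned if lb > theta_k else keep).add(tid)
    return keep, pruned
```

For instance, with k = 1, scores {2000, 1600, 2000}, and hypothetical lower bounds {T1: 1500, T2: 900, T3: 1700}, the threshold is 1600 and only T3 is moved to the pruned set.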

The aforementioned operations are iteratively executed until the number of trajectories in the set R is reduced to k. At this stage, the remaining k trajectories in set R are deemed the query result trajectories. Additionally, the framework predicts the moment when each trajectory has the potential to become a query result trajectory. In this paper, we introduce a pair-based algorithm to calculate

\underset{̲}{b} (T_{i})

and

\bar{b} (T_{i})

for each

T_{i}

, with a detailed explanation provided in Section 3.3. The prediction algorithm will be discussed in Section 3.4.

Intuitively, the earlier the prediction moment of a trajectory, the more crucial it is to the query trajectory, and the higher the running cost it incurs. To ensure the workload (or communication cost) of each server is as evenly distributed as possible, we construct a cost model based on the lower-bound score of each trajectory. We form this model to assess the running cost associated with each trajectory. Utilizing this model, we can allocate trajectories based on their expected running cost, thereby minimizing the workload differences among different servers. Additionally, we present a greedy algorithm to partition trajectories in

T

among a group of servers. The cost model and partition algorithm will be elaborated upon in Section 3.5 (lines 19–20).
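To make the load-balancing goal concrete, the sketch below uses a standard greedy heuristic: sort trajectories by expected cost in descending order and always assign the next one to the currently least-loaded server. This is an assumption chosen for illustration. The text here does not specify the greedy rule, so the authors' algorithm in Section 3.5 may differ, and the cost values in the usage note are made up.

```python
import heapq
from typing import Dict, List


def greedy_partition(costs: Dict[str, float], num_servers: int) -> List[List[str]]:
    """Assign each trajectory to the least-loaded server, largest cost first."""
    # Min-heap of (current load, server index).
    heap = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_servers)]
    ordered = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for tid, cost in ordered:
        load, server = heapq.heappop(heap)
        assignment[server].append(tid)
        heapq.heappush(heap, (load + cost, server))
    return assignment
```

With hypothetical costs {T1: 5, T2: 4, T3: 3, T4: 3, T5: 1} and two servers, both servers end up with a total expected cost of 8.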

Once sets R and the partition are formed, the incremental algorithm is executed. In comparison to the algorithm discussed in previous work, our proposed algorithm can dynamically adjust the threshold value

θ_{k}

. This dynamic adjustment aids in reducing the frequency of score calculations for trajectories in

L_{T}

. In addition, we should dynamically adjust the partition. Comprehensive explanations of the incremental maintenance algorithm will be presented in Section 3.6.

Cost Analysis. As we use a hash table to maintain the shortest paths among all vertices, the scale of the table is bounded by

O (| E^{2} |)

. As we should maintain the shortest path, the size of each path is bounded by

O (V)

. As we should maintain GPS points of each trajectory in the worst case, this part of space cost is bounded by

O (n \cdot N)

, with n and N being the number of GPS points contained in a trajectory and the number of trajectories. Accordingly, the overall space cost is bounded by

O (| V | \cdot | E^{2} | + n \cdot N)

.

We now analyze the running cost of our proposed algorithms. The running cost of calculating the shortest path is bounded by

O (1)

. Moreover, we use a binary search to find the prediction moment. The search time is bounded by

O (log n)

. As the cost of calculating the lower-bound/upper-bound score of a trajectory is

O (1)

, the prediction cost is

O (log n)

.

3.2. Hash-Based Score Calculation Algorithm

As previously mentioned, the distance between two trajectories is calculated as the sum of the shortest path distances between their corresponding GPS points within the road network. When the positions of the GPS points along a trajectory are updated, the trajectory’s distance needs to be recalculated, which involves multiple computations of shortest path lengths.

To mitigate this computational cost, we pre-compute the shortest path distances between all pairs of vertices in the road network and establish a hash table to store these pre-computed values. This approach offers the advantage that, once the index is constructed, the distance between two points can be efficiently computed by directly accessing the hash table-based index. In the following sections, we provide a detailed explanation of how this index is constructed and how the distance between two GPS points is computed.

Hash Table Construction. Given a road network

G 〈 V, E 〉

, our first task is to compute the shortest paths between all pairs of vertices in V. Once these calculations are complete, we generate a set of pairs

P {p (1, 2), p (1, 3), \dots, p (m - 1, m)}

, where each pair

p (i, j)

consists of two vertices, namely

v_{i}

and

v_{j}

in V. Its value equals the distance between

v_{i}

and

v_{j}

within the road network G.

After forming these pairs, the subsequent step involves the construction of the hash table H. Specifically, we initiate an empty hash table H with the bucket size being

{| V |}^{2}

. Subsequently, we map each pair

p (i, j)

into H, using the key computed by Equation (2):

I D (i, j) = i \times | V | + j

(2)
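The construction can be sketched as follows, assuming vertices are numbered 0 to |V| - 1 and edges are undirected (both are assumptions made for this illustration). A Python dictionary stands in for the hash table H, and the helper names `pair_key` and `build_distance_table` are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple


def pair_key(i: int, j: int, num_vertices: int) -> int:
    """Key of the vertex pair (i, j), following Equation (2)."""
    return i * num_vertices + j


def build_distance_table(
    num_vertices: int, edges: Iterable[Tuple[int, int, float]]
) -> Dict[int, float]:
    """All-pairs shortest distances (Floyd-Warshall), stored by pair key."""
    n = num_vertices
    dist = [[math.inf] * n for _ in range(n)]
    for v in range(n):
        dist[v][v] = 0.0
    for s, e, w in edges:
        # Assumption: every edge can be traversed in both directions.
        dist[s][e] = min(dist[s][e], w)
        dist[e][s] = min(dist[e][s], w)
    for m in range(n):          # intermediate vertex
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return {pair_key(i, j, n): dist[i][j] for i in range(n) for j in range(n)}
```

Because the key ranges over 0 to |V|^2 - 1 without collisions, a flat array of that size would serve equally well. The pre-computation takes O(|V|^3) time, which is paid once since the road network rarely changes.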

Hash-based Distance Calculation. Given two GPS points

p (e, v, d)

and

p^{'} (e, v, d)

, the process of calculating the distance between them across the road network G involves the following steps. Initially, we determine the key of the corresponding pair

P (p, p^{'})

, which corresponds to

p (v)

and

p^{'} (v)

, by employing Equation (2). Subsequently, we access the hash table H to retrieve the pair

P (p, p^{'})

, which allows us to acquire the distance between

p (v)

and

p^{'} (v)

. Finally, the computation of the distance between p and

p^{'}

is calculated based on Equation (3).

D (p, p^{'}) = D (p (v), p^{'} (v)) + p (d) + p^{'} (d)

(3)
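A direct transcription of Equation (3) into code might look like the sketch below. It assumes each point is given as a tuple (e, v, d) and that the table is keyed as in Equation (2). The function name is illustrative, and special cases such as two points lying on the same edge are not handled because the text does not discuss them.

```python
from typing import Dict, Tuple

# A point is (edge, vertex, offset); the edge itself is not needed here.
GpsPoint = Tuple[object, int, float]


def point_distance(
    p: GpsPoint, q: GpsPoint, table: Dict[int, float], num_vertices: int
) -> float:
    """Network distance between two points via their associated vertices."""
    _, p_vertex, p_offset = p
    _, q_vertex, q_offset = q
    key = p_vertex * num_vertices + q_vertex   # Equation (2)
    # One table lookup plus the two offsets to the vertices, Equation (3).
    return table[key] + p_offset + q_offset
```

For two points associated with vertices 0 and 4 in a 7-vertex network, each at offset 100 m, and a stored vertex distance of 800 m, the function returns 1000 m.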

Back to the example in Figure 1. We map a set of pairs

P {p (0, 1), p (0, 2), \dots, p (0, 6)}

into H. The key results calculated by Equation (2) are shown in Figure 2. Given two GPS points

p_{1}^{4} (e 〈 v_{0}, v_{2}, 500 〉, v_{0}, 100)

and

p_{2}^{4} (e 〈 v_{1}, v_{4}, 400 〉, v_{4}, 100)

, the key of the pair

P (p_{1}^{4}, p_{2}^{4})

equals

I D (0, 4) = 0 \cdot 7 + 4 = 4

. We access the key to find the value from hash table H, i.e., we can obtain that the distance between

p_{1}^{4} (v_{0})

and

p_{2}^{4} (v_{4})

is 800 m. Finally, by Equation (3), the distance between

p_{1}^{4}

and

p_{2}^{4}

equals 800 + 100 + 100 = 1000 m.

3.3. The Pair-Based Lower-Bound Score Calculation

Our approach is built on a key insight. When calculating the lower-bound score of a trajectory, we can achieve more accurate results by considering both the start and end points of the trajectory, as highlighted in Lemma 1 and Lemma 2. For simplicity,

p_{i} (1, n)

refers to the pair constructed by

p_{i}^{1}, p_{i}^{n} \in T_{i} . P

, and

d_{j} (1 \leq j \leq n)

refers to the distance between

p_{i} (j)

and

p_{q} (j)

.

Intuitively, if we consider only one point of a trajectory when calculating its lower-bound score, the gap between the lower-bound score and the real score may be very large, especially when the trajectory is long. In contrast, when we calculate the lower-bound score of a trajectory

T_{i}

by considering both its start and end points, the start and end points of the virtual trajectory coincide with those of

T_{i}

, so the spatio-temporal constraint lets us generate a more reasonable virtual trajectory. Accordingly, we can tighten the lower-bound score of

T_{i}

. Since the cost model we form is based on the lower-bound score of each

T_{i}

, the pair-based lower-bound score calculation makes the model more workable.

Lemma 1.

Let

T_{q}

be the query trajectory, and

T_{i} \in T

be a trajectory. When

d_{1} \geq 2 t_{x} U_{m a x}

, the lower-bound score of

T_{i}

, denoted as

\underset{̲}{b}

(

T_{i}

), equals

(n - 1) (d_{1} - t_{x} U_{m a x}) + d_{n}

.

Proof.

We prove it via forming two virtual trajectories

V T_{i}

and

V T_{q}

based on

T_{i}

and

T_{q}

. These virtual trajectories simulate the movement of objects. Initially,

V T_{i}

and

V T_{q}

are located at

p_{i}^{1}

and

p_{q}^{1}

, respectively. After

t_{x}

time units, they move towards each other and reach

p_{i}^{n}

and

p_{q}^{n}

, respectively, by

T_{i} . e

. Along their paths, the distance between

v p_{i} (j)

and

v p_{q} (j)

is

d_{1} - 2 (j - 1) U_{m a x}

when

j \leq t_{x} + 1

, and

\frac{2 t_{x} U_{m a x} (j - n)}{n - t_{x} - 1} + d_{1}

when

j > t_{x} + 1

. Accordingly, the distance sum D(

V T_{i}

) is

(n - 1) (d_{1} - t_{x} U_{m a x}) + d_{n}

. □

Lemma 2.

Let

T_{q}

be the query trajectory, and

T_{i} \in T

be a trajectory, when

d_{1} \leq 2 t_{x} U_{m a x}

,

\underset{̲}{b}

(

T_{i}

) equals

\frac{d_{1}}{2} (\frac{d_{1}}{2 U_{m a x}} + n - t_{x} - 1) + d_{n}

.

Proof.

We prove it via forming another two virtual trajectories

V T_{i}^{'}

and

V T_{q}^{'}

based on

T_{i}

and

T_{q}

. Initially,

V T_{i}^{'}

and

V T_{q}^{'}

are located at

p_{i}^{1}

and

p_{q}^{1}

, respectively. After

\frac{d_{1}}{2 U_{m a x}}

time units, they move towards each other and reach the same position

v p

. They stay at

v p

for

t_{x}^{'}

time units and then move to

p_{i}^{n}

and

p_{q}^{n}

, with

t_{x}^{'}

being

t_{x} - \frac{d_{1}}{2 U_{m a x}}

. Under these two paths, the distance between

v p_{i} (j)

and

v p_{q} (j)

equals

d_{1} - 2 (j - 1) U_{m a x}

when

j \leq \frac{d_{1}}{2 U_{m a x}} + 1

. Accordingly, D(

V T_{i}

) equals

\frac{d_{1}}{2} (\frac{d_{1}}{2 U_{m a x}} + n - t_{x} - 1) + d_{n}

. □
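The closed forms of Lemma 1 and Lemma 2 can be combined into a single function, as in the sketch below. The function name and the sample values in the usage note are hypothetical. The code only evaluates the two stated formulas and does not re-derive them.

```python
def pair_lower_bound(d1: float, dn: float, n: int, t_x: float, u_max: float) -> float:
    """Lower-bound score from the first and last point distances.

    d1, dn: distances between the first and the last corresponding points
    n:      number of points in the window
    t_x:    number of time units the virtual objects move towards each other
    u_max:  maximum distance an object can travel per time unit
    """
    if d1 >= 2 * t_x * u_max:
        # Lemma 1: the two virtual objects cannot meet within t_x time units.
        return (n - 1) * (d1 - t_x * u_max) + dn
    # Lemma 2: they meet after d1 / (2 * u_max) time units.
    return (d1 / 2) * (d1 / (2 * u_max) + n - t_x - 1) + dn
```

With n = 8, t_x = 3, U_max = 100, and d_n = 200, a first-point distance of 1000 gives 7 · 700 + 200 = 5100 (Lemma 1), while a first-point distance of 400 gives 200 · 6 + 200 = 1400 (Lemma 2). At the boundary d_1 = 2 t_x U_max = 600, both expressions evaluate to 2300, so the two cases agree where they overlap.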

It is important to choose a suitable

t_{x}

. Take the example in Figure 3a. If

t_{x}

is set to 1, the object

o_{2}

arrives at

v p_{2}^{2}

using 1 time unit, and then arrives at

p_{2}^{8}

. The distance between

v p_{2}^{2}

and

p_{2}^{8}

equals

\sqrt{(100^{2} + 400^{2})}

, which is smaller than

(8 - 1) \cdot 100 = 700 m

. It can arrive at

p_{2}^{8}

before

T_{2} . e

. Thus, if

t_{x}

is set to 1, the lower-bound score is loose. If

t_{x}

is set to 2, the object

o_{2}

arrives at

v p_{2}^{3}

using 2 time units, and then arrives at

p_{2}^{8}

. In addition, D(

v p_{2}^{3}, v p_{q}^{3}

) = 2.8 km, D(

v p_{2}^{3}, p_{2}^{8}

)=

\sqrt{(200^{2} + 400^{2})}

, which is smaller than

(8 - 2) \cdot 100 = 600 m

. It can arrive at

p_{2}^{8}

at

T_{2} . e

. If

t_{x}

is set to 3, the object

o_{2}

arrives at

v p_{2}^{4}

via 3 time units, and then arrives at

p_{2}^{8}

. D(

v p_{2}^{4}, v p_{q}^{4}

) = 2.6 km, and D(

v p_{2}^{4}, p_{2}^{8}

)=

\sqrt{(300^{2} + 400^{2})}

, which equals

(8 - 3) \cdot 100 = 500 m

. Thus, it can arrive at

p_{2}^{8}

at

T_{2} . e

. Therefore,

t_{x}

is set to 3. Note that, in the implementation, a binary search can be used to find the maximal

t_{x}

. As this is straightforward, we omit the details to save space.
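The binary search mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `max_feasible_tx`, the callback `remaining_distance`, and the feasibility test (the distance still to be covered after t_x steps must not exceed (n − t_x)·U_max, as in the Figure 3a example) are our assumptions. It also assumes feasibility is monotone in t_x and that t_x = 0 is feasible.

```python
import math
from typing import Callable


def max_feasible_tx(n: int, u_max: float,
                    remaining_distance: Callable[[int], float]) -> int:
    """Largest t_x in [0, n - 1] such that the object can still reach its
    end point: remaining_distance(t_x) <= (n - t_x) * u_max.

    Assumes that once some t_x is infeasible, every larger t_x is too.
    """
    lo, hi, best = 0, n - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        # Small tolerance so that exact equality survives rounding.
        if remaining_distance(mid) <= (n - mid) * u_max + 1e-9:
            best = mid      # feasible: try to move further
            lo = mid + 1
        else:
            hi = mid - 1    # infeasible: back off
    return best


def fig3a_remaining(t: int) -> float:
    """Distance from the virtual position after t steps to p_2^8,
    using the numbers quoted for Figure 3a (100 m per step, 400 m offset)."""
    return math.hypot(100.0 * t, 400.0)


tx = max_feasible_tx(n=8, u_max=100.0, remaining_distance=fig3a_remaining)
print(tx)
```

With the Figure 3a numbers the search stops at the same value as the step-by-step discussion above, because the remaining 500 m exactly matches the remaining travel budget.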

We also find that if the partition is applied, the corresponding lower-/upper-bound score could be further tightened. Accordingly, we propose a partition-based method to tighten the lower/upper-bound score of the trajectories. Formally, for each trajectory contained in the trajectory set

T

, we partition it into a group of sub-trajectories

\{T_{i} (1, m_{i}), T_{i} (2, m_{i}), \dots, T_{i} (n_{i}, m_{i})\}

. Here,

T_{i} (j, m_{i})

refers to the j-th sub-trajectory of

T_{i}

, and its scale equals

m_{i}

.

T_{i} (j, m_{i})

also contains a set of GPS points, which are the

\{(j - 1) \cdot m_{i} + 1, (j - 1) \cdot m_{i} + 2, \dots, j \cdot m_{i}\}

-th generated GPS points in

T_{i}

.

n_{i}

refers to the number of sub-trajectories. Based on the partition result, we can update

\underset{̲}{b}

(

T_{i}

) and

\bar{b}

(

T_{i}

) to

\sum_{j = 1}^{j = n_{i}} \underset{̲}{b} (T_{i} (j, m_{i}))

and

\sum_{j = 1}^{j = n_{i}} \bar{b} (T_{i} (j, m_{i}))

, respectively. In addition,

\underset{̲}{b}

(

T_{i} (j, m_{i})

) and

\bar{b}

(

T_{i} (j, m_{i})

) could be calculated based on Theorem 2.
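The index bookkeeping of this partition can be sketched as below. The helper names `sub_trajectory_ranges` and `partitioned_bounds`, and the callback `piece_bounds` that stands in for the per-piece bound computation, are hypothetical; the sketch also assumes that m_i divides the number of GPS points evenly.

```python
from typing import Callable, List, Tuple


def sub_trajectory_ranges(num_points: int, m: int) -> List[Tuple[int, int]]:
    """1-based inclusive point ranges ((j-1)*m + 1, j*m) for j = 1..n_i."""
    if m <= 0 or num_points % m != 0:
        raise ValueError("this sketch assumes m divides the number of points")
    return [((j - 1) * m + 1, j * m) for j in range(1, num_points // m + 1)]


def partitioned_bounds(num_points: int, m: int,
                       piece_bounds: Callable[[int, int], Tuple[float, float]]
                       ) -> Tuple[float, float]:
    """Sum the (lower, upper) bounds of all sub-trajectories."""
    lower = upper = 0.0
    for first, last in sub_trajectory_ranges(num_points, m):
        lo, hi = piece_bounds(first, last)  # bounds of one sub-trajectory
        lower += lo
        upper += hi
    return lower, upper


print(sub_trajectory_ranges(8, 4))
```

The per-piece bounds themselves would come from the lemmas above; here they are left to the caller so that the sketch stays independent of the distance function.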

Back to the example in Figure 3a. If the partition amount is 1,

\underset{̲}{b}

(

T_{2}

) is calculated as the sum of the real location distance D(

p_{2}^{1}, p_{q}^{1}

), D(

p_{2}^{8}, p_{q}^{8}

) and the virtual location distance

\sum_{i = 2}^{i = 7} D (v p_{2}^{i}, v p_{q}^{i})

. Because

t_{x}

equals 3, D(

v p_{2}^{2}, v p_{q}^{2}

) = 3200 − 1·2·100 = 3000 m = 3 km, while D(

v p_{2}^{3}, v p_{q}^{3}

) and D(

v p_{2}^{4}, v p_{q}^{4}

) equal 2.8 km and 2.6 km, respectively. Then, D(

v p_{2}^{4}, p_{2}^{8}

) =

\sqrt{(300^{2} + 400^{2})} = 500 m

, which is divided into 4 parts of 125 m each. So, D(

v p_{2}^{5}, v p_{q}^{5}

) is calculated as

2 \cdot \frac{300 \cdot 125}{500} + 2600 = 2750 m

, D(

v p_{2}^{6}, v p_{q}^{6}

) = 2900 m (=

2 \cdot \frac{300 \cdot 125 \cdot 2}{500} + 2600

), and D(

v p_{2}^{7}, v p_{q}^{7}

) = 3050 m (=

2 \cdot \frac{300 \cdot 125 \cdot 3}{500} + 2600

). To sum up,

\underset{̲}{b}

(

T_{2}

) is 23.5 km and

\bar{b}

(

T_{2}

) is 27.7 km.

\underset{̲}{b}

(

T_{2}

), and

\bar{b}

(

T_{2}

) are updated to 25 km and 26.2 km when the partition amount is 2.
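The arithmetic of the single-partition case can be reproduced with a few lines. The sketch below evaluates the piecewise distance from the proof of Lemma 1 with d_1 = 3200 m, U_max = 100 m per time unit, t_x = 3 and n = 8. The function name is ours, and the end-point distance of 3200 m is an assumption chosen so that the total matches the 23.5 km reported above; the text does not state it directly.

```python
from typing import List


def virtual_pair_distances(d1: float, u_max: float, t_x: int, n: int) -> List[float]:
    """Distances D(vp_i(j), vp_q(j)) for j = 1..n-1, following the piecewise
    expression in the proof of Lemma 1."""
    out = []
    for j in range(1, n):
        if j <= t_x + 1:
            # Approaching phase: the gap shrinks by 2 * u_max per time unit.
            out.append(d1 - 2 * (j - 1) * u_max)
        else:
            # Separating phase: linear interpolation back towards d1.
            out.append(d1 + 2 * t_x * u_max * (j - n) / (n - t_x - 1))
    return out


dists = virtual_pair_distances(d1=3200, u_max=100, t_x=3, n=8)
d_n = 3200  # assumed end-point distance D(p_2^8, p_q^8), in metres
lower_bound_m = sum(dists) + d_n
print(dists, lower_bound_m)
```

The intermediate values 3000, 2800, 2600, 2750, 2900 and 3050 m agree with the step-by-step figures in the example, and the partial sum over j = 1..7 equals (n − 1)(d_1 − t_x·U_max) = 20,300 m.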

Discussion. Pair-based partitioning provides trajectories with tighter lower-bound scores. However, a natural question is how to form proper partitions for different trajectories. Intuitively, if the partition amount is large, the running cost of maintaining the partition is high, but the lower-/upper-bound scores are tight. In contrast, if the partition amount is small, the maintenance cost is low, but the bounds are loose. It is therefore important to find a flexible method that self-adaptively partitions trajectories based on their distance relationship to the query trajectory.

3.4. The Pair-Based Prediction Algorithm

In this section, we first explain the pair-based partition. Then, we explain the prediction algorithm.

The Pair-based Partition Algorithm. Its goal is to find the query result trajectories by recursively partitioning trajectories and tightening their lower-/upper-bound scores. In this way, we can find the query result trajectories as well as calculate the lower-bound scores of the other trajectories. Intuitively, once the lower-bound scores of trajectories are computed, we can use the cost model (discussed later) to evaluate the running cost we spend on each of them, which helps us assign trajectories effectively.

In this section, we use Theorem 2 for calculating the lower-bound score of sub-trajectories. Intuitively, if more pairs are considered for lower-bound score calculation, we could obtain a tighter lower-bound score. Accordingly, we propose the concept of

τ - L -

score. Here, we assume that each trajectory contains a set of n GPS points, with n being

2^{τ}

and

τ

being an integer. In addition, the lower-bound score of each trajectory, i.e., denoted as

\underset{̲}{b}

(

T_{i}, τ

), is computed based on Equation (4), with r being

\frac{n}{2^{τ}}

.

\underset{̲}{b} (T_{i}, τ) = \sum_{j = 1}^{j = 2^{τ}} \underset{̲}{b} (T_{i}^{r (j - 1) + 1, r j}, r (j - 1) + 1, r j)

(4)

We now formally explain the partition algorithm. Firstly, we access each

T_{i} \in T

, and compute

\underset{̲}{b}

(T_{i})

and

\bar{b}

(T_{i})

in the manner discussed in Theorem 2. After calculating the lower-/upper-bound scores of all trajectories, we search for

T_{k}

, i.e., the trajectory with the

k

-th lowest upper-bound score, and use

\bar{b}

(

T_{k}

) for pruning trajectories in

T

. In other words, for each trajectory

T_{i} \in T

, if its lower-bound score is larger than

\bar{b}

(

T_{k}

), it cannot be a query result trajectory, and we move it to the set

T^{'}

.

For the remaining trajectories, we split each element in

T

into two sub-trajectories of equal scale, and update its lower-bound score based on Equation (4). Again, for each of them, if there exist k trajectories whose upper-bound scores are lower than its lower-bound score, it is moved from

T

to the set

T^{'}

. From then on, we repeat the above operations until the number of trajectories in

T

reduces to k. At that moment, we can use these k trajectories as query result trajectories in the current window. Note, during the partition, we can form the corresponding PDSP structure for each trajectory.
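The prune-and-refine loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `bounds_at(i, tau)` is a hypothetical callback that returns the (lower, upper) score bounds of trajectory i when it is split into 2^tau pieces, and `max_tau` is an assumed cap on the refinement depth. If more than k candidates survive at the cap, their exact scores would still have to be computed.

```python
from typing import Callable, Iterable, List, Tuple

Bounds = Tuple[float, float]


def prune_and_refine(ids: Iterable[int], k: int, max_tau: int,
                     bounds_at: Callable[[int, int], Bounds]
                     ) -> Tuple[List[int], List[int]]:
    """Return (candidates, pruned) after iteratively tightening bounds."""
    candidates = list(ids)
    pruned: List[int] = []
    for tau in range(max_tau + 1):
        if len(candidates) <= k:
            break
        b = {i: bounds_at(i, tau) for i in candidates}
        # k-th lowest upper bound acts as the pruning threshold.
        kth_upper = sorted(hi for _, hi in b.values())[k - 1]
        pruned.extend(i for i in candidates if b[i][0] > kth_upper)
        candidates = [i for i in candidates if b[i][0] <= kth_upper]
    return candidates, pruned


# Toy data: exact scores with a slack that halves at every refinement level.
scores = {1: 10.0, 2: 20.0, 3: 30.0, 4: 40.0}


def toy_bounds(i: int, tau: int) -> Bounds:
    slack = 16.0 / (2 ** tau)
    return scores[i] - slack, scores[i] + slack


result, removed = prune_and_refine(scores, k=2, max_tau=3, bounds_at=toy_bounds)
print(result, removed)
```

In the toy run no trajectory can be pruned at the coarsest level, one is pruned after the first split and another after the second, which mirrors how finer partitions tighten the bounds.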

The PDSP-based Prediction. After we form the PDSP for each trajectory, we predict the earliest moment, denoted as

T_{i} . t

, at which each trajectory

T_{i}

may become a query result trajectory. In this way, we need not monitor

T_{i}

before

T_{i} . t

. Specifically, we scan each trajectory

T_{i}

and calculate

T_{i} . t

based on Theorem 1. After the calculation, we use a priority queue to maintain these trajectories based on

T_{i} . t

of each

T_{i}

. As the algorithm is simple, we skip the details to save space.
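A priority queue keyed by the prediction moment can be sketched with a binary heap. The class name `PredictionQueue` and its methods are hypothetical; the sketch assumes integer time units and only shows how trajectories that are due at the current time are retrieved.

```python
import heapq
from typing import List, Tuple


class PredictionQueue:
    """Min-heap of (prediction moment, trajectory id)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int]] = []

    def push(self, traj_id: int, predicted_t: int) -> None:
        heapq.heappush(self._heap, (predicted_t, traj_id))

    def pop_due(self, now: int) -> List[int]:
        """Remove and return all trajectories whose moment is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due


q = PredictionQueue()
q.push(traj_id=7, predicted_t=5)
q.push(traj_id=3, predicted_t=2)
q.push(traj_id=9, predicted_t=9)
print(q.pop_due(now=5))
```

Trajectories whose prediction moment lies in the future stay in the heap untouched, which is what allows them to be skipped until then.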

Theorem 1.

Given trajectories

T_{i}, T_{k}

, and

T_{q}

, D(

T_{i}

) is no smaller than D(

T_{k}

) after δ time units, with δ being computed based on the inequality

\underset{̲}{b}

(

T_{i}, τ

) ≤ D(

T_{k}

).

Proof.

We assume that

T_{i}

is partitioned into

m (= 2^{τ})

sub-trajectories

{T_{i} (1, r), T_{i} (r + 1, 2 r), \dots, T_{i} ((m - 1) r + 1, m r)}

. Theorem 1 is proved under the following two cases: (i)

δ > n

; (ii)

δ \leq n

. Under case (i), we use the manner discussed in Theorem 1 to find the maximal

δ

. Under case (ii),

\underset{̲}{b}

(

T_{i}, τ

) equals

\underset{̲}{b}

(

T_{i}^{δ, j r}, δ, j r

)+

\underset{̲}{b}

(

T_{i} (n + 1, n + δ), p_{i}^{n}

)+

\sum_{u = j + 1}^{u = m}

\underset{̲}{b}

(

T_{i} (u r + 1, (u + 1) r)

), while

\bar{b}

(

T_{k}

) equals D(

T_{k} (n - δ, n)

)+

\bar{b}

(

T_{k} (n + 1, n + δ)

). We could find the maximal

δ

via the inequality

\underset{̲}{b}

(

T_{i}, τ) \leq D (T_{k}

). □

We want to highlight that, once

T_{i} . t

of each

T_{i}

is calculated, we can ensure that

T_{i}

cannot become a query result trajectory before

T_{i} . t

. In addition, we can evaluate the running cost we will spend on it. It is convenient to evaluate its importance to the query result sets and make a reasonable partition based on the prediction result.

3.5. Trajectory Set Partition Algorithm

As the number of trajectories is usually large, it is difficult to process them on a single server. Thus, in this section, we study how to form a reasonable partition that distributes trajectories across a group of servers, making sure that the workloads of the servers are as similar as possible. To achieve this goal, we first form a model to evaluate the running cost of each trajectory.

Specifically, let

T_{i}

be a trajectory in the trajectory set

T

,

T_{i} (C)

be the number of points we access when evaluating its prediction moment, and

T_{i} (t)

be the prediction moment of

T_{i}

. After n time units have passed, the expected number of times we access

T_{i}

is

\frac{n}{W (t) - T_{i} (t)}

. Accordingly, the expected running cost of maintaining

T_{i}

can be calculated via Equation (5). In addition, the expected running cost of maintaining all trajectories can be computed based on Equation (6).

\mathrm{MODEL} = \frac{n}{W (t) - T_{i} (t)} \cdot T_{i} (C)

(5)

\mathrm{MOD\text{-}ALL} = \sum_{i = 1}^{i = | T |} \frac{n}{W (t) - T_{i} (t)} \cdot T_{i} (C)

(6)
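Equations (5) and (6) translate directly into code. In the sketch below the function and parameter names are ours; `window_t` stands for W(t) exactly as it appears in the equations, and the sketch additionally assumes that W(t) − T_i(t) is positive, which the equations need in order to yield a meaningful cost.

```python
from typing import Iterable, Tuple


def expected_cost(n: float, window_t: float, predicted_t: float,
                  points_accessed: float) -> float:
    """Equation (5): n / (W(t) - T_i(t)) * T_i(C)."""
    gap = window_t - predicted_t
    if gap <= 0:
        raise ValueError("this sketch assumes W(t) - T_i(t) > 0")
    return n / gap * points_accessed


def total_expected_cost(n: float, window_t: float,
                        trajectories: Iterable[Tuple[float, float]]) -> float:
    """Equation (6): sum of Equation (5) over (T_i(t), T_i(C)) pairs."""
    return sum(expected_cost(n, window_t, t, c) for t, c in trajectories)


print(expected_cost(n=100, window_t=50, predicted_t=30, points_accessed=4))
```

A trajectory that is re-evaluated often (small gap) or that touches many points per evaluation receives a proportionally larger cost.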

After explaining the cost model, we now formally explain the partition algorithm. Our goal is to make the workloads of the servers as equal as possible.

Let

S

be the set of servers. For each server

S_{i} \in S

, we use

S_{i} (C)

to record the current workload of

S_{i}

, and servers in

S

are sorted in ascending order by their workload. We use a greedy algorithm to form the partition. Specifically, we scan each trajectory

T_{i} \in T

, compute its expected running cost via Equation (5), allocate it to the server

S_{i}

with minimal workload, and finally update

S_{i} (C)

.
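The greedy allocation can be sketched with a min-heap of server workloads. The function name `greedy_assign` and the data layout (a mapping from trajectory id to expected cost, servers numbered from 0) are assumptions made for illustration; trajectories are scanned in the order given, as in the description above.

```python
import heapq
from typing import Dict, List, Tuple


def greedy_assign(costs: Dict[int, float],
                  num_servers: int) -> Tuple[Dict[int, int], List[float]]:
    """Assign each trajectory to the currently least-loaded server.

    Returns (trajectory id -> server index, final workload per server).
    """
    heap = [(0.0, s) for s in range(num_servers)]  # (workload, server)
    heapq.heapify(heap)
    assignment: Dict[int, int] = {}
    for traj_id, cost in costs.items():
        load, server = heapq.heappop(heap)       # server with minimal workload
        assignment[traj_id] = server
        heapq.heappush(heap, (load + cost, server))
    loads = [0.0] * num_servers
    for load, server in heap:
        loads[server] = load
    return assignment, loads


assignment, loads = greedy_assign({1: 5.0, 2: 3.0, 3: 2.0, 4: 4.0}, num_servers=2)
print(assignment, loads)
```

Keeping the servers in a heap replaces the explicit re-sorting step: the least-loaded server is always at the top after each update.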

3.6. The Incremental Maintenance Algorithms

In this section, we first discuss the incremental algorithm over each server, i.e., the local incremental maintenance algorithm. Its function is to update the prediction moment of trajectories in each server. Next, we explain the global incremental maintenance algorithm. Its function is to guarantee the workload of each server is as similar as possible.

3.6.1. The Local Incremental Maintenance Algorithm

In the following, we propose the algorithm INC-PDSP, which explains how PDSP is applied for supporting the data stream. Here,

I_{c}

and

I_{q}

are two inverted lists that maintain non-query result trajectories and query result trajectories, respectively. Our algorithm is proposed based on the following observation. That is, the prediction moment of a trajectory is computed based on the assumption that trajectories in

I_{q}

move far away from

T_{q}

. This is actually not true in real applications. A moment's thought reveals that if

θ_{k}

does not increase rapidly, we can delay updating the prediction moments of trajectories. Here,

θ_{k}

refers to the k-th lowest score among all trajectories in the query result set.

η = \frac{U_{m a x} + θ_{k}^{0} - D (p_{k}^{1}, p_{q}^{1}) - θ_{k}}{U_{m a x}}

(7)

We now formally provide the algorithm details. We associate

I_{c}

with a variable named

I_{c} . g

with

I_{c}

being the inverted list that maintains all non-query result trajectories. It records the sum of time units we can delay. Its value is set to 0 at the moment

I_{c}

is constructed. Whenever one time unit is passed, we update

I_{c} . g

to

I_{c} . g + η

, where

η

is computed based on Equation (7). Here,

θ_{k}^{0}

refers to the k-th highest score among elements in

I_{q}

at the last time unit.

U_{m a x} + θ_{k}^{0} -

D(

p_{k}^{1}, p_{q}^{1}

) refers to the predicted score of

T_{k}

at the current time unit.

U_{m a x} + θ_{k}^{0} -

D(

p_{k}^{1}, p_{q}^{1}

)−

θ_{k}

refers to the difference between the predicted

θ_{k}

and real

θ_{k}

. In addition, we associate each element

T_{i} \in I_{c}

with a variable named

T_{i} . l

. Its value equals

I_{c} . g

at the moment

T_{i}

’s prediction moment is updated. Whenever one time unit has passed, we update

I_{c} . g

based on Equation (7). Next, we check whether the element

T_{i}

located at the top of

I_{c}

satisfies

T_{i} . t \leq t_{n o w} - I_{c} . g + T_{i} . l

. If the answer is yes, we update its prediction moment. Lastly, we set

T_{i} . l

to

I_{c} . g

.
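The delayed-update rule can be sketched as follows. The class `LazyList` and its method names are hypothetical, the non-query list is modelled as a heap ordered by prediction moment, and recomputing a prediction moment is left to the caller. Only Equation (7) and the test T_i.t ≤ t_now − I_c.g + T_i.l are taken from the description above.

```python
import heapq
from typing import Dict, List, Tuple


def eta(u_max: float, theta_k_prev: float, d_first: float,
        theta_k_now: float) -> float:
    """Equation (7): slack gained in the current time unit."""
    return (u_max + theta_k_prev - d_first - theta_k_now) / u_max


class LazyList:
    """Sketch of I_c: trajectories ordered by prediction moment, plus the
    accumulated slack g and the per-trajectory snapshot l."""

    def __init__(self) -> None:
        self.g = 0.0
        self._heap: List[Tuple[float, int]] = []
        self._l: Dict[int, float] = {}

    def add(self, traj_id: int, predicted_t: float) -> None:
        self._l[traj_id] = self.g  # T_i.l = I_c.g at (re-)prediction time
        heapq.heappush(self._heap, (predicted_t, traj_id))

    def tick(self, t_now: float, eta_value: float) -> List[int]:
        """Advance one time unit; return trajectories that are now due."""
        self.g += eta_value
        due = []
        while self._heap:
            t, i = self._heap[0]
            if t <= t_now - self.g + self._l[i]:
                heapq.heappop(self._heap)
                due.append(i)  # caller re-predicts and calls add() again
            else:
                break
        return due


lst = LazyList()
lst.add(1, predicted_t=5)
lst.add(2, predicted_t=8)
first = lst.tick(t_now=5, eta_value=eta(100, 500, 50, 450))  # eta = 1.0
second = lst.tick(t_now=6, eta_value=0.0)
print(first, second)
```

In the usage lines, one unit of slack accumulated at time 5 postpones the update of trajectory 1 from time 5 to time 6.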

3.6.2. The Global Incremental Maintenance Algorithm

We should monitor the workload of each server so as to guarantee the balance of the system. Thus, we should update

S_{i} (C)

of each server whenever one trajectory T in

S_{i}

is processed. Specifically, let T be the trajectory to be evaluated. We first update its prediction moment. Next, we recalculate its expected running cost and update

S_{i} (C)

based on the recalculated result.

We monitor the workload of each server because trajectories should be re-allocated if the workload difference among servers becomes too large. In our paper, if

S_{m a x} > 2 S_{m i n}

, we re-allocate trajectories between the servers

S_{m a x}

and

S_{m i n}

. Here,

S_{m a x}

and

S_{m i n}

refer to the servers with the maximal and minimal workloads in

S

. When re-allocating, we scan trajectories in

S_{m a x}

and move each scanned trajectory from

S_{m a x}

to

S_{m i n}

. When

S_{m a x}

is reduced to less than

S_{m i n}

, the algorithm is terminated.
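A literal reading of this re-allocation rule can be sketched as below. The function name `rebalance` and the data layout (server name mapped to a dictionary of trajectory id and expected cost) are assumptions. The stopping test compares the two workloads as they are updated during the scan, which is one possible interpretation of the termination condition.

```python
from typing import Dict


def rebalance(servers: Dict[str, Dict[int, float]]) -> bool:
    """Move trajectories from the most to the least loaded server when
    S_max > 2 * S_min. Returns True if any trajectory was moved."""
    load = {s: sum(c.values()) for s, c in servers.items()}
    s_max = max(load, key=load.get)
    s_min = min(load, key=load.get)
    if load[s_max] <= 2 * load[s_min]:
        return False
    for traj_id in list(servers[s_max]):
        cost = servers[s_max].pop(traj_id)
        servers[s_min][traj_id] = cost
        load[s_max] -= cost
        load[s_min] += cost
        if load[s_max] < load[s_min]:  # stop once the roles have swapped
            break
    return True


cluster = {"A": {1: 3.0, 2: 3.0, 3: 3.0, 4: 3.0}, "B": {5: 5.0}}
moved = rebalance(cluster)
print(moved, cluster)
```

Under this reading the scan can overshoot by one trajectory; a variant that stops as soon as the two workloads are closest would be an equally plausible design.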

4. Performance Evaluation

In this section, we conduct extensive experiments to demonstrate the efficiency of the RNDLP framework. The experiments are based on both real datasets and synthetic datasets. In the following, we first explain the datasets used in our experiments and the settings of our experiments and then report our findings.

4.1. Experiment Settings

Datasets. In total, four datasets are utilized in our experiments, comprising three real datasets (Beijing, Porto, and NYC) and a synthetic dataset named Normal. Beijing is sourced from the Microsoft T-Drive project, encompassing GPS trajectories of 10,357 taxis recorded from 2 February to 8 February 2008. Porto consists of 1,710,671 trajectories, describing the movement of 442 taxis from 7 January 2013 to 30 June 2014 in the city of Porto. NYC is obtained from the New York City Taxi and Limousine Commission and contains 2.36 GB of trip records, from which we generate a group of trajectories. Each record corresponds to a trajectory whose start point and end point are set to the pick-up point and drop-off point of the record, and the trajectory itself is generated as the shortest path between these two points.

Normal is a synthetic trajectory dataset created by simulating the trajectories of moving objects on urban roads. The road networks underlying these four datasets are summarized in Table 1. We pre-process these datasets by retaining only four attributes, namely taxi ID, location longitude, location latitude, and timestamp. Note that the running time of our proposed framework is independent of the scale of the graph. The reason is that we use a hash table to maintain the shortest-path distance between every two vertices, so the distance between two points over the road network can be calculated at

O (1)

running cost.
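The idea of a precomputed distance table with constant-time lookup can be illustrated on a tiny graph. The sketch below fills a dictionary keyed by vertex pairs using the Floyd–Warshall algorithm; this construction is our stand-in for illustration and is not claimed to be how the hash table of Section 3.2 is built. The function name and the undirected-edge assumption are also ours.

```python
from typing import Dict, Iterable, Tuple


def all_pairs_table(num_vertices: int,
                    edges: Iterable[Tuple[int, int, float]]
                    ) -> Dict[Tuple[int, int], float]:
    """Hash table (u, v) -> shortest-path distance on an undirected graph."""
    inf = float("inf")
    dist = {(u, v): (0.0 if u == v else inf)
            for u in range(num_vertices) for v in range(num_vertices)}
    for u, v, w in edges:
        dist[(u, v)] = min(dist[(u, v)], w)
        dist[(v, u)] = min(dist[(v, u)], w)
    # Floyd-Warshall relaxation over every intermediate vertex.
    for mid in range(num_vertices):
        for u in range(num_vertices):
            for v in range(num_vertices):
                via = dist[(u, mid)] + dist[(mid, v)]
                if via < dist[(u, v)]:
                    dist[(u, v)] = via
    return dist


table = all_pairs_table(3, [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0)])
print(table[(0, 2)])  # single dictionary lookup at query time
```

Once the table exists, each distance query is a single dictionary lookup, which is what makes the query-time cost independent of the graph size.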

Experimental Methodology. In our study, we first load the road network, the hash table, as well as all trajectories into the memory. Next, we scan each trajectory

T_{i}

in the trajectory set, and compute its score or lower-bound score. After scanning, we assign these trajectories to different servers based on the cost model discussed before. From then on, we monitor the alarm time of each trajectory and update the scores of the trajectories whose alarm time is the current time unit. When all trajectories are processed, we report the total running time.

Parameters. In our study, we evaluate the performance of different algorithms under four parameters: N, n, k, and

U_{\max}

. Here, N denotes the number of trajectories within the trajectory set

T

; n represents the number of GPS points within each trajectory; k is the input parameter for CKTRN; and

U_{\max}

denotes the maximal traveling speed of objects per time unit. The parameter settings are presented in Table 2, with the default values bolded.

Performance Metrics. The updating time is employed as the main performance metric. It refers to the average time used to process newly generated GPS points.

Competitors. In addition to the algorithm included in the RNDLP framework, we also implement a baseline algorithm named BASE. Note that the BASE algorithm updates the scores of all trajectories whenever one time unit has passed. All the algorithms are implemented in C++, and all the experiments are conducted on a 6226R CPU with 256 GB of memory, running Microsoft Windows 10.

4.2. Performance Comparison

Updating Cost Comparison. In this section, we compare the performance of RNDLP with its competitors when supporting CKTRN under a data stream.

We present the running time of all algorithms under different k values in Figure 4a–d. Across all evaluated k values, RNDLP consistently outperforms BASE on all four datasets. For instance, on the PORTO dataset, the running time of RNDLP is only

0.58 %

of that of BASE; on the BEIJING dataset, it is

0.81 %

; and on the NORMAL dataset, it is

0.69 %

. The notable improvement arises from RNDLP considering the spatio-temporal correlation among GPS points in each trajectory. The corresponding virtual trajectory is closer to the real trajectory, allowing RNDLP to provide trajectories with tight score bounds by accessing a small number of GPS points. In addition, RNDLP uses a hash table to maintain the distance between every two vertices of the road network.

Furthermore, we report the running time of different algorithms under various N values in Figure 4e–h. As N increases, the running time of BASE rises sharply because BASE has to access all trajectories whenever the window slides. In contrast, RNDLP is not sensitive to N thanks to the predictive nature of its algorithms. It does not need to maintain N trajectories in real time, resulting in a much more stable performance under various parameter settings. Another important reason is that we use a cost model to partition trajectories across different servers, so that the overhead of each server is roughly the same. In contrast, BASE does not consider how to partition trajectories evenly, so in many cases the trajectories are skewed across servers. Thus, the total running cost of BASE is higher than that of RNDLP.

The running time of different algorithms under various n values is reported in Figure 4i–l. Here again, RNDLP outperforms its counterpart. Notably, BASE’s running time gradually increases with the growth of n since it has to access more GPS points when trajectories’ prediction moments are updated. Conversely, RNDLP’s running time does not change significantly with increased n values. This is because the larger the n, the larger the distance from the starting points of objects’ trajectories to their end points. In this way, RNDLP can accurately calculate the score bounds of trajectories by setting

τ

to a small value. In addition, as RNDLP can dynamically adjust the prediction moments of trajectories, it can further reduce the cost of incremental maintenance.

The running time of different algorithms under different

U_{m a x}

is reported in Figure 4m–p. We find that RNDLP performed better again. In addition, BASE is not sensitive to

U_{m a x}

. The reason is that BASE does not consider the speed constraint. The running time of RNDLP gradually increases, but it still incurs a much lower cost than the BASE algorithm. This is because the larger the

U_{m a x}

, the looser the lower-bound score RNDLP can provide. However, as

U_{m a x}

is usually not high in most applications, RNDLP is the most efficient in most cases (Table 3).

To sum up, RNDLP is both stable and efficient. It requires the lowest running time to support continuous k-similarity search over road network compared with the BASE algorithm.

5. Conclusions

In this paper, we propose a novel framework named RNDLP to support continuous k-similarity trajectories search over road network. The framework can efficiently return query result trajectories by accessing only a few GPS points from a subset of trajectories within the entire trajectory set. Additionally, we propose a pair-based method to enhance algorithm performance. Through the calculation of lower-bound scores for trajectories, we observe that processing a small number of trajectories with each slide of the window is sufficient. As a result, our framework efficiently supports continuous k-similarity trajectories search over data streams. We conducted extensive experiments to evaluate the performance of our proposed algorithms on several datasets. The results consistently demonstrate the superior performance of our proposed algorithms.

Author Contributions

This research was jointly performed by H.J., S.T., R.Z. and B.W. Methodology, H.J.; resources, H.J.; writing—original draft, S.T.; supervision, R.Z. and B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Social Science Planning Fund of Liaoning Province (No. L21BGL042).

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found here: BEIJING: https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ (accessed on 7 January 2024); PORTO: https://tianchi.aliyun.com/dataset/94216 (accessed on 7 January 2024); NYC: https://opendata.cityofnewyork.us/ (accessed on 7 January 2024).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

$T_{i}$	the trajectory
$T_{q}$	the query trajectory
D( $T_{i}$ )	the score of $T_{i}$
$\underset{̲}{b}$ ( $T_{i}$ )	the lower-bound score of $T_{i}$
$\bar{b}$ ( $T_{i}$ )	the upper-bound score of $T_{i}$
$p_{i}^{j}$	the j-th generated GPS point in $T_{i} . P$
$T_{i}^{α, β}$	a sub-trajectory of $T_{i}$ with first/last generated point $p_{i}^{α}$ / $p_{i}^{β}$
$\underset{̲}{b}$ ( $T_{i}, τ$ )	the lower-bound score of $T_{i}$ under $τ - L$ -score
$T_{i} . t$	the predicted moment of $T_{i}$

Figure 1. Continuous k-similarity trajectory search over road network (

k = 1

), where (a) denotes the initial case of a given trajectory; (b) denotes the update of the trajectory and query results after one unit of time has elapsed.

Figure 2. Hash table construction.

Figure 3. Pair-based lower-bound score calculation, where (a) and (b) respectively represent the graphical depiction of the distance from

o_{2}

to

p_{2}^{8}

over different time units.

Figure 4. Running time comparison of different algorithms under different datasets.

Table 1. Road network information.

Datasets	Number of Vertices	Number of Edges
PORTO	114,099	1,507,611
BEIJING	54,198	126,827
NYC	264,346	733,846
NORMAL	1000	499,500

Table 2. Parameter settings.

Parameter	Value
N	200 K, 400 K, 600 K, 800 K, 1 M
n	200, 400, 600, 800, 1 K
k	20, 40, 60, 80, 100
$U_{m a x}$	20 km/h, 30 km/h, 40 km/h, 50 km/h, 60 km/h

Table 3. Performance analysis of different similarity measures.

		Running Time (s)
Measure	Algorithm	PORTO	NORMAL	BEIJING	NYC
ED-SUM	HRZ	0.36	0.38	0.37	0.37
	RNDLP	0.0007	0.002	0.0008	0.003
ED-Max	HRZ	0.19	0.35	0.23	0.36
	RNDLP	0.04	0.12	0.04	0.12
Frechet	HRZ	43.35	50.15	34.27	47.63
	RNDLP	2.69	2.78	2.77	2.72

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
