RNDLP: A Distributed Framework for Supporting Continuous k-Similarity Trajectories Search over Road Network

Abstract: Continuous $k$-similarity trajectories search over a data stream is an important problem in the domain of spatio-temporal databases. Given a set of trajectories $\mathcal{T}$ and a query trajectory $T_q$ over a road network $G$, the system monitors the trajectories in $\mathcal{T}$ and reports the $k$ trajectories most similar to $T_q$ whenever one time unit passes. Some existing works study $k$-similarity trajectories search over trajectory data, but they cannot work in a road network environment, especially when the trajectory set is large. In this paper, we propose a novel framework named RNDLP (Road Network-based Distance Lower-bound-based Prediction) to support continuous $k$-similarity trajectories search over road networks (CKTRN). It is a distributed framework based on the following observation: given a trajectory $T_i$ and the query trajectory $T_q$, once $D(T_i)$ is known, we can compute the lower-bound and upper-bound distances between $T_q$ and $T_i$, which enables us to predict the scores of trajectories in $\mathcal{T}$ and use these predictions to assess their significance. Accordingly, we form a mathematical model to evaluate the expected running cost of each trajectory. Based on the model, we propose a partition algorithm that assigns trajectories to a group of servers so that the workload of each server is as even as possible. On each server, we propose a pair-based algorithm to predict the earliest time at which $T_i$ could become a query result, and we use the predicted results to organize the trajectories. The proposed algorithm supports query processing by accessing only a few points of a small number of trajectories whenever trajectories are updated. Finally, we conduct extensive performance studies on large real and synthetic datasets, which demonstrate that our framework efficiently supports CKTRN over a data stream.


Introduction
This paper addresses the challenge of continuous $k$-similarity trajectories search over road networks (abbreviated as CKTRN), a problem with various applications [1][2][3][4][5] across diverse domains [6][7][8][9]. For instance, it helps identify compressed representations of trajectories that preserve their essential characteristics, reducing storage requirements and transmission costs. Moreover, CKTRN plays a pivotal role in traffic analysis by uncovering trajectories with consistent patterns and behaviors. This information is valuable for predicting congestion, understanding traffic flow, and optimizing road networks. Lastly, it facilitates the clustering and grouping of moving objects that exhibit similar movement patterns, providing insights for urban planning, traffic analysis, and beyond.
Let $G\langle V, E\rangle$ represent the road network, with $V$ being the vertex set and $E$ the edge set [1,10]. Each edge $e$ is represented by the tuple $\langle v_s, v_e, w\rangle$, where $v_s$ and $v_e$ denote the starting and ending points of $e$, respectively, and $w$ is the weight of $e$, which equals the distance between $v_s$ and $v_e$ within $G$. The trajectory of a moving object $o$ across $G$ is defined as the tuple $T\langle o, P, n\rangle$, where $P$ is the collection of $n$ points $p_1, p_2, \cdots, p_n$ generated by $o$ over the last $n$ time units. Each point $p \in P$ is represented as $p\langle e, v, d\rangle$, meaning that $p$ is positioned on edge $e$ and arrives at vertex $v$ after covering a distance of $d$. In this paper, points in $P$ are modeled by a time-based window [11]. Under this setting, points are generated during the last $n$ time units. Whenever one time unit passes, the first point $p_1 \in P$ expires and is removed from $P$, and a newly generated point is inserted into $P$.
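The data model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class names `Edge`, `Point`, and `Trajectory` and the method `append` are hypothetical, and the window is assumed to hold exactly the last $n$ points.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    v_s: int    # starting vertex
    v_e: int    # ending vertex
    w: float    # edge weight (distance between v_s and v_e)


@dataclass(frozen=True)
class Point:
    e: Edge     # edge the point lies on
    v: int      # vertex the point is heading to
    d: float    # distance still to cover before arriving at v


@dataclass
class Trajectory:
    o: str                      # moving-object identifier
    n: int                      # window size in time units
    P: deque = field(default_factory=deque)

    def append(self, p: Point) -> None:
        """Insert the newest point; expire the oldest once the window is full."""
        if len(self.P) == self.n:
            self.P.popleft()
        self.P.append(p)


# Usage: a window of four points receives a fifth point.
edge = Edge(0, 1, 500.0)
t2 = Trajectory(o="o2", n=4)
for remaining in (400.0, 300.0, 200.0, 100.0, 0.0):
    t2.append(Point(edge, 1, remaining))
```

A `deque` is used here because both expiring the oldest point and inserting the newest one take constant time.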
Let $T_q$ represent a query trajectory [1,2,12], which monitors a set of trajectories denoted as $\mathcal{T}$. Whenever one time unit passes, the system searches the trajectories in $\mathcal{T}$ and returns the $k$ trajectories with the lowest scores. The score of a trajectory $T_i \in \mathcal{T}$, denoted as $D(T_i)$, is the distance between $T_i$ and the query trajectory $T_q$. Given a trajectory $T \in \mathcal{T}$ and a query trajectory $T_q$, the distance between the corresponding points $T(p_i)$ and $T_q(p_i)$ is defined as the shortest distance between these points within the road network $G$. The distance between $T$ and $T_q$ is the sum of the distances between their corresponding points.
Take the example in Figure 1a. There are three trajectories $\{T_1, T_2, T_3\}$ in $\mathcal{T}$. Each trajectory $T_i$ contains four GPS points generated by the moving object $o_i$ during the last four time units; e.g., $T_2$ contains the four GPS points $\{p_2^1, p_2^2, p_2^3, p_2^4\}$ generated by $o_2$. $T_q$ is the query trajectory, and $p_q^1\langle e\langle v_0, v_1, 500\rangle, v_1, 100\rangle$ means that $p_q^1$ is the first GPS point of $T_q$ and is positioned on edge $e\langle v_0, v_1, 500\rangle$, arriving at vertex $v_1$ after covering a distance of 100 m. Assume that the moving objects can travel up to 100 m per time unit. The distance between $T_2$ and $T_q$ equals 1600 (= 100 + 300 + 500 + 700). The distances between $T_q$ and the three trajectories are {2000, 1600, 2000}, respectively. As $k = 1$, the query result is $\{T_2\}$. As shown in Figure 1b, after one time unit passes, the points in $T_2$ are updated to $\{p_2^2, p_2^3, p_2^4, p_2^5\}$, the distances between $T_q$ and the three trajectories are updated to {1200, 2300, 2000}, respectively, and the query result is updated to $\{T_1\}$.
This approach also has significant practical applications. For instance, it can play a crucial role in a real-time anti-tracking system. Specifically, a user $u$ might ask the system to check whether any existing trajectories are tracking $u$. The system can fulfill this request by identifying the $k$ trajectories that are most similar to the real-time trajectory generated by $u$. If certain trajectories closely resemble this trajectory, it suggests potential tracking of $u$, and the system can promptly relay this information to $u$. Additionally, CKTRN can be extended to other real-time systems such as online car sharing [13], popular route identification, and more.
Numerous researchers have studied the problem of $k$-similarity trajectories search [14][15][16][17][18]. However, most of these efforts concentrate on $k$-similarity trajectories search over static or historical trajectory datasets. Sacharidis et al. [13] explored the problem within the context of data streams, enabling the retrieval of similar trajectories in real time. Yet, their approach measures similarity by considering the distance between two representative points of the trajectories (as discussed in Section 2), which often fails to evaluate trajectory similarity accurately. Zhu et al. [19] investigated the problem over GPS point-based data streams. However, they measure the similarity between two trajectories by the Euclidean distance between the corresponding GPS points, which does not work effectively over a road network. Therefore, an efficient algorithm that both accurately evaluates similarity among trajectories and supports CKTRN in real time is desired.
However, efficiently supporting CKTRN over a road network poses several challenges. Firstly, the scale of each trajectory is typically large. In the context of data streams, where trajectory points are frequently updated, efficiently updating trajectory scores in real time is challenging. Secondly, the trajectory set is also large. As the window slides (with each time unit passing), updating the scores of all trajectories and identifying the new query result trajectories among them incurs a substantial computational cost. Moreover, maintaining all trajectories on a single server is difficult. Thirdly, in a road network environment, calculating the distance between two points is itself expensive, which further increases the overall computational cost.
In this paper, we introduce a novel framework called RNDLP (Road Network-based Distance Lower-bound-based Prediction) designed to support CKTRN over road networks. This framework is distributed and relies on two key observations. Firstly, in a road network $G\langle V, E\rangle$, the edges in $E$ and the vertices in $V$ are not updated frequently. Consequently, we can pre-calculate the shortest distances among all vertices in $G$ and use these pre-computed values to speed up distance calculations among trajectory points.
Secondly, given a non-query result trajectory $T_i$ and the query trajectory $T_q$, once $D(T_i)$ is known, we can compute the lower-bound and upper-bound distances between $T_q$ and $T_i$. This computation enables us to predict the scores of trajectories in $\mathcal{T}$ and use these predictions to assess their significance. Essentially, if the lower-bound score of $T_i$ remains consistently high over many time units, the trajectory is unlikely to become a query result for an extended period. Such a trajectory is of lower importance, and there is no need to monitor it closely over the long term. Conversely, trajectories with fluctuating scores require more frequent monitoring. In essence, only a small subset of trajectories needs score tracking with each passing time unit. In summary, our contributions are as follows.

• Hash-based Distance Calculation. We introduce a hash-based index to manage distances between points within the road network $G$. Specifically, we pre-compute the distances between vertices using the Floyd algorithm and use a hash table to maintain the distance between any two vertices. In this way, we avoid the high running cost of calculating the distance between two points; instead, the distance between two GPS points over a road network is computed at $O(1)$ running cost.
• Pair-based Dynamic Prediction Algorithm. We introduce a novel algorithm called PAIRDP (short for PAIR-based Dynamic Prediction) as an enhancement of the PDSP algorithm discussed in [20]. PAIRDP brings improvements in two main aspects. Firstly, it exploits the inherent spatio-temporal correlation in GPS points to predict more accurately the moment at which a trajectory could become a query result. Secondly, PAIRDP incorporates a dynamic adjustment mechanism for these predicted moments, which relies on the scores of the query result trajectories. With this dynamic adjustment, the algorithm can significantly reduce the frequency of trajectory access, improving operational efficiency.
• Model-based Partition Algorithm. Using the prediction results, we establish a cost model to assess the expected running cost of each trajectory. Based on it, we introduce a greedy algorithm that partitions trajectories across different servers so that the workload of each server is as even as possible. Additionally, we propose an incremental maintenance algorithm to adapt the partition under a data stream.
The remainder of this paper is structured as follows. Section 2 provides an overview of the existing literature in the field and outlines the problem definition. Section 3 introduces our proposed framework. In Section 4, we present the outcomes of our comprehensive experimental evaluation. Lastly, Section 5 offers concluding remarks by summarizing our key findings.

Preliminary
In this section, we first review important existing results related to $k$-similarity trajectory search. We then introduce the problem definition.

Related Works
In recent years, researchers have focused on addressing the challenge of trajectory similarity search [8,10,21]. The work in this domain can be categorized into two parts: ad hoc $k$-similarity trajectories queries and continuous $k$-similarity trajectories queries. Ad hoc $k$-similarity trajectory queries concentrate on enhancing query result accuracy through the design of similarity functions. For example, Lei Chen et al. [22] devised the EDR (Edit Distance on Real sequence) similarity measure function. Gajanan Gawde [23] leveraged trajectory polygon shapes for similarity comparisons.
The continuous $k$-similarity trajectories query can be further categorized into historical trajectory data-based and streaming trajectory data-based methods. Historical trajectory data-based approaches, exemplified by Güting et al. [20], use spatio-temporal indexes such as R-trees to support query processing. They consider each trajectory as a sequence of units and construct an index structure to facilitate a $k$-nearest neighbor search. On the other hand, streaming trajectory data-based methods, as studied by Sacharidis et al. [13], handle the continuous updating of trajectory data. These methods evaluate the distance between a query object and other mobile objects within a time window based on their maximum or minimum distance over timestamps. However, this approach uses only a subset of GPS points to represent trajectory distances, potentially affecting query precision.
Zhu et al. [19] proposed a sketch-based prediction algorithm to support a GPS point-based $k$-similarity trajectories search. The algorithm is based on the following observation: if the distance between the last point of the query trajectory, denoted as $p_q^n$, and the corresponding point in a trajectory $T_i$ is large, the overall distance of $T_i$ to $T_q$, denoted as $D(T_i)$, will also be large. Based on this observation, the authors provide formal bounds for the upper bound ($\overline{b}(T_i)$) and lower bound ($\underline{b}(T_i)$) of $T_i$. In addition to the score bounds calculation, Zhu et al. propose a structure called Partition-based Distance Sketch (PDS). The PDS is designed to summarize the distance distribution among GPS points in each trajectory $T_i$ and the query trajectory $T_q$. It provides a compact representation of the distances between points within each trajectory, enabling efficient prediction and query result determination. Using the PDS, the algorithm can avoid monitoring a trajectory until its predicted moment ($T_i.t$). Consequently, it saves considerable running cost, and the corresponding GPS points generated within the time interval $[T_i.s, T_i.t - 1]$ can be safely deleted, further reducing the storage requirements.
Discussion. In summary, most algorithms are based on a historical trajectory database and cannot support query processing in real time. The algorithm proposed by Sacharidis et al. [13] cannot accurately evaluate similarity among trajectories in many cases. The effort proposed by Zhu et al. [19] is based on GPS points measured by Euclidean distance. Therefore, an efficient algorithm that both accurately evaluates similarity among trajectories and supports the $k$-similar trajectories search in real time is desired.

Problem Definition
A time-based window [24] can be formally defined as a data organization and management technique that partitions a continuous stream of data into fixed or variable time intervals. Formally, a time-based window can be represented as a tuple $W(T, \Delta)$, where $T$ represents the current timestamp or time reference point, and $\Delta$ denotes the duration of the time interval, which defines the size of the window. In this paper, we apply time-based sliding windows to model trajectory data over a road network, as stated in Definition 1.

Definition 1 (Trajectory under Sliding Window).
A trajectory $T$ of an object $o$, represented by the tuple $T\langle o, \Delta_t, P, G\rangle$, contains a group of GPS points $T.P$ traversing the road network $G$, generated within the last $\Delta_t$ time units.
To be more specific, let $T_i$ be a trajectory within the trajectory set $\mathcal{T}$. $T_i.P$ can be denoted as $\{p_i^1, p_i^2, \ldots, p_i^n\}$. With the passage of each time unit, a point is removed from $T_i.P$ and another point is inserted into it. Each GPS point $p$ in $T_i.P$ is defined as $p\langle e, v, d\rangle$, indicating that $p$ resides on edge $e$ and arrives at vertex $v$ after a distance of $d$.
The distance between two trajectories, $T$ and $T'$, is calculated using Equation (1). Here, $D(p_i, p'_i)$ represents the distance between two GPS points $p_i$ and $p'_i$ contained in $T$ and $T'$, respectively. This distance corresponds to the shortest distance between $p_i$ and $p'_i$ across the road network $G$, and $p_i$ (or $p'_i$) signifies the $i$-th generated GPS point in $T$ (or $T'$). Take the example in Figure 1. Let $T_2$ be the trajectory of an object $o_2$ at time unit $t_0$. It contains the last four GPS points $\{p_2^1, p_2^2, p_2^3, p_2^4\}$ generated by $o_2$. The distance between $p_2^1$ and $p_q^1$ equals 100, and $D(T_q, T_2)$ is 1600 m at $t_0$. After one time unit passes, $p_2^1$ is removed from $T_2.P$ and $p_2^5$ is inserted into $T_2.P$. In other words, $T_2.P$ is updated to $\{p_2^2, p_2^3, p_2^4, p_2^5\}$ at $t_1$, and $D(T_2, T_q)$ is updated to 2300 m. In the following, we formally introduce the problem definition.

$$D(T.P, T'.P) = \sum_{i=1}^{n} D(p_i, p'_i) \qquad (1)$$
Definition 2 (CKTRN). Let $\mathcal{T} = \{T_1, T_2, \ldots, T_N\}$ represent a set of $N$ trajectories over road network $G$. We consider a query $q\langle k, T_q\rangle$, with $T_q$ being the query trajectory and $k$ being a query parameter. The objective of CKTRN is to monitor these $N$ trajectories and return the $k$ trajectories with the smallest scores to the system whenever one time unit passes.
Back to the example in Figure 1. At $t_0$, the distances between $T_q$ and the three trajectories are {2000, 1600, 2000}, respectively. As $k = 1$, the query result is $\{T_2\}$. After one time unit passes, the distances between $T_q$ and the three trajectories are updated to {1200, 2300, 2000}. The query result is updated to $\{T_1\}$.
In essence, CKTRN identifies the $k$ trajectories with the smallest scores and presents them as the top-$k$ most similar trajectories to the query trajectory $T_q$. This process is performed periodically to ensure continuous monitoring and retrieval of the most similar trajectories over time. The score of each trajectory $T_i$, denoted as $D(T_i)$, is the distance between $T_i$ and the query trajectory $T_q$. These scores are continually recomputed as the positions of the trajectories evolve in real time.
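The Figure 1 example can be replayed with a short brute-force sketch. It only illustrates the query semantics (sum of point-wise distances, then the $k$ smallest scores); it is not the RNDLP algorithm, and the function names `trajectory_distance` and `top_k` are hypothetical.

```python
import heapq


def trajectory_distance(point_distances):
    """Score of a trajectory: sum of point-wise road-network distances."""
    return sum(point_distances)


def top_k(scores, k):
    """Return the ids of the k trajectories with the smallest scores."""
    return heapq.nsmallest(k, scores, key=scores.get)


# Scores from the running example at t0 and after one time unit (t1).
d_t2 = trajectory_distance([100, 300, 500, 700])
scores_t0 = {"T1": 2000, "T2": d_t2, "T3": 2000}
scores_t1 = {"T1": 1200, "T2": 2300, "T3": 2000}

result_t0 = top_k(scores_t0, k=1)
result_t1 = top_k(scores_t1, k=1)
```

Recomputing every score at every time unit, as done here, is exactly the cost that the lower-bound prediction in the framework is meant to avoid.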

The Framework RNDLP
In this section, we introduce a novel framework called RNDLP (Road Network-based Distance Lower-bound-based Prediction) to address CKTRN over road networks. The section is organized as follows. We first provide an overview of the framework. Secondly, we detail the approach used to calculate distances among GPS points over road networks. Thirdly, we explain the partition-based initiation algorithm. Finally, we present the incremental maintenance algorithm.

The Framework Overview
Let $G = \langle V, E\rangle$ represent the road network. The RNDLP framework starts by pre-processing the road network, calculating the shortest paths between every pair of vertices in the vertex set $V$ (Algorithm 1). This pre-processing constructs a hash table $H$ that stores the shortest path length between each vertex pair in $V$ (lines 1-4). The construction of this hash table is explained in Section 3.2.

Algorithm 1: The Framework Overview
Input: $G = (V, E)$, trajectory set $\mathcal{T}$, query trajectory $T_q$, $k$, current window $W$
Output: Query result set $Q$, lists $I_q$, $I_c$
Hash map $M \leftarrow$ DSP($i$, $j$);
...
18 Partition($\mathcal{T}$);
19 Return $Q$, $I_q$, $I_c$;

After construction, the RNDLP framework scans each trajectory and calculates the lower-bound and upper-bound distances between each trajectory $T_i$ in the trajectory set $\mathcal{T}$ and the query trajectory $T_q$ (lines 5-15). More precisely, an initial set $R$ is formed, containing all trajectories. Then, for each trajectory $T_i \in \mathcal{T}$, if the lower bound $\underline{b}(T_i)$ is greater than a threshold $\theta_k$, $T_i$ cannot be a temporary query result trajectory at the current moment. In this case, $T_i$ is moved from the set $R$ to another set $R'$. Here, the threshold $\theta_k$ corresponds to the $k$-th lowest score among all trajectories in $\mathcal{T}$, and the manner of calculating $\underline{b}(T_i)$ and $\overline{b}(T_i)$ is explained in a later section.
The aforementioned operations are executed iteratively until the number of trajectories in the set $R$ is reduced to $k$. At this stage, the remaining $k$ trajectories in $R$ are the query result trajectories. Additionally, the framework predicts the moment at which each trajectory could become a query result trajectory. In this paper, we introduce a pair-based algorithm to calculate $\underline{b}(T_i)$ and $\overline{b}(T_i)$ for each $T_i$, with a detailed explanation provided in Section 3.3. The prediction algorithm is discussed in Section 3.4.
Intuitively, the earlier the prediction moment of a trajectory, the more important it is to the query trajectory, and the higher the running cost it incurs. To distribute the workload (or communication cost) of the servers as evenly as possible, we construct a cost model, based on the lower-bound score of each trajectory, that assesses the running cost associated with each trajectory. Using this model, we allocate trajectories according to their expected running cost, thereby minimizing the workload differences among servers. Additionally, we present a greedy algorithm to partition the trajectories in $\mathcal{T}$ among a group of servers. The cost model and partition algorithm are elaborated upon in Section 3.5 (lines 19-20).
Once the sets $R$ and the partition are formed, the incremental algorithm is executed. In comparison to the algorithm discussed in previous work, our proposed algorithm can dynamically adjust the threshold value $\theta_k$. This dynamic adjustment helps reduce the frequency of score calculations for trajectories in $L_T$. In addition, the partition itself should be adjusted dynamically. The incremental maintenance algorithm is presented in detail in Section 3.6.
Cost Analysis. As we use a hash table to maintain the shortest paths among all vertices, the scale of the table is bounded by $O(|E|^2)$. As we maintain the shortest paths themselves, the size of each path is bounded by $O(|V|)$. As we maintain the GPS points of each trajectory in the worst case, this part of the space cost is bounded by $O(n \cdot N)$, with $n$ and $N$ being the number of GPS points contained in a trajectory and the number of trajectories, respectively. The overall space cost is determined by these three terms.
We now analyze the running cost of our proposed algorithms. The running cost of obtaining a shortest-path distance is bounded by $O(1)$, since it is a single hash-table lookup. Moreover, we use a binary search to find the prediction moment, so the search time is bounded by $O(\log n)$. As the cost of calculating the lower-bound/upper-bound score of a trajectory is $O(1)$, the prediction cost is $O(\log n)$.

Hash-Based Score Calculation Algorithm
As previously mentioned, the distance between two trajectories is calculated as the sum of the shortest path distances between their corresponding GPS points within the road network. When the positions of the GPS points along a trajectory are updated, the trajectory's distance needs to be recalculated, which involves multiple computations of shortest path lengths.
To mitigate this computational cost, we pre-compute the shortest path distances between all pairs of vertices in the road network and store these pre-computed values in a hash table. Once the index is constructed, the distance between two points can be computed efficiently by directly accessing the hash table-based index. In the following, we explain how this index is constructed and how the distance between two GPS points is computed.
Construction. Given a road network $G\langle V, E\rangle$, our first task is to compute the shortest paths between all pairs of vertices in $V$. Once these calculations are complete, we generate a set of pairs $P\{p(1, 2), p(1, 3), \cdots, p(m - 1, m)\}$, where each pair $p(i, j)$ consists of two vertices $v_i$ and $v_j$ in $V$. Its value equals the distance between $v_i$ and $v_j$ within the road network $G$.

Hash Table
After forming these pairs, the next step is to construct the hash table $H$. Specifically, we initialize an empty hash table $H$ with a bucket size of $|V|^2$. Subsequently, we map each pair $p(i, j)$ into $H$, computing its key through the following equation: $i \times |V| + j$.
Hash-based Distance Calculation. Given two GPS points $p\langle e, v, d\rangle$ and $p'\langle e', v', d'\rangle$, the distance between them across the road network $G$ is calculated as follows. Initially, we determine the key of the corresponding pair $P(p, p')$, which corresponds to $p(e)$ and $p'(e)$, by employing Equation (2). Subsequently, we access the hash table $H$ to retrieve the pair $P(p, p')$, which gives us the distance between $p(v)$ and $p'(v)$. Finally, the distance between $p$ and $p'$ is calculated based on Equation (3).
p(i, j)
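The construction described above can be sketched as follows. The sketch assumes an undirected road network with vertices numbered from 0 and uses a plain Python `dict` as the hash table; the function names `build_distance_table` and `vertex_distance` are hypothetical. It covers only the vertex-to-vertex lookup, not the final point-to-point adjustment.

```python
import math


def build_distance_table(num_vertices, edges):
    """All-pairs shortest distances (Floyd algorithm), stored under key i*|V|+j.

    edges: iterable of (i, j, w); treated as undirected (an assumption).
    """
    V = num_vertices
    dist = [[math.inf] * V for _ in range(V)]
    for i in range(V):
        dist[i][i] = 0.0
    for i, j, w in edges:
        dist[i][j] = min(dist[i][j], w)
        dist[j][i] = min(dist[j][i], w)
    # Relax every pair through each intermediate vertex m.
    for m in range(V):
        for i in range(V):
            for j in range(V):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return {i * V + j: dist[i][j] for i in range(V) for j in range(V)}


def vertex_distance(H, num_vertices, i, j):
    """O(1) lookup of the pre-computed distance between vertices i and j."""
    return H[i * num_vertices + j]


# Usage on a three-vertex path 0 -(500)- 1 -(300)- 2.
H = build_distance_table(3, [(0, 1, 500.0), (1, 2, 300.0)])
d02 = vertex_distance(H, 3, 0, 2)
```

The pre-computation costs $O(|V|^3)$ time once, after which each lookup is a single dictionary access.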

The Pair-Based Lower-Bound Score Calculation
Our approach is built on a key insight: when calculating the lower-bound score of a trajectory, we can achieve more accurate results by considering both the start and end points of the trajectory, as stated in Lemma 1 and Lemma 2. For simplicity, $p_i(1, n)$ refers to the pair constructed from $p_i^1, p_i^n \in T_i.P$, and $d_j$ ($1 \le j \le n$) refers to the distance between $p_i(j)$ and $p_q(j)$.
Intuitively, if we consider only one point within a trajectory when calculating its lower-bound score, the difference between the lower-bound score and the real score may be very large, especially when the trajectory is long. By contrast, when we calculate the lower-bound score of a trajectory $T_i$ by considering both its start and end points, the start point and end point of the virtual trajectory are the same as those of $T_i$, so the spatio-temporal constraint yields a more reasonable virtual trajectory. Accordingly, we can tighten the lower-bound score of $T_i$. As the cost model we form is based on the lower-bound score of each $T_i$, the pair-based lower-bound score calculation also makes the model more workable.
Lemma 1. Let $T_q$ be the query trajectory, and $T_i \in \mathcal{T}$ be a trajectory. When $d_1 \ge 2 t_x U_{max}$, the lower-bound score of $T_i$, denoted as $\underline{b}(T_i)$, equals $(n - 1)(d_1 - t_x U_{max}) + d_n$.
Proof. We prove it by forming two virtual trajectories $VT_i$ and $VT_q$ based on $T_i$ and $T_q$. These virtual trajectories simulate the movement of objects. Initially, $VT_i$ and $VT_q$ are located at $p_i^1$ and $p_q^1$, respectively. After $t_x$ time units, they move towards each other and reach $p_i^n$ and $p_q^n$, respectively, by $T_i.e$. Along their paths, the distance between $vp_i(j)$ and $vp_q(j)$ is $d_1 - 2(j - 1)U_{max}$ when $j \le t_x + 1$, and 2t_x U_max (j − n) ...
Lemma 2. Let $T_q$ be the query trajectory, and $T_i \in \mathcal{T}$ be a trajectory. When d ...
Proof. We prove it by forming another two virtual trajectories $VT'_i$ and $VT'_q$ based on $T_i$ and $T_q$. Initially, $VT'_i$ and $VT'_q$ are located at $p_i^1$ and $p_q^1$, respectively. After $\frac{d_1}{2U_{max}}$ time units, they move towards each other and reach the same position $vp$. They stay at $vp$ for $t'_x$ time units and then move to $p_i^n$ and $p_q^n$, with $t'_x$ being $t_x - \frac{d_1}{2U_{max}}$. Under these two paths, the distance between $vp_i(j)$ and $vp_q(j)$ equals $d_1 - 2(j - 1)U_{max}$ when $j \le \frac{d_1}{2U_{max}} +$ ...
... • 100 = 500. Thus, it can arrive at $p_2^8$ at $T_2.e$. Therefore, $t_x$ is set to 3. Note that, in implementation, we could use a binary search to find the maximal $t_x$. As it is simple, we skip the details due to space limitations.
We also find that, if a partition is applied, the corresponding lower-/upper-bound scores can be further tightened. Accordingly, we propose a partition-based method to tighten the lower-/upper-bound scores of the trajectories. Formally, for each trajectory contained in the trajectory set $\mathcal{T}$, we partition it into a group of sub-trajectories such that ... Here, $T_i(j, m_i)$ refers to the $j$-th sub-trajectory of $T_i$, and its scale equals $m_i$. $T_i(j, m_i)$ also contains a set of GPS points, which are the $\{(j - 1) \cdot m_i + 1, (j - 1) \cdot m_i + 2, \ldots, j \cdot m_i\}$-th generated GPS points in $T_i$, and $n_i$ refers to the number of sub-trajectories. Based on the partition result, we can update $\underline{b}(T_i)$ and $\overline{b}(T_i)$ to $\sum_{j=1}^{n_i} \underline{b}(T_i(j, m_i))$ and $\sum_{j=1}^{n_i} \overline{b}(T_i(j, m_i))$, respectively. In addition, $\underline{b}(T_i(j, m_i))$ and $\overline{b}(T_i(j, m_i))$ can be calculated based on Theorem 2.
Back to the example in Figure 3a. If the partition amount is 1, $\underline{b}(T_2)$ is calculated as the sum of the real location distances $D(p_2^1, p_q^1)$ and $D(p_2^8, p_q^8)$ and the virtual location distance $\sum_{i=2}^{7} D(vp_2^i, vp_q^i)$. Because $t_x$ equals 3, $D(vp_2^2, vp_q^2) = 3200 - 1 \cdot 2 \cdot 100 = 3$ km, and $D(vp_2^3, vp_q^3)$ and $D(vp_2^4, vp_q^4)$ equal 2.8 km and 2.6 km, respectively. Then, D(vp_2^4, p ...
Intuitively, if the partition amount is large, the running cost of maintaining the partition is high, but the lower-/upper-bound scores are tight. In contrast, if the partition amount is small, the maintenance cost is low, but the lower-bound score is loose. It is therefore important to find a flexible method that self-adaptively partitions trajectories based on their distance relationship to the query trajectory.
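The virtual-distance arithmetic in the Figure 3a example can be checked with a small sketch. It covers only the approaching phase, where the two virtual objects close the gap at $2U_{max}$ per time unit for $j \le t_x + 1$; the function name `virtual_gaps` is hypothetical, and distances are in metres.

```python
def virtual_gaps(d1, u_max, t_x):
    """Gap between the two virtual points at positions j = 2 .. t_x + 1.

    Each virtual object moves at most u_max per time unit towards the other,
    so the gap shrinks by 2 * u_max per step: d1 - 2 * (j - 1) * u_max.
    """
    return [d1 - 2 * (j - 1) * u_max for j in range(2, t_x + 2)]


# Values from the running example: d1 = 3200 m, U_max = 100 m, t_x = 3.
gaps = virtual_gaps(d1=3200, u_max=100, t_x=3)
```

The three resulting values correspond to the 3 km, 2.8 km, and 2.6 km figures quoted in the example.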

The Pair-Based Prediction Algorithm
In this section, we first explain the pair-based partition. Then, we explain the prediction algorithm.
The Pair-based Partition Algorithm. The goal is to find the query result trajectories by recursively partitioning trajectories and tightening their lower-/upper-bound scores. In this way, we can find the query result trajectories as well as calculate the lower-bound scores of the other trajectories. Intuitively, once the lower-bound score of a trajectory is computed, we can use the model (discussed later) to evaluate the running cost we spend on it, which helps us assign trajectories effectively.
In this section, we use Theorem 2 to calculate the lower-bound score of sub-trajectories. Intuitively, if pairs are considered in the lower-bound score calculation, we obtain a tighter lower-bound score. Accordingly, we propose the concept of the $\tau$-L-score. Here, we assume that each trajectory contains $n$ GPS points, with $n$ being $2^\tau$ and $\tau$ being an integer. In addition, the lower-bound score of each trajectory, denoted as $\underline{b}(T_i, \tau)$, is computed based on Equation (4), with r being n ...
We now formally explain the partition algorithm. Firstly, we access each $T_i \in \mathcal{T}$ and compute $\underline{b}(T_i)$ and $\overline{b}(T_i)$ in the manner discussed in Theorem 2. After calculating the lower-bound and upper-bound scores of all trajectories, we find $T_k$, i.e., the trajectory with the $k$-th lowest upper-bound score, and use $\overline{b}(T_k)$ to prune the trajectories in $\mathcal{T}$. In other words, for each trajectory $T_i \in \mathcal{T}$, if its lower-bound score is larger than $\overline{b}(T_k)$, it is not a query result trajectory, and we move it to the set $\mathcal{T}'$.
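One round of the pruning step can be sketched as below. The bound values are made up for illustration, and the function name `prune` is hypothetical; how the bounds themselves are obtained (Theorem 2, Equation (4)) is outside this sketch.

```python
def prune(bounds, k):
    """Split trajectories into candidates and pruned ones.

    bounds: {trajectory_id: (lower_bound, upper_bound)}
    A trajectory is pruned when its lower bound exceeds the k-th lowest
    upper bound, since at least k trajectories are then certainly closer.
    """
    kth_upper = sorted(upper for _, upper in bounds.values())[k - 1]
    keep = {tid for tid, (lower, _) in bounds.items() if lower <= kth_upper}
    pruned = set(bounds) - keep
    return keep, pruned


# Hypothetical bounds for three trajectories, with k = 1.
bounds = {"T1": (1000, 1500), "T2": (1200, 1800), "T3": (1600, 2500)}
keep, pruned = prune(bounds, k=1)
```

In this illustration `T2` survives the round even though its upper bound is higher than that of `T1`, because its lower bound still overlaps the threshold; tighter bounds from a further split would be needed to decide between them.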
For the remaining trajectories, we split each element in $\mathcal{T}$ into two sub-trajectories of equal scale and update its lower-bound score based on Equation (4). Again, for each of them, if there exist $k$ trajectories whose upper-bound scores are lower than its lower-bound score, it is moved from $\mathcal{T}$ to the set $\mathcal{T}'$. We repeat the above operations until the number of trajectories in $\mathcal{T}$ is reduced to $k$. At that moment, we use these $k$ trajectories as the query result trajectories in the current window. Note that, during the partition, we can form the corresponding PDSP structure for each trajectory.
The PDSP-based Prediction. After we form the PDSP for each trajectory, we predict the earliest moment, denoted as $T_i.t$, at which each trajectory $T_i$ may become a query result trajectory. In this way, we need not monitor $T_i$ before $T_i.t$. Specifically, we scan each trajectory $T_i$ and calculate $T_i.t$ based on Theorem 1. After the calculation, we use a priority queue to maintain these trajectories ordered by the $T_i.t$ of each $T_i$. As the algorithm is simple, we skip the details to save space.
Theorem 1. Given trajectories $T_i$, $T_k$, and $T_q$, $D(T_i)$ is no smaller than $D(T_k)$ after $\delta$ time units, with $\delta$ being computed based on the inequality $\underline{b}(T_i, \tau) \le D(T_k)$.

Proof. We assume that
Theorem 1 is proved under the following two cases: (i) $\delta > n$; (ii) $\delta \le n$. Under case (i), we use the manner discussed in Theorem 1 to find the maximal $\delta$.
We want to highlight that, once $T_i.t$ of each $T_i$ is calculated, we can ensure that $T_i$ cannot become a query result trajectory before $T_i.t$. In addition, we can evaluate the running cost we will spend on it. It is therefore convenient to evaluate its importance to the query result set and make a reasonable partition based on the prediction result.
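The first pruning step of the bound-based procedure described earlier in this section can be illustrated with a minimal Python sketch. It assumes that the lower-bound and upper-bound scores of every trajectory have already been computed (for instance, via Theorem 2); the function name `prune_by_kth_upper_bound` and the dictionary layout are illustrative assumptions, not part of the original framework.

```python
import heapq


def prune_by_kth_upper_bound(bounds, k):
    """Split trajectories into candidates and pruned ones.

    bounds: dict mapping a trajectory id to (lower_bound, upper_bound).
    A trajectory is pruned when its lower bound exceeds the k-th lowest
    upper bound, because at least k trajectories are then guaranteed
    to score no worse than it.
    """
    if k <= 0 or k > len(bounds):
        raise ValueError("k must be between 1 and the number of trajectories")
    # k-th lowest upper-bound score among all trajectories.
    kth_upper = heapq.nsmallest(k, (ub for _, ub in bounds.values()))[-1]
    candidates = {tid for tid, (lb, _) in bounds.items() if lb <= kth_upper}
    pruned = set(bounds) - candidates
    return candidates, pruned
```

In the full procedure, the surviving candidates would then be split into sub-trajectories and their bounds tightened before the same test is applied again.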

Trajectory Set Partition Algorithm
As the number of trajectories is usually large, it is difficult to process them on a single server. Thus, in this section, we study how to form a reasonable partition that distributes trajectories across a group of servers, making sure that the workload of each server is as similar as possible. In order to achieve this goal, we first form a model to evaluate the running cost of each trajectory.
Specifically, let $T_i$ be a trajectory in the trajectory set $T$, $T_i(C)$ be the number of points we access when evaluating its prediction moment, and $T_i(t)$ be the prediction moment of $T_i$. After $n$ time units have passed, the expected number of times we access $T_i$ is $\frac{n}{W(t) - T_i(t)}$. Accordingly, the expected running cost of maintaining $T_i$ could be calculated via Equation (5). In addition, the expected running cost of maintaining all trajectories could be computed based on Equation (6).
After explaining the cost model, we now formally explain the partition algorithm. Our goal is to make the workload of each server as equal as possible.
Let $S$ be the set of servers. For each server $S_i \in S$, we use $S_i(C)$ to record the current workload of $S_i$, and servers in $S$ are sorted in ascending order by their workload. We use a greedy algorithm to form the partition. Specifically, we scan each trajectory $T_i \in T$, compute its expected running cost via Equation (6), allocate it to the server $S_j$ with the minimal workload, and finally update $S_j(C)$.
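The greedy allocation described above can be sketched in Python as follows. The sketch assumes that the expected running cost of each trajectory has already been computed from the cost model and is passed in as a plain number; the function name `greedy_partition` and the use of a binary heap to keep servers ordered by workload are illustrative choices, not details taken from the original implementation.

```python
import heapq


def greedy_partition(costs, num_servers):
    """Assign each trajectory to the currently least-loaded server.

    costs: dict mapping a trajectory id to its expected running cost.
    Returns (assignment, loads), where assignment maps a trajectory id
    to a server index and loads[i] is the final workload of server i.
    """
    # Min-heap of (current workload, server index).
    heap = [(0.0, server) for server in range(num_servers)]
    heapq.heapify(heap)
    assignment = {}
    for tid, cost in costs.items():
        load, server = heapq.heappop(heap)   # server with minimal workload
        assignment[tid] = server
        heapq.heappush(heap, (load + cost, server))
    loads = [0.0] * num_servers
    for load, server in heap:
        loads[server] = load
    return assignment, loads
```

Trajectories are scanned in the order given, as in the description; sorting them by decreasing cost beforehand would be a common variation but is not something the text states.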

The Incremental Maintenance Algorithms
In this section, we first discuss the incremental algorithm running on each server, i.e., the local incremental maintenance algorithm. Its function is to update the prediction moments of trajectories on each server. Next, we explain the global incremental maintenance algorithm. Its function is to guarantee that the workload of each server is as similar as possible.

The Local Incremental Maintenance Algorithm
In the following, we propose the algorithm INC-PDSP, which explains how PDSP is applied to support the data stream. Here, $I_c$ and $I_q$ are two inverted lists that maintain non-query result trajectories and query result trajectories, respectively. Our algorithm is based on the following observation. The prediction moment of a trajectory is computed under the assumption that trajectories in $I_q$ move far away from $T_q$, which is often not true in real applications. A moment of thought reveals that if $\theta_k$ does not increase rapidly, we can delay updating the prediction moments of trajectories. Here, $\theta_k$ refers to the $k$-th lowest score among all trajectories in the query result set.
We now formally provide the algorithm details. We associate $I_c$ with a variable named $I_c.g$, which records the sum of time units we can delay. Its value is set to 0 at the moment $I_c$ is constructed. Whenever one time unit has passed, we update $I_c.g$ to $I_c.g + \eta$, where $\eta$ is computed based on Equation (7). Here, $\theta_k^0$ refers to the $k$-th highest score among elements in $I_q$ at the last time unit, $U_{max} + \theta_k^0 - D(p_k^1, p_q^1)$ refers to the predicted score of $T_k$ at the current time unit, and $U_{max} + \theta_k^0 - D(p_k^1, p_q^1) - \theta_k$ refers to the difference between the predicted $\theta_k$ and the real $\theta_k$. In addition, we associate each element $T_i \in I_c$ with a variable named $T_i.l$. Its value equals $I_c.g$ at the moment $T_i$'s prediction moment is updated. Whenever one time unit has passed, we first update $I_c.g$ based on Equation (7). Next, we check whether the element $T_i$ located at the top of $I_c$ satisfies $T_i.t \le t_{now} - I_c.g + T_i.l$. If the answer is yes, we update its prediction moment. Lastly, we set $T_i.l$ to $I_c.g$.
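The delayed-update bookkeeping above can be illustrated with a small Python sketch. Several things here are assumptions made for illustration only: the class name `LazyPredictionQueue`, the `predict` callback that stands in for recomputing a prediction moment, and the fact that the per-step delay (the value given by Equation (7)) is supplied by the caller rather than computed. The heap is keyed on the prediction moment minus the delay recorded at the last update, so that the top entry is always the first one to satisfy the check described in the text.

```python
import heapq


class LazyPredictionQueue:
    """Priority queue of non-result trajectories with a shared delay counter."""

    def __init__(self, predict):
        # predict(tid, t_now) -> new prediction moment (hypothetical callback).
        self.predict = predict
        self.g = 0.0      # accumulated delay shared by all entries (I_c.g)
        self.heap = []    # entries (T_i.t - T_i.l, trajectory id)

    def add(self, tid, t):
        # T_i.l is the value of g at insertion time, so the key is t - g.
        heapq.heappush(self.heap, (t - self.g, tid))

    def tick(self, t_now, eta):
        """Advance one time unit; return ids whose prediction was refreshed."""
        self.g += eta
        due = []
        # T_i.t <= t_now - g + T_i.l  is equivalent to  key <= t_now - g.
        while self.heap and self.heap[0][0] <= t_now - self.g:
            due.append(heapq.heappop(self.heap)[1])
        for tid in due:
            self.add(tid, self.predict(tid, t_now))
        return due
```

With a positive delay, an entry whose prediction moment has nominally arrived stays in the queue until the accumulated delay has been used up, which is the saving the observation above aims for.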

The Global Incremental Maintenance Algorithm
We should monitor the workload of each server so as to guarantee the balance of the system. Thus, we update $S_i(C)$ of each server whenever a trajectory $T$ in $S_i$ is processed. Specifically, let $T$ be the trajectory we should evaluate. We first update its prediction moment. Next, we re-calculate its expected running cost and update $S_i(C)$ based on the re-calculated result.
The reason we monitor the workload of each server is that we should re-allocate trajectories if the workload difference among servers becomes too large. In our paper, if $S_{max} > 2S_{min}$, we re-allocate trajectories between the servers $S_{max}$ and $S_{min}$. Here, $S_{max}$ and $S_{min}$ refer to the servers with the maximal and minimal workload in $S$, respectively. When re-allocating, we scan trajectories in $S_{max}$ and move each scanned trajectory from $S_{max}$ to $S_{min}$. When the workload of $S_{max}$ is reduced to less than that of $S_{min}$, the algorithm terminates.
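The re-allocation rule above can be sketched in Python as follows. The data layout (a dictionary of servers, each holding a dictionary of trajectory costs) and the function name `rebalance` are assumptions made for illustration; the trigger condition and the stopping condition follow the description in the text.

```python
def rebalance(servers):
    """Move trajectories from the most loaded to the least loaded server.

    servers: dict mapping a server id to a dict {trajectory id: expected cost}.
    The dictionaries are modified in place. Returns the list of moved ids.
    """
    load = {sid: sum(trajs.values()) for sid, trajs in servers.items()}
    s_max = max(load, key=load.get)
    s_min = min(load, key=load.get)
    moved = []
    # Re-allocate only when the maximal workload exceeds twice the minimal one.
    if load[s_max] <= 2 * load[s_min]:
        return moved
    for tid in list(servers[s_max]):
        # Stop once the donor's workload has dropped below the receiver's.
        if load[s_max] < load[s_min]:
            break
        cost = servers[s_max].pop(tid)
        servers[s_min][tid] = cost
        load[s_max] -= cost
        load[s_min] += cost
        moved.append(tid)
    return moved
```

For example, with one server holding five trajectories of cost 2 and another holding a single trajectory of cost 3, two trajectories are moved and the workloads end at 6 and 7.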

Performance Evaluation
In this section, we conduct extensive experiments to demonstrate the efficiency of the RNDLP framework. The experiments are based on both real datasets and synthetic datasets. In the following, we first explain the datasets used in our experiments and the settings of our experiments and then report our findings.

Experiment Settings
Datasets. In total, four datasets are utilized in our experiments, comprising three real datasets, BEIJING, PORTO, and NYC, and a synthetic dataset named NORMAL. BEIJING is sourced from the Microsoft T-Drive project, encompassing GPS trajectories of 10,357 taxis recorded from 2 February to 8 February 2008. PORTO consists of 1,710,671 trajectories, describing the movements of 442 taxis from 7 January 2013 to 30 June 2014 in the city of Porto. NYC is obtained from the New York City Taxi and Limousine Commission, containing 2.36 GB of trip records. We generate a group of trajectories based on these records. Each record corresponds to a trajectory, where the start point and end point of the trajectory are set to the pick-up point and drop-off point of the record, and the trajectory itself is generated as the shortest path between these two points.
NORMAL is a synthetic trajectory dataset created by simulating the trajectories of moving objects on urban roads. The road networks underlying these four datasets are shown in Table 1. We pre-process these datasets by retaining only four attributes, namely taxi ID, location longitude, location latitude, and timestamp. Note that the running time of our proposed framework is independent of the scale of the graph. The reason is that we use a hash table to maintain the shortest-path distance between every pair of vertices, so the distance between two points over the road network can be calculated at $O(1)$ cost.

Experimental Methodology. In our study, we first load the road network, the hash table, and all trajectories into memory. Next, we scan each trajectory $T_i$ in the trajectory set and compute its score or lower-bound score. After scanning, we assign these trajectories to different servers based on the model discussed before. From then on, we monitor the alarm time of each trajectory and update the scores of trajectories whose alarm time equals the current time unit. When all trajectories are processed, we report the total running time.
Parameters. In our study, we evaluate the performance of different algorithms under four parameters: $N$, $n$, $k$, and $U_{max}$. Here, $N$ denotes the number of trajectories within the trajectory set $T$; $n$ represents the number of GPS points within each trajectory; $k$ is the input parameter for CKTRN; and $U_{max}$ denotes the maximal traveling speed of objects per time unit. The parameter settings are presented in Table 2, with the default values bolded.
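The hash-table distance lookup mentioned above can be illustrated with a minimal Python sketch. It assumes the road network is given as an adjacency dictionary with non-negative edge lengths and precomputes all-pairs shortest-path distances with Dijkstra's algorithm; the function name `all_pairs_distances` and the graph layout are illustrative assumptions, and the original implementation may build its table differently.

```python
import heapq


def all_pairs_distances(graph):
    """Precompute shortest-path distances between every pair of vertices.

    graph: dict mapping a vertex to {neighbour: non-negative edge length}.
    Returns a dict keyed by (source, target), so that each later distance
    query is a single hash-table lookup.
    """
    table = {}
    for source in graph:
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            for v, w in graph.get(u, {}).items():
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    heapq.heappush(heap, (nd, v))
        for target, d in dist.items():
            table[(source, target)] = d
    return table
```

The table needs space quadratic in the number of vertices, which is the price paid for constant-time queries.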

Performance Comparison
Updating Cost Comparison. In this section, we compare the performance of RNDLP with its competitor when supporting CKTRN under a data stream. We present the running time of all algorithms under different $k$ values in Figure 4a-d. Across all evaluated $k$ values, RNDLP consistently outperforms BASE on all four datasets. For instance, the running time of RNDLP is only 0.58% of that of BASE on the PORTO dataset, 0.81% on the BEIJING dataset, and 0.69% on the NORMAL dataset. The notable improvement arises from RNDLP considering the spatio-temporal correlation among GPS points in each trajectory. The corresponding virtual trajectory is closer to the real trajectory, allowing RNDLP to provide trajectories with tight score bounds by accessing a small number of GPS points. In addition, RNDLP uses a hash table to maintain the distance between every pair of vertices over the road network.
Furthermore, we report the running time of different algorithms under various $N$ values in Figure 4e-h. As $N$ increases, the running time of BASE rises sharply because BASE has to access all trajectories whenever the window slides. In contrast, RNDLP is not sensitive to $N$ thanks to the predictive nature of its algorithms. It does not need to maintain $N$ trajectories in real time, resulting in a much more stable performance under various parameter settings. Another important reason is that we use a cost model to partition trajectories across different servers, so the overhead of each server is roughly the same. By contrast, BASE does not consider how to partition trajectories evenly. Thus, in many cases, the distribution of trajectories across servers is skewed, and the total running cost of BASE is higher than that of RNDLP.
The running time of different algorithms under various $n$ values is reported in Figure 4i-l. Here again, RNDLP outperforms its counterpart. Notably, BASE's running time gradually increases with the growth of $n$ since it has to access more GPS points when trajectories' prediction moments are updated. Conversely, RNDLP's running time does not change significantly with increased $n$ values. This is because the larger the $n$, the larger the distance between the starting points and end points of objects' trajectories. In this way, RNDLP could accurately calculate score bounds of trajectories by setting $\tau$ to a small value. In addition, as RNDLP could dynamically adjust the prediction moments of trajectories, it could further reduce the cost of incremental maintenance.

The running time of different algorithms under different $U_{max}$ values is reported in Figure 4m-p. RNDLP again performs better. In addition, BASE is not sensitive to $U_{max}$, because BASE does not consider the speed constraint. The running time of RNDLP gradually increases, but it remains much lower than that of the BASE algorithm. This is because the larger the $U_{max}$, the looser the lower-bound score RNDLP could provide. However, as $U_{max}$ is usually not high in most applications, RNDLP is the most efficient in most cases (Table 3). To sum up, RNDLP is both stable and efficient. It requires less running time than the BASE algorithm to support continuous k-similarity search over road network.

Conclusions
In this paper, we propose a novel framework named RNDLP to support continuous k-similarity trajectories search over road network. The framework can efficiently return query result trajectories by accessing only a few GPS points from a subset of trajectories within the entire trajectory set. Additionally, we propose a pair-based method to enhance algorithm performance. Through the calculation of lower-bound scores for trajectories, we observe that processing a small number of trajectories with each slide of the window is sufficient. As a result, our framework efficiently supports continuous k-similarity trajectories search over data streams. We conducted extensive experiments to evaluate the performance of our proposed algorithms on several datasets. The results consistently demonstrate the superior performance of our proposed algorithms.

Figure 1. Continuous k-similarity trajectory search over road network (k = 1), where (a) denotes the initial case of a given trajectory; (b) denotes the update of the trajectory and query results after one unit of time has elapsed.

Figure 3. Pair-based lower-bound score calculation, where (a) and (b) respectively represent the graphical depiction of the distance from $o_2$ to $p_2^8$ over different time units.

Discussion. Pair-based partition can provide trajectories with tighter lower-bound scores. However, a natural question is how to form proper partitions for different trajectories. Intuitively, if the number of partitions is large, the running cost of maintaining the partition is high, but the lower-/upper-bound score is tight. In contrast, if the number of partitions is small, the maintenance cost is low, but the lower-bound score is loose. It is therefore important to find a flexible method that adaptively partitions trajectories based on their distance relationship to the query trajectory.

Figure 4. Running time comparison of different algorithms under different datasets.

Table 1. Road network information.

Table 2. Parameter settings.

Performance Metrics. The updating time is employed as the main performance metric. It refers to the average time used to process newly generated GPS points.

Competitors. In addition to the algorithm included in the RNDLP framework, we also implement a baseline algorithm named BASE. Note that the BASE algorithm updates the scores of all trajectories whenever one time unit has passed. All the algorithms are implemented in C++, and all the experiments are conducted on a 6226R CPU with 256 GB of memory, running Microsoft Windows 10.

Table 3. Performance analysis of different similarity measures.