Cluster Nested Loop k-Farthest Neighbor Join Algorithm for Spatial Networks

This paper considers k-farthest neighbor (kFN) join queries in spatial networks where the distance between two points is the length of the shortest path connecting them. Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join query retrieves the k data points farthest from each query point in Q. There are many real-life applications using kFN join queries, including artificial intelligence, computational geometry, information retrieval, and pattern recognition. However, the solutions based on the Euclidean distance or nearest neighbor search are not suitable for our purpose due to the difference in the problem definition. Therefore, this paper proposes a cluster nested loop join (CNLJ) algorithm, which clusters query points (data points) into query clusters (data clusters) and reduces the number of kFN queries required to perform the kFN join. An empirical study was performed using real-life roadmaps to confirm the superiority and scalability of the CNLJ algorithm compared to the conventional solutions in various conditions.


Introduction
In this study, we investigate the efficient processing of k-farthest neighbor (kFN) join queries in spatial networks, where the distance between two points is defined by the length of the shortest path connecting them. Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join combines each query point q in Q with the k data points in P that are farthest from q. The kFN join query has real-life applications in recommender systems, where farthest neighbors can increase the variety of recommendations [1,2]. Farthest neighbor search is also an element in clustering applications [3], complete linkage clustering [4], and nonlinear dimensionality reduction algorithms [5]. Thus, being able to quickly process kFN join queries is an important practical concern for many applications [6][7][8][9][10][11][12][13][14]. Figure 1 shows an example of the kFN join between a set Q of query points and a set P of data points in a spatial network, where k = 1, Q = {q1, q2, q3}, and P = {p1, p2, p3, p4} are given. In this paper, the kFN join is denoted as Q ⋈kFN P. In this example, the data points farthest from q1, q2, and q3 are p2, p2, and p3, respectively, which can be represented by Q ⋈kFN P = {⟨q1, p2⟩, ⟨q2, p2⟩, ⟨q3, p3⟩}. Conversely, the query points farthest from p1, p2, p3, and p4 are q1, q1, q3, and q1, respectively, which can be represented by P ⋈kFN Q = {⟨p1, q1⟩, ⟨p2, q1⟩, ⟨p3, q3⟩, ⟨p4, q1⟩}. This shows that the kFN join is not commutative, i.e., Q ⋈kFN P ≠ P ⋈kFN Q. Note that this study considers Q ⋈kFN P. The facility location problem, which determines the competitive location of a new facility such as a garbage incinerator, crematorium, chemical plant, supermarket, or police station, is an important real-life application of kFN join queries.
Particularly, determining the optimal facility location is still an open problem [15,16]. Facing such a research problem, efficiently evaluating the kFN join query is remarkably useful. Assume that query points q1 through q3 represent unpleasant facilities such as garbage incinerators and chemical plants, whereas data points p1 through p4 represent available rental apartments. This example can be phrased as an FN join between a set Q of unpleasant facilities and a set P of available rental apartments: "find ordered pairs of an unpleasant facility q and an available rental apartment p such that the rental apartment p is farther from the unpleasant facility q than the other available rental apartments." Naturally, p2 or p3 may be the competitive apartment in terms of the distance to unpleasant facilities. The kFN join query must repeatedly compute the distances between each pair of data and query points, which leads to a long query processing time. A simple solution to the kFN join query between a query set Q and a dataset P repeatedly scans all data points in P for each query point in Q to compute the distance between each pair ⟨q, p⟩ of query and data points. This simple solution is unacceptable in most cases because it retrieves candidate data points for each query point separately. It may, however, be acceptable when the query points are uniformly distributed throughout the region. Despite their importance, kFN join queries have not received adequate attention for spatial networks. This paper proposes a cluster nested loop join (CNLJ) algorithm for spatial networks to process kFN join queries efficiently. Specifically, query points (data points) are clustered into query clusters (data clusters) using the spatial network connectivity. The CNLJ algorithm exploits shared computation within query clusters to avoid unnecessary computations of the distances between data and query points.
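The simple nested-loop solution described above can be sketched as follows. The toy road network, its edge weights, and the point names are illustrative assumptions (they are not the networks used in the paper); distances come from one Dijkstra traversal per query point.

```python
import heapq

def shortest_dist(graph, src):
    """Dijkstra from src; graph maps vertex -> list of (neighbor, weight)."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def naive_kfn_join(graph, Q, P, k):
    """For each query point q, scan every data point and keep the k farthest."""
    result = {}
    for q in Q:
        d = shortest_dist(graph, q)  # one graph traversal per query point
        result[q] = sorted(P, key=lambda p: d[p], reverse=True)[:k]
    return result

# Toy network (hypothetical weights); vertices double as query/data points.
G = {"q1": [("p1", 2), ("p2", 9)],
     "p1": [("q1", 2), ("p2", 4)],
     "p2": [("q1", 9), ("p1", 4)]}
print(naive_kfn_join(G, ["q1"], ["p1", "p2"], k=1))  # {'q1': ['p2']}
```

Here dist(q1, p2) = 6 via p1 exceeds dist(q1, p1) = 2, so p2 is q1's farthest neighbor; the cost of one traversal per query point is exactly what the CNLJ algorithm later avoids.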
The CNLJ algorithm has several advantages over the traditional solution: (1) it clusters query points (data points) using the spatial network connection for the shared computation, (2) it quickly retrieves candidate data points at once for clustered query points, and (3) it does not retrieve candidate data points for each query point separately. To the best of our knowledge, this is the first attempt to study kFN join queries for spatial networks.
The primary contributions of this study are listed as follows:
• This paper presents a cluster nested loop join algorithm for quickly evaluating kFN join queries in spatial networks. The CNLJ algorithm clusters query points before retrieving candidate data points for the clustered query points all at once. As a result, it does not retrieve candidate data points for each query point multiple times.
• The CNLJ algorithm's correctness is demonstrated through mathematical reasoning. In addition, a theoretical analysis is provided to clarify the benefits and drawbacks of the CNLJ algorithm concerning the spatial compactness of query points.
• An empirical study with various setups was conducted to demonstrate the superiority and scalability of the CNLJ algorithm. According to the results, the CNLJ algorithm outperforms the conventional join algorithms by up to 50.8 times.
The remainder of this paper is organized as follows: Section 2 reviews related research and provides some background knowledge. Section 3 describes the clustering of query points (data points) and the computing of the maximum and minimum distances between a border point and a data cluster. Section 4 presents the CNLJ algorithm for rapidly evaluating kFN join queries in spatial networks. Section 5 presents the results of experiments using the CNLJ and conventional join algorithms with different setups. Finally, the conclusions of this study are discussed in Section 6.

Background
Section 2.1 presents related works and Section 2.2 defines the terms and notations used in this study.

Related Work
Many studies have considered spatial queries based on the farthest neighbor (FN) search [6][7][8][9][10][11]13,14,[17][18][19][20]. Korn and Muthukrishnan [21] pioneered the concept of a reverse farthest neighbor (RFN) query to obtain the weak influence set. Given a set of data points P and a query point q, the RFN query retrieves the set of data points p ∈ P such that q is their farthest neighbor among all points in P − {q}. This is the monochromatic RFN (MRFN) query [8,9,13,14,19]. Another version of the RFN query is the bichromatic reverse farthest neighbor (BRFN) query [10,13,14,22]. Given a set of data points P, a set of query points Q, and a query point q in Q, the BRFN query retrieves the set of data points p in P such that q is the farthest neighbor of p among all query points in Q. Many studies have been conducted to process RFN queries in the Euclidean space [8,9,14,19,22] and in spatial networks [10,13]. Yao et al. [14] proposed progressive farthest cell and convex hull farthest cell algorithms to answer RFN queries using an R-tree [23,24]. A solution to answer reverse kFN queries in the Euclidean space was presented for arbitrary values of k [22]. Liu et al. [19] proposed the concept of the group RkFN query in the obstacle space and presented a query optimization algorithm based on the Voronoi graph. Tran et al. [10] proposed a solution for RFN and RkFN queries in road networks by using Voronoi-diagram-related attributes and Dijkstra's algorithm. Xu et al. [13] presented efficient algorithms based on landmarks and hierarchical partitioning to process monochromatic and bichromatic RFN queries in spatial networks. The approximate version of the problem, known as the c-approximate farthest neighbor (c-AFN) search, has been actively studied due to the difficulty of designing an efficient method for exact FN search in high-dimensional space [6,17,18,25]. Huang et al. 
[18,25] introduced a new concept of reverse locality-sensitive hashing (LSH) family and developed reverse query-aware LSH functions. They proposed two hashing schemes for high-dimensional c-AFN search over external memory. Liu et al. [17] developed an approximate algorithm with a theoretical guarantee for high-dimensional c-AFN search over external memory. Curtin et al. [6] proposed an algorithm with an absolute approximation guarantee for the FN search in high-dimensional space. To estimate the difficulty of the FN search problem, an information-theoretical measure of hardness was presented [6]. Farthest dominated location (FDL) queries were proposed in [26]. Given a set of data points P with spatial and nonspatial attributes, a set L of candidate locations, and a design competence vector Ψ for L, an FDL query retrieves the location s ∈ L such that the distance to its nearest dominating object in P is maximized. Gao et al. [7] studied aggregate k-farthest neighbor (AkFN) queries that are defined by aggregation functions, such as min, max, and sum, and presented the MB and BF algorithms based on the R-tree [23,24]. Given a set of data points P and a set of query points Q, an AkFN query retrieves the k data points in P with the largest aggregate distances to all query points in Q. In spatial networks, effective solutions to AkFN queries were proposed [11].
Due to the differences in the properties of the shortest path distance and the Euclidean distance, existing solutions based on the Euclidean space cannot be used directly to answer kFN join queries in spatial networks. The existing solutions for nearest neighbor search [27][28][29] cannot readily be used to address farthest neighbor search problems, owing to the different distance properties of farness and nearness. Although the group computation of spatial queries has received considerable attention [19,27,[30][31][32][33][34], group computation has not been applied to kFN join queries in spatial networks. To efficiently process kFN join queries in spatial networks, new sophisticated algorithms must be developed, because the problem is challenging in several respects. First, the kFN join is a costly operation by definition. Second, farthest neighbor search is more difficult than nearest neighbor search. Finally, designing index structures that effectively support the FN search for spatial networks is difficult. Table 1 compares our problem scenario to existing studies in terms of the space domain, query type, and data type.

Notation and Formal Problem Description
Query and data points are placed in a spatial network G and these points represent points of interest (POIs), as shown in Figure 1. Given two points q and p, dist(q, p) is the length of the shortest path between q and p in G. Table 2 summarizes the symbols used in this study.

Symbol | Definition
k | Number of requested FNs
Q and q | Set of query points and a query point q in Q, respectively
P and p | Set of data points and a data point p in P, respectively
v_l v_{l+1} ··· v_m | Vertex sequence in which v_l and v_m are either an intersection vertex or a terminal vertex and the other vertices, v_{l+1}, ..., v_{m−1}, are intermediate vertices
q_i q_{i+1} ··· q_j | Query segment connecting query points q_i, q_{i+1}, ..., q_j in a vertex sequence (in short, q_i q_j)
p_l p_{l+1} ··· p_m | Data segment connecting data points p_l, p_{l+1}, ..., p_m in a vertex sequence (in short, p_l p_m)
Q_C and P_C | Set of query segments and set of data segments, respectively
Q̄ and P̄ | Set of query clusters and set of data clusters, respectively
B(Q_C) and B(P_C) | Sets of border points of Q_C and P_C, respectively
b_q and b_p | Border points of Q_C and P_C, respectively
Ω(q) | Set of k data points farthest from a query point q
dist(q, p) | Length of the shortest path connecting points q and p
len(qp) | Length of the segment qp

Definition 1. kFN search [6-11,13,14]. Given a positive integer k, a query point q, and a set P of data points, the kFN query returns a set of k data points, denoted as Ω(q), such that dist(q, p+) ≥ dist(q, p−) holds for ∀p+ ∈ Ω(q) and ∀p− ∈ P − Ω(q).
Definition 2. kFN join. Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join query, denoted as Q ⋈kFN P, returns ordered pairs of each query point q in Q and the set of k data points farthest from q. For simplicity, Q ⋈kFN P is abbreviated to Q ⋈ P, which is formally defined by Q ⋈ P = {⟨q, Ω(q)⟩ | ∀q ∈ Q}. Note that the kFN join is not commutative, i.e., Q ⋈ P ≠ P ⋈ Q.
Definition 3. Spatial network [32,33,[36][37][38]. A weighted undirected graph G = (V, E, W) is used to represent a spatial network, where V, E, and W represent the vertex set, edge set, and edge distance matrix, respectively. Each edge has a non-negative weight that indicates the network distance.

Definition 5. Vertex sequence, query segment, and data segment. A vertex sequence v_l v_{l+1} ··· v_m denotes a path between two vertices, v_l and v_m, such that v_l and v_m are either an intersection vertex or a terminal vertex, and the other vertices in the path, v_{l+1}, ..., v_{m−1}, are intermediate vertices.
A query segment q_i q_{i+1} ··· q_j denotes a line segment connecting query points q_i, q_{i+1}, ..., q_j, and a data segment p_l p_{l+1} ··· p_m denotes a line segment connecting data points p_l, p_{l+1}, ..., p_m. For simplicity, q_i q_{i+1} ··· q_j and p_l p_{l+1} ··· p_m are abbreviated to q_i q_j and p_l p_m, respectively.

Clustering Points and Computing Distances
In Section 3.1, we group query points (data points) using the spatial network connectivity. We calculate the maximum and minimum distances between a border point and a data cluster in Section 3.2. Figure 2 illustrates an example of the kFN join Q ⋈ P, where k = 2, Q = {q1, q2, q3, q4}, and P = {p1, p2, ..., p6} are given. The example kFN join query requires that each query point q in Q finds the two data points farthest from q. Figure 3 shows an example of the two-step clustering method that groups nearby query points into query clusters. In the first step, a query segment is created by connecting the query points in a vertex sequence. In Figure 3a, query points q1 and q2 in a vertex sequence q1 q2 v2 are connected to become q1 q2. Thus, three query segments q1 q2, q3, and q4 are generated, as shown in Figure 3a. In the second step, adjacent query segments are connected through intersection vertices to form a query cluster. In Figure 3b, the intersection vertex q1 connects the two query segments q1 q2 and q4. Similarly, q3 and q4 are linked by the intersection vertex v1. Finally, q1 q2 and q3 are linked by the intersection vertex v2. As a result, the three query segments q1 q2, q3, and q4 are linked to form a query cluster {q1 q2 v2, q1 q4 v1, v1 q3 v2}. Note that a query cluster is a set of query segments. Naturally, the set of query points Q = {q1, q2, q3, q4} is converted into a set of query clusters Q̄ = {{q1 q2 v2, q1 q4 v1, v1 q3 v2}}. Let us define a border point for a query cluster Q_C: when a query cluster Q_C and its nonquery complement G − Q_C meet at a point, that point is referred to as a border point of Q_C. In this example, three border points q1, v1, and v2 are found for Q_C. Figure 4 shows an example of the two-step clustering method that groups neighboring data points into data clusters. Notably, the query and data points are clustered using the same two-step method.
In the first step, data points p1, p2, and p3 in a vertex sequence v1 p2 v3 are connected to become a data segment p1 p2 p3. Similarly, data points p4 and p5 in a vertex sequence p5 p4 q1 are linked to form a data segment p4 p5. As a result, three data segments p1 p2 p3, p4 p5, and p6 are generated, as illustrated in Figure 4a. In the second step, the two data segments p4 p5 and p6 are joined by the intersection vertex p5 to form a data cluster {p4 p5, p5 p6}. As a result, the set of data points P = {p1, p2, ..., p6} is transformed into a set of data clusters P̄ = {{p1 p2 p3}, {p4 p5, p5 p6}}.
Figure 4. Two-step clustering method to group nearby data points into a data cluster: (a) converting data points into data segments; (b) converting data segments into data clusters.
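The second clustering step (merging segments that meet at a common vertex) can be sketched with a union-find over segment endpoints. This is only an illustration: the segment identifiers and endpoint vertices below are assumptions loosely modeled on the Figure 4 example, and step 1 (joining consecutive points along a vertex sequence into one segment) is assumed to have been done already.

```python
def two_step_clustering(segments):
    """Group segments that share an endpoint vertex into clusters.
    `segments` maps a segment id to its pair of endpoint vertices."""
    parent = {s: s for s in segments}

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    endpoint_owner = {}               # vertex -> first segment seen there
    for seg, (u, v) in segments.items():
        for vertex in (u, v):
            if vertex in endpoint_owner:
                union(seg, endpoint_owner[vertex])  # segments meet here
            else:
                endpoint_owner[vertex] = seg

    clusters = {}
    for seg in segments:
        clusters.setdefault(find(seg), set()).add(seg)
    return list(clusters.values())

# Hypothetical endpoints mimicking Figure 4: p4p5 and p5p6 meet at p5.
segs = {"p1p2p3": ("v1", "v3"), "p4p5": ("p5", "q1"), "p5p6": ("p5", "v4")}
print(two_step_clustering(segs))  # [{'p1p2p3'}, {'p4p5', 'p5p6'}]
```

As in the running example, the two segments sharing the vertex p5 merge into one data cluster while p1 p2 p3 remains a singleton cluster.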

Computing Maximum and Minimum Distances from a Border Point to a Data Cluster
The maximum and minimum distances between a border point b_q and a data cluster P_C are computed in this section. The minimum and maximum distances between b_q and P_C are formally defined by mindist(b_q, P_C) = min{dist(b_q, p) | p ∈ P_C} and maxdist(b_q, P_C) = max{dist(b_q, p) | p ∈ P_C}, respectively. The minimum distance between b_q and P_C can easily be calculated by mindist(b_q, P_C) = min{dist(b_q, b_p) | b_p ∈ B(P_C)}, where b_p is a border point of the data cluster P_C. The maximum distance between b_q and P_C can be represented by maxdist(b_q, P_C) = max{maxdist(b_q, p_l p_m) | p_l p_m ∈ P_C}, where maxdist(b_q, p_l p_m) is the maximum distance between b_q and a data segment p_l p_m in P_C.
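A minimal sketch of these two bounds, assuming the border-to-border distances dist(b_q, b_p) and the per-segment maxima maxdist(b_q, p_l p_m) have already been obtained from a graph search; all names and numeric values are illustrative.

```python
def mindist(dist_to_borders, cluster_border_points):
    """mindist(b_q, P_C): the shortest way into the data cluster passes
    through one of its border points, so take the minimum over B(P_C)."""
    return min(dist_to_borders[bp] for bp in cluster_border_points)

def maxdist(segment_maxima):
    """maxdist(b_q, P_C): the largest per-segment maximum distance
    maxdist(b_q, p_l p_m) over the segments of the data cluster."""
    return max(segment_maxima)

# Hypothetical values: b_q reaches the cluster's border points bp1/bp2 at
# distances 7 and 12, and the cluster's two segments peak at 15 and 28.
print(mindist({"bp1": 7, "bp2": 12}, ["bp1", "bp2"]))  # 7
print(maxdist([15, 28]))                               # 28
```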

Cluster Nested Loop Join Algorithm for Spatial Networks
The CNLJ algorithm is described in Section 4.1. Section 4.2 explains how kFN queries are evaluated at the border points of query clusters. Finally, the example kFN join query is evaluated in Section 4.3.

Cluster Nested Loop Join Algorithm
The CNLJ algorithm is described in Algorithm 1, which involves two steps. The two-step clustering method (lines 2-4), which is described in Section 3.1, is used to group nearby query points (data points) into query clusters (data clusters) in the first step. In the second step, the kFN join is performed for each query cluster in Q (lines 5-8). Finally, the kFN join result Ω(Q) is returned to a query user when the kFN join is complete for each query cluster in Q (line 9).

Algorithm 1 CNLJ(k, Q, P).
Input: k: number of FNs for q, Q: set of query points, and P: set of data points
Output: Ω(Q): set of ordered pairs of each query point q in Q and the set of k FNs for q, i.e., Ω(Q) = {⟨q, Ω(q)⟩ | q ∈ Q}
1: Ω(Q) ← ∅ // The result set Ω(Q) is initialized to the empty set.
2: // Step 1: Query and data points are clustered, as presented in Section 3.1.
3: Q̄ ← two_step_clustering(Q) // Query points are grouped into query clusters.
4: P̄ ← two_step_clustering(P) // Data points are grouped into data clusters.
5: // Step 2: The kFN join is performed for each query cluster in Q̄, as presented in Algorithm 2.
6: for each query cluster Q_C ∈ Q̄ do
7:   Ω(Q_C) ← kFN_join(k, Q_C, P̄) // Ω(Q_C) = {⟨q, Ω(q)⟩ | q ∈ Q_C}.
8:   Ω(Q) ← Ω(Q) ∪ Ω(Q_C) // The kFN join result for Q_C is added to Ω(Q).
9: return Ω(Q) // Ω(Q) is returned once the kFN join for every query cluster in Q̄ is complete.
Algorithm 2 describes the kFN join algorithm for a query cluster Q_C. First, kFN queries are evaluated at the border points of Q_C to collect the candidate data points for the query points in Q_C (lines 4-7), as described in Algorithm 3. Then, each query point q in Q_C retrieves the k FNs for q among the candidate data points in Ω(B(Q_C)) (lines 8-11), as detailed in Algorithm 4. Finally, the kFN join result Ω(Q_C) for the query points in Q_C is returned after each query point q in Q_C retrieves the k FNs for q from the candidate data points (line 12).
Algorithm 2 kFN_join(k, Q_C, P̄).
Input: k: number of FNs for q, Q_C: query cluster, and P̄: set of data clusters
Output: Ω(Q_C): set of ordered pairs of each query point q in Q_C and the set of k FNs for q, i.e., Ω(Q_C) = {⟨q, Ω(q)⟩ | q ∈ Q_C}
1: Ω(Q_C) ← ∅ // The result set for query points in Q_C is initialized to the empty set.
2: // l indicates the maximum distance between border points in Q_C.
3: l ← max{dist(b_qi, b_qj) | b_qi, b_qj ∈ B(Q_C)}
4: // Step 1: A kFN query is evaluated at each border point b_q in Q_C.
5: for each border point b_q ∈ B(Q_C) do
6:   // The kFN query is evaluated at b_q, which is detailed in Algorithm 3.
7:   Ω(B(Q_C)) ← Ω(B(Q_C)) ∪ find_candidates(k, l, b_q, P̄) // Ω(B(Q_C)) collects candidate data points for query points in Q_C.
8: // Step 2: Each query point q retrieves k FNs among the candidate data points in Ω(B(Q_C)).
9: for each query point q ∈ Q_C do
10:   Ω(q) ← retrieve_kFN(k, q, Ω(B(Q_C))) // Ω(B(Q_C)) is the set of candidate data points for q.
11:   Ω(Q_C) ← Ω(Q_C) ∪ {⟨q, Ω(q)⟩} // The pair of q and its k FNs is added to the result set.
12: return Ω(Q_C) // Ω(Q_C) is returned after each query point in Q_C retrieves its k FNs.
Algorithm 3 describes the kFN query processing algorithm for finding candidate data points at a border point b_q of a query cluster. Note that the kFN query result for b_q includes candidate data points for the query points in Q_C. The set of k FNs for b_q, Ω(b_q), is initialized to the empty set (line 1). The third argument l indicates the maximum distance between border points in Q_C, i.e., l ← max{dist(b_qi, b_qj) | b_qi, b_qj ∈ B(Q_C)}. The sentinel distance, initialized to sntl_dist ← 0, determines whether a data point p is a candidate point for Q_C. The maximum and minimum distances from b_q to the data clusters in P̄ are computed, as described in Section 3.2. The data clusters are then sorted in decreasing order of their maximum distance to b_q and explored sequentially. If the maximum distance from b_q to the data cluster P_C to be explored next is smaller than the sentinel distance, i.e., maxdist(b_q, P_C) < sntl_dist, the remaining unexplored data clusters do not need to be considered, because the data points in these data clusters cannot be candidate data points for any query point in Q_C.
Otherwise (i.e., maxdist(b_q, P_C) ≥ sntl_dist), each data point p in P_C is examined to determine whether p is a candidate point for the query points in Q_C. For this, dist(b_q, p) is computed. If b_q is inside P_C, the distance from b_q to p is simply computed using a graph search algorithm such as Dijkstra's algorithm [39]. Otherwise (i.e., if b_q is outside P_C), the distance evaluates to dist(b_q, p) ← min{dist(b_q, b_p) + dist(b_p, p) | b_p ∈ B(P_C)}. Note that b_q is a border point of Q_C and b_p is a border point of P_C. This is because, if b_q is outside P_C, the shortest path from b_q to p must pass through a border point b_p of P_C, i.e., b_q → b_p → p. If dist(b_q, p) ≥ sntl_dist, then p is added to Ω(b_q) as a candidate data point for Q_C. Redundant data points may be included in Ω(b_q), and they should be removed. Thus, each data point p in Ω(b_q) is explored to verify that p is qualified to be a candidate data point, i.e., dist(b_q, p) ≥ sntl_dist. If p does not satisfy this qualification, it is removed from Ω(b_q). Finally, if the maximum distance from b_q to the data cluster P_C is smaller than the sentinel distance (lines 10-12) or if every data cluster has been examined, the kFN query result for b_q, Ω(b_q), is returned.

Algorithm 3 find_candidates(k, l, b_q, P̄).
Input: k: number of FNs for q, l: maximum distance between border points in Q_C, b_q: border point of Q_C, and P̄: set of data clusters
Output: Ω(b_q): set of k FNs for b_q
1: Ω(b_q) ← ∅ // The set of k FNs for the border point b_q is initialized to the empty set.
2: sntl_dist ← 0 // The sentinel distance sntl_dist is initialized to 0.
3: // The maximum and minimum distances from b_q to the data clusters in P̄ are computed, as explained in Section 3.2.
4: for each data cluster P_C ∈ P̄ do
5:   compute maxdist(b_q, P_C) and mindist(b_q, P_C)
6: // The data clusters in P̄ are sorted in decreasing order of their maximum distance to b_q.
7: P̄ ← sort_data_clusters(P̄) // P̄ contains the sorted data clusters for b_q.
8: // Data clusters are explored sequentially.
9: for each sorted data cluster P_C ∈ P̄ do
10:   if maxdist(b_q, P_C) < sntl_dist then
11:     // Note that sntl_dist is updated as shown in line 24.
12:     Go to line 26 // The remaining data clusters do not need to be explored.
13:   // Each data point p in P_C is sequentially explored to find the k FNs for b_q.
14:   for each data point p ∈ P_C do
15:     // dist(b_q, p) is computed using the following two cases: b_q ∈ P_C and b_q ∉ P_C.
16:     if b_q is inside P_C then
17:       dist(b_q, p) is computed using a graph search algorithm such as Dijkstra's algorithm [39]
18:     else
19:       dist(b_q, p) ← min{dist(b_q, b_p) + dist(b_p, p) | b_p ∈ B(P_C)}
20:     if dist(b_q, p) ≥ sntl_dist then
21:       Ω(b_q) ← Ω(b_q) ∪ {p} // p is added as a candidate data point for Q_C.
22:     // The sentinel distance is updated once at least k candidates have been collected.
23:     if |Ω(b_q)| ≥ k then
24:       sntl_dist ← dist(b_q, p_kth) − l // p_kth is the current kth FN of b_q.
25: // Redundant data points are removed from Ω(b_q) because they cannot be kFNs of any query point in Q_C.
26: for each data point p ∈ Ω(b_q) do
27:   if dist(b_q, p) < sntl_dist then
28:     remove p from Ω(b_q) // p is not a candidate data point for Q_C.
29: return Ω(b_q) // Ω(b_q) is returned after candidate data points are collected for the query points in Q_C.
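The pruning logic of Algorithm 3 can be sketched in Python as follows. All border-to-point distances and per-cluster maximum distances are assumed to be precomputed, and the data layout (dictionaries keyed by point and cluster ids) is an illustrative choice, not the paper's implementation.

```python
import heapq

def find_candidates(k, l, dist_from_bq, data_clusters):
    """Scan data clusters in decreasing order of maxdist and stop as soon
    as maxdist falls below the sentinel distance dist(b_q, p_kth) - l.
    dist_from_bq: data point -> dist(b_q, p);
    data_clusters: cluster id -> (maxdist, list of data points)."""
    candidates = {}          # data point -> dist(b_q, p)
    kth_heap = []            # min-heap holding the k largest distances seen
    sntl_dist = 0.0
    ordered = sorted(data_clusters.values(), key=lambda c: c[0], reverse=True)
    for max_d, points in ordered:
        if max_d < sntl_dist:
            break            # remaining clusters cannot contain candidates
        for p in points:
            d = dist_from_bq[p]
            if d >= sntl_dist:
                candidates[p] = d
            heapq.heappush(kth_heap, d)
            if len(kth_heap) > k:
                heapq.heappop(kth_heap)     # keep only the k largest
            if len(kth_heap) == k:
                sntl_dist = kth_heap[0] - l  # dist(b_q, p_kth) - l
    # Drop points that no longer satisfy the sentinel condition (lines 26-28).
    return {p: d for p, d in candidates.items() if d >= sntl_dist}
```

With the running example's values for the border point q1 (l = 5; dist(q1, p1) = 24, dist(q1, p2) = 25, dist(q1, p3) = 27; and maxdist 11 for the second data cluster), this sketch returns p1, p2, and p3 as candidates with a final sentinel distance of 20 and never touches the second cluster, matching Section 4.2.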
Algorithm 4 describes how a query point q in Q_C retrieves the k FNs for q among the candidate data points in Ω(B(Q_C)). First, Ω(q) is initialized to the empty set. The distance between q and a candidate data point p is computed using the following two cases: p ∈ Q_C and p ∉ Q_C. If p is inside Q_C, the distance from q to p is simply computed using a graph search algorithm [39]. Otherwise (i.e., p ∉ Q_C), the distance evaluates to dist(q, p) ← min{dist(q, b_q) + dist(b_q, p) | b_q ∈ B(Q_C)}. This is because the shortest path from q to p must pass through a border point b_q of Q_C, i.e., q → b_q → p. When dist(q, p) has been computed, the following two conditions are checked to determine whether the data point p is added to Ω(q): if the cardinality of Ω(q) is smaller than k, i.e., |Ω(q)| < k, then p is simply added to Ω(q); otherwise, if p is farther from q than the current kth FN p_kth of q, i.e., dist(q, p) > dist(q, p_kth), then p is added to Ω(q) and p_kth is removed from Ω(q). Finally, when the exploration of every candidate data point is complete, the kFN query result for q, Ω(q), is returned.
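The candidate-ranking step just described can be sketched as follows. The single border point and its distances are hypothetical numbers chosen so that dist(q, p1) = 25, dist(q, p2) = 26, and dist(q, p3) = 25, mirroring query point q2 in the running example; ties between equidistant candidates are broken arbitrarily here.

```python
def retrieve_kfn(k, dist_q_to_borders, dist_border_to_p, candidates):
    """Rank candidates for one query point q outside whose cluster they lie.
    dist(q, p) is the minimum over border points b_q of
    dist(q, b_q) + dist(b_q, p); the per-border distances to each
    candidate are assumed to have been computed by Algorithm 3."""
    def dist_qp(p):
        return min(dist_q_to_borders[b] + dist_border_to_p[b][p]
                   for b in dist_q_to_borders)
    # Keep the k candidates with the largest network distance to q.
    return sorted(candidates, key=dist_qp, reverse=True)[:k]

# Hypothetical decomposition: one border point "b" with dist(q, b) = 3.
fns = retrieve_kfn(2, {"b": 3},
                   {"b": {"p1": 22, "p2": 23, "p3": 22}},
                   ["p1", "p2", "p3"])
print(fns)  # two farthest candidates; p2 (distance 26) ranks first
```

Sorting all candidates is the simplest formulation; Algorithm 4 achieves the same result incrementally by maintaining the current kth FN p_kth.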

Algorithm 4 retrieve_kFN(k, q, Ω(B(Q_C))).
Input: k: number of FNs for q, q: query point in Q_C, and Ω(B(Q_C)): set of candidate data points for q
Output: Ω(q): set of k FNs for q
1: Ω(q) ← ∅ // Ω(q) is initialized to the empty set.
2: // Ω(B(Q_C)) is the set of candidate data points for q.
3: for each candidate data point p ∈ Ω(B(Q_C)) do
4:   if p is inside Q_C then
5:     dist(q, p) is computed using a graph search algorithm such as Dijkstra's algorithm [39]
6:   else
7:     dist(q, p) ← min{dist(q, b_q) + dist(b_q, p) | b_q ∈ B(Q_C)} // Note that dist(b_q, p) was computed in Algorithm 3.
8:   // p is added to Ω(q) if it satisfies either of the two conditions below.
9:   if |Ω(q)| < k then
10:     Ω(q) ← Ω(q) ∪ {p}
11:   else if |Ω(q)| = k and dist(q, p) > dist(q, p_kth) then
12:     Ω(q) ← Ω(q) − {p_kth} ∪ {p} // Note that p_kth is the current kth FN of q.
13: return Ω(q) // Ω(q) is returned after every candidate data point has been explored.
Lemma 1 proves that a query point q in a query cluster Q_C can retrieve the k FNs for q among the candidate data points in Ω(B(Q_C)).
Lemma 1. Each query point q in a query cluster Q_C can retrieve the k FNs for q among the candidate data points in Ω(B(Q_C)).
Proof. Lemma 1 is proved by contradiction. Assume that there is a qualified data point p in Ω(q) that does not belong to Ω(B(Q_C)), i.e., p ∈ Ω(q) and p ∉ Ω(B(Q_C)). The qualified data point p is farther from q than the kth FN p_kth of a border point b_q of Q_C, which means that dist(q, p) > dist(q, p_kth). According to Algorithm 3, it holds that dist(q, b_q) ≤ l and dist(b_q, p_kth) − dist(b_q, p) > l, where l = max{dist(b_qi, b_qj) | b_qi, b_qj ∈ B(Q_C)}. Thus, the distance from q to p via b_q is smaller than the distance from b_q to p_kth, i.e., dist(q, b_q) + dist(b_q, p) < dist(b_q, p_kth), because dist(q, b_q) ≤ l and dist(b_q, p_kth) > dist(b_q, p) + l. Clearly, dist(q, b_q) + dist(b_q, p) < dist(b_q, p_kth) implies that dist(q, p) < dist(q, p_kth), which contradicts the assumption that dist(q, p) > dist(q, p_kth). Therefore, each query point q in a query cluster Q_C can retrieve the k FNs for q among the candidate data points in Ω(B(Q_C)).
The CNLJ and nonclustering join algorithms for spatial networks have different time complexities, as shown in Table 4. Notably, the CNLJ algorithm is orthogonal to the underlying kFN query processing algorithm, which can easily be incorporated into the CNLJ algorithm. For simplicity, the simple solution for finding the k FNs of a single query point is used in this analysis. The time complexity of processing one kFN query is O(|E| + |V|·log|V| + |P|·log|P|). The CNLJ algorithm evaluates at most M·|Q̄| kFN queries, where Q̄ denotes the set of query clusters and M is the maximum number of border points of a query cluster, i.e., M = max{|B(Q_C)| | Q_C ∈ Q̄}. The nonclustering join algorithm evaluates |Q| kFN queries because the kFN queries for the query points are evaluated one by one. Thus, the time complexities of the CNLJ and nonclustering join algorithms are O(M·|Q̄|·(|E| + |V|·log|V| + |P|·log|P|)) and O(|Q|·(|E| + |V|·log|V| + |P|·log|P|)), respectively. These theoretical results imply that the CNLJ algorithm runs faster than the nonclustering join algorithm, particularly when |Q̄| ≪ |Q|, i.e., when the query points are densely clustered. They also imply that the CNLJ algorithm exhibits performance similar to that of the nonclustering join algorithm when |Q̄| ≈ |Q|, i.e., when the query points are not clustered.

Evaluating kFN Queries at Border Points
The CNLJ algorithm evaluates kFN queries at the border points of query clusters. For the example kFN join query, the CNLJ algorithm evaluates kFN queries at the border points q1, v1, and v2 rather than at the query points q1, q2, q3, and q4. First, the kFN query is evaluated at the border point q1. The maximum and minimum distances between q1 and each data cluster in P̄ are computed, and the data clusters are sorted in descending order of their maximum distance to q1. As shown in Figure 11, the two data clusters {p1 p2 p3} and {p4 p5, p5 p6} are arranged by their maximum distance to q1 as follows: P̄ = {{p1 p2 p3}, {p4 p5, p5 p6}}. This is because maxdist(q1, {p1 p2 p3}) = 28 and maxdist(q1, {p4 p5, p5 p6}) = 11, as described in Table 3. The border point q1 investigates {p1 p2 p3} followed by {p4 p5, p5 p6}. After exploring {p1 p2 p3}, q1 selects p2 and p3 as its two FNs because dist(q1, p1) = 24, dist(q1, p2) = 25, and dist(q1, p3) = 27, as shown in Figure 5. The sentinel distance for q1 is sntl_dist = 20, because the maximum distance l between the border points in Q_C is l = dist(q1, v1) = 5, whereas the distance from q1 to its second FN p2 is dist(q1, p2) = 25. Thus, the set of candidate data points for the query points in Q_C is Ω(q1) = {p1, p2, p3}, because dist(q1, p1) ≥ sntl_dist, dist(q1, p2) ≥ sntl_dist, and dist(q1, p3) ≥ sntl_dist. Clearly, q1 no longer examines the other data cluster {p4 p5, p5 p6}, because sntl_dist is larger than maxdist(q1, {p4 p5, p5 p6}) = 11, as shown in Table 3. The kFN queries at the other border points v1 and v2 are evaluated similarly. The two FNs of v1 are p2 and p3, as illustrated in Figures 7 and 8.
Thus, the set of candidate data points at v1 is Ω(v1) = {p1, p2, p3}, because the sentinel distance for v1 is sntl_dist(v1) = 15. Similarly, the two FNs of v2 are p1 and p2, as illustrated in Figures 9 and 10. Thus, the set of candidate data points at v2 is Ω(v2) = {p1, p2, p3}, because the sentinel distance for v2 is sntl_dist(v2) = 18. Table 5 summarizes the sets of candidate data points for the border points q1, v1, and v2 and their sentinel distances.

Table 5. Results of kFN queries at q1, v1, and v2 and their sentinel distances.
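The border-point evaluation described above can be sketched as follows. This is a simplified illustration, not the paper's exact implementation: `maxdist` (the precomputed upper bound on the distance from a border point to a data cluster) and `dist` (the exact network-distance oracle) are assumed to be supplied externally, and `l` is the maximum distance between the border points of Q_C.

```python
import heapq

def kfn_at_border_point(b, data_clusters, k, l, maxdist, dist):
    """Evaluate a kFN query at border point b and collect the candidate
    data points shared by every query point in the query cluster.

    data_clusters -- iterable of data clusters (lists of data points)
    l             -- maximum distance between the border points of Q_C
    maxdist(b, C) -- upper bound on the network distance from b to cluster C
    dist(b, p)    -- exact network distance from b to data point p
    """
    # Examine the data clusters in descending order of maximum distance to b.
    order = sorted(data_clusters, key=lambda C: maxdist(b, C), reverse=True)
    heap = []                  # min-heap of the k farthest points seen so far
    examined = []              # (distance, point) pairs of every examined point
    sntl_dist = float("-inf")  # sentinel distance; -inf until k FNs are known
    for C in order:
        # Pruning: once the sentinel distance exceeds the cluster's maximum
        # distance to b, no point in C can be a candidate.
        if len(heap) == k and sntl_dist >= maxdist(b, C):
            break
        for p in C:
            d = dist(b, p)
            examined.append((d, p))
            heapq.heappush(heap, (d, p))
            if len(heap) > k:
                heapq.heappop(heap)   # drop the nearest of the k+1 points
        if len(heap) == k:
            # Sentinel distance: distance to the k-th FN minus the span l.
            sntl_dist = heap[0][0] - l
    # Every examined point at least sntl_dist away from b is a candidate.
    return sntl_dist, [p for d, p in examined if d >= sntl_dist]
```

For the running example (k = 2, l = 5), the query at q1 examines only {p1p2p3}, yields sntl_dist = 20, and returns the candidate set {p1, p2, p3}, matching Table 5.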

Evaluating an Example kFN Join Query
The CNLJ algorithm retrieves the k FNs of each query point in Q_C among the candidate data points in Ω(B(Q_C)). The example kFN join query requires two FNs for each query point, i.e., k = 2, and the set of candidate data points is Ω(B(Q_C)) = {p1, p2, p3}. Each of q1, q2, q3, and q4 retrieves its two FNs among the candidate data points p1, p2, and p3: the two FNs of q1 are determined first, then the two FNs of q2, and so on. Consider finding the two FNs of q1 among the candidate data points p1, p2, and p3. The distances from q1 to p1, p2, and p3 are computed using the fact that the shortest path from q1 to a candidate data point must pass through a border point. These distances evaluate to dist(q1, p1) = 24, dist(q1, p2) = 25, and dist(q1, p3) = 27. Thus, p2 and p3 are the two FNs of q1, whose result set is Ω(q1) = {p2, p3}. Next, the two FNs of q2 are retrieved among the candidate data points p1, p2, and p3. For this, the distances from q2 to p1, p2, and p3 are computed. As shown in Table 6, these distances evaluate to dist(q2, p1) = 25, dist(q2, p2) = 26, and dist(q2, p3) = 25, respectively. Thus, p1 and p2 are the two FNs of q2, whose result set is Ω(q2) = {p1, p2}. Similarly, the two FNs of q3 and q4 can be retrieved among the candidate data points p1, p2, and p3. Table 6 lists the distance from each query point q to each candidate data point p and the two FNs of q among the candidate data points, where q ∈ {q1, q2, q3, q4} and p ∈ {p1, p2, p3}. Finally, the kFN join result is the union of the kFN query results for the query points in Q.
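The per-query-point refinement step can be sketched as below. The helpers `dist_in` (in-cluster distance from a query point to a border point) and `dist_out` (distance from a border point to a candidate data point, produced by the border-point kFN queries) are hypothetical stand-ins for the paper's precomputed distance tables; this is a sketch, not the exact implementation.

```python
def kfn_in_cluster(query_points, border_points, candidates, k,
                   dist_in, dist_out):
    """Pick the k FNs of each query point among the shared candidate set.

    dist_in(q, b)  -- in-cluster distance from query point q to border point b
    dist_out(b, p) -- distance from border point b to candidate data point p
    """
    result = {}
    for q in query_points:
        # The shortest path from q to a candidate p must cross a border
        # point, so dist(q, p) is the minimum over all border points.
        d = {p: min(dist_in(q, b) + dist_out(b, p) for b in border_points)
             for p in candidates}
        # Keep the k candidates with the largest network distance to q.
        result[q] = sorted(candidates, key=lambda p: d[p], reverse=True)[:k]
    return result
```

Because the candidate set is shared by all query points in the cluster, this step needs no further network traversal: each distance is assembled from the already-computed border-point distances.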

Performance Evaluation
The CNLJ algorithm and its competitors are compared empirically in this section under a variety of conditions. Section 5.1 describes the experimental conditions, and Section 5.2 reports the experimental results.

Table 7 describes the two real-world roadmaps [40] used in the experiments. These roadmaps have different sizes and are parts of the road network of the United States. For convenience, the data universe was normalized to the unit square of the plane. The query and data points were generated to mimic the highly skewed distributions of POIs in the real world. First, the centroids c1, c2, ..., cm were chosen randomly inside the data universe, where m denotes the total number of centroids and varies between 1 and 10. The query and data points around each centroid followed a normal distribution, with the mean at the centroid and the standard deviation set to σ = 10^-2. Table 8 shows the experimental parameter settings. In each experiment, we varied a single parameter within its range while keeping the other parameters at their bold default values.

The baseline algorithm, a nonclustering join algorithm that sequentially computes the k FNs of each query point in Q, was used as a benchmark for evaluating the CNLJ algorithm. We implemented and evaluated two versions of the proposed solution, CNLJ-NV and CNLJ-OPT. The naive version, CNLJ-NV, groups query points into query segments, as illustrated in Figure 3a; thus, CNLJ-NV evaluates at most two kFN queries per query segment. The optimized version, CNLJ-OPT, groups query points into query clusters using the two-step clustering method, as illustrated in Figure 3b. The source code for the empirical evaluation in this study is available on GitHub at https://github.com/Hyung-Ju-Cho/ (accessed on 8 February 2021).
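The skewed point generation described above can be sketched with Python's standard library as follows. The fixed seed and the clamping of sampled coordinates to the unit square are our assumptions, not details stated in the paper.

```python
import random

def generate_points(n, m, sigma=1e-2, seed=None):
    """Generate n points clustered around m randomly chosen centroids in
    the unit square, mimicking the skewed POI distributions used in the
    experiments (assumption: out-of-range samples are clamped to [0, 1])."""
    rng = random.Random(seed)
    # Centroids are chosen uniformly at random inside the data universe.
    centroids = [(rng.random(), rng.random()) for _ in range(m)]
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centroids)
        # Normally distributed offset around the centroid (sigma = 10^-2).
        x = min(max(rng.gauss(cx, sigma), 0.0), 1.0)
        y = min(max(rng.gauss(cy, sigma), 0.0), 1.0)
        points.append((x, y))
    return points
```

For example, `generate_points(5000, 5)` produces a query set of 5000 points concentrated around five centroids, matching the |Q| and |C_Q| settings used in the experiments.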
All join algorithms were implemented in C++ in the Microsoft Visual Studio 2019 development environment, and the algorithms' common subroutines were reused for similar tasks. Experiments were conducted on a desktop computer running the Windows 10 operating system with 32 GB RAM and a 3.1 GHz 8-core processor (i9-9900). As in several existing studies [36,41] on online map services, this empirical study assumes that all of the algorithms' indexing structures remain in main memory so that kFN join queries can be evaluated quickly. The average time required to answer a kFN join query was obtained through repeated experiments. Finally, the network distance between two points was computed quickly using the TNR method [42], which is easy to implement and demonstrates performance comparable to other shortest-distance algorithms [38,41,43-45].

Experimental Results
Figure 12 compares the proposed CNLJ-OPT, CNLJ-NV, and baseline algorithms on the NA roadmap. Each chart shows the kFN join query processing time and the number of kFN queries required to evaluate the kFN join query. The numbers of kFN queries required by the CNLJ-OPT, CNLJ-NV, and baseline algorithms to answer the kFN join query are shown in parentheses in Figures 12-14. Note that the CNLJ-OPT algorithm evaluates kFN queries at the border points of query clusters, the CNLJ-NV algorithm evaluates kFN queries at the end points of query segments, and the baseline algorithm evaluates kFN queries at the query points themselves. Consequently, the baseline algorithm evaluates as many kFN queries as there are query points, i.e., |Q|. Figure 12a shows the query processing times of the CNLJ-OPT, CNLJ-NV, and baseline algorithms when the number of query points varies between 1000 and 5000, i.e., 1000 ≤ |Q| ≤ 5000. For all values of |Q|, the CNLJ-OPT algorithm is faster than the CNLJ-NV and baseline algorithms. When |Q| = 5000, the CNLJ-OPT, CNLJ-NV, and baseline algorithms evaluate 281, 471, and 5000 kFN queries, respectively, and the CNLJ-OPT algorithm is 1.2 and 36.7 times faster than the CNLJ-NV and baseline algorithms, respectively. Figure 12b shows the query processing times when the number of data points varies from 1000 to 5000, i.e., 1000 ≤ |P| ≤ 5000. Regardless of the value of |P|, the CNLJ-OPT, CNLJ-NV, and baseline algorithms evaluate 58, 96, and 1000 kFN queries, respectively. When |P| = 3000, the CNLJ-OPT algorithm outperforms the CNLJ-NV and baseline algorithms by factors of 1.2 and 15.6, respectively. Figure 12c shows the query processing times when the number of required FNs varies between 1 and 16, i.e., 1 ≤ k ≤ 16. For all values of k, the CNLJ-OPT algorithm outperforms the CNLJ-NV and baseline algorithms by factors of 1.2 and 13.4, respectively.
The query processing times of the CNLJ-OPT, CNLJ-NV, and baseline algorithms are not affected by the value of k. This is because the kFN query evaluation computes the distances from a query point to the data clusters regardless of the value of k and then sorts the data clusters by their distance to the query point. Figure 12d shows the query processing times when the number of centroids for the query points in Q varies between 1 and 10, i.e., 1 ≤ |C_Q| ≤ 10. As the value of |C_Q| increases, the difference between the query processing times of the algorithms decreases. The CNLJ-OPT algorithm is 13.3, 1.3, 1.6, 1.7, and 1.1 times faster than the baseline algorithm when |C_Q| = 1, 3, 5, 7, and 10, respectively. The reason is that as |C_Q| increases, the query points become widely dispersed and the number of query clusters increases, which slows down the CNLJ-OPT algorithm. Figure 12e shows the query processing times when the number of centroids for the data points in P varies between 1 and 10, i.e., 1 ≤ |C_P| ≤ 10. The kFN query processing time increases with |C_P|, because as |C_P| increases, the data points become widely dispersed and the number of data clusters to be examined by the kFN queries also increases. To summarize, the CNLJ-OPT algorithm outperforms the CNLJ-NV and baseline algorithms in all cases, confirming that the CNLJ-OPT algorithm benefits from clustering nearby query points and retrieving the candidate data points for those query points all at once.

Figure 12. Comparison of kFN join query processing times for the NA roadmap: (a) 10^3 ≤ |Q| ≤ 5 × 10^3; (b) 10^3 ≤ |P| ≤ 5 × 10^3; (c) 1 ≤ k ≤ 16; (d) 1 ≤ |C_Q| ≤ 10; (e) 1 ≤ |C_P| ≤ 10.

Figure 13 compares the performance of the CNLJ-OPT, CNLJ-NV, and baseline algorithms on the SJ roadmap. Note that the experimental results for the SJ roadmap exhibit performance patterns similar to those for the NA roadmap.
Figure 13a shows the query processing times when 1000 ≤ |Q| ≤ 5000. The CNLJ-OPT algorithm is 1.2 and 6.0 times faster than the CNLJ-NV and baseline algorithms, respectively, when |Q| = 5000. Figure 13b shows the query processing times when 1000 ≤ |P| ≤ 5000. The CNLJ-OPT algorithm is 1.2 and 4.5 times faster than the CNLJ-NV and baseline algorithms, respectively, when |P| = 4000. The query processing times of all algorithms increase with |P|. Figure 13c shows the query processing times when 1 ≤ k ≤ 16. The CNLJ-OPT algorithm is 1.2 and 3.5 times faster than the CNLJ-NV and baseline algorithms, respectively, and the query processing times are nearly constant regardless of k. Figure 13d shows the query processing times when 1 ≤ |C_Q| ≤ 10. The CNLJ-OPT algorithm is 3.5, 2.4, 1.8, 1.4, and 1.4 times faster than the baseline algorithm when |C_Q| = 1, 3, 5, 7, and 10, respectively. This result shows that the distribution of the query points affects the query processing time of the CNLJ-OPT algorithm: its query processing time increases with the number of query clusters because the query points become widely dispersed. Figure 13e shows the query processing times when 1 ≤ |C_P| ≤ 10. The CNLJ-OPT algorithm is 2.8, 4.0, 3.5, 2.9, and 3.0 times faster than the baseline algorithm when |C_P| = 1, 3, 5, 7, and 10, respectively.

Figure 13. Comparison of kFN join query processing times for the SJ roadmap: (a) 10^3 ≤ |Q| ≤ 5 × 10^3; (b) 10^3 ≤ |P| ≤ 5 × 10^3; (c) 1 ≤ k ≤ 16; (d) 1 ≤ |C_Q| ≤ 10; (e) 1 ≤ |C_P| ≤ 10.

Figure 14 compares the performance of the CNLJ-OPT, CNLJ-NV, and baseline algorithms while the numbers of query and data points vary between 1000 and 10,000, i.e., 1000 ≤ |Q| ≤ 10,000 and 1000 ≤ |P| ≤ 10,000, to verify the scalability of the CNLJ-OPT algorithm. As shown in Figure 14a,c, the CNLJ-OPT algorithm runs faster than the CNLJ-NV and baseline algorithms for all values of |Q|.
The performance difference between them typically increases with |Q|. Specifically, when |Q| = 10,000, the CNLJ-OPT algorithm runs 36.6 and 5.3 times faster than the baseline algorithm on the NA and SJ roadmaps, respectively. As shown in Figure 14b,d, the CNLJ-OPT algorithm runs faster than the CNLJ-NV and baseline algorithms for all values of |P|. Specifically, when |P| = 10,000, the CNLJ-OPT algorithm runs 6.4 and 3.0 times faster than the baseline algorithm on the NA and SJ roadmaps, respectively. The experimental results confirm that the CNLJ-OPT algorithm scales better with both |Q| and |P| than the CNLJ-NV and baseline algorithms.

Discussion and Conclusion
Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join query pairs each query point in Q with its k FNs in P. The kFN join query has various real-life applications, including recommender systems and computational geometry [6-14]. In particular, the efficient processing of kFN join queries can aid in selecting a facility location that is as far as possible from unpleasant facilities such as garbage incinerators, crematoriums, and chemical plants. In this study, a cluster nested loop join (CNLJ) algorithm was developed to efficiently answer kFN join queries in spatial networks. To the best of our knowledge, this is the first study of kFN join queries in spatial networks. The CNLJ algorithm groups query points (data points) into query clusters (data clusters). It then retrieves the candidate data points for the clustered query points all at once, eliminating the need to search for candidate data points for each query point separately. The query processing times of the CNLJ algorithm and the conventional join algorithms were compared empirically using real-life roadmaps under various conditions. The experimental results demonstrated that the CNLJ algorithm runs up to 50.8 times faster than the conventional join algorithms and that it scales better with the numbers of both query and data points. However, the CNLJ algorithm shows performance similar to that of the conventional join algorithms when the query points are uniformly located in the region. We intend to extend the proposed solution in several directions in the future. First, when the dataset does not fit in main memory, we will create index structures on external memory. Second, we will conduct an empirical study that simulates real-life scenarios using real datasets. Third, we will improve the CNLJ algorithm for the efficient processing of kFN joins over query points that are uniformly scattered in the region.