Article

Cluster Nested Loop k-Farthest Neighbor Join Algorithm for Spatial Networks

Department of Software, Kyungpook National University, 2559 Gyeongsang-daero, Sangju-si 37224, Korea
ISPRS Int. J. Geo-Inf. 2022, 11(2), 123; https://doi.org/10.3390/ijgi11020123
Submission received: 4 December 2021 / Revised: 20 January 2022 / Accepted: 30 January 2022 / Published: 9 February 2022
(This article belongs to the Special Issue Spatio-Temporal and Constraint Databases)

Abstract

This paper considers k-farthest neighbor (kFN) join queries in spatial networks where the distance between two points is the length of the shortest path connecting them. Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join query retrieves the k data points farthest from each query point in Q. There are many real-life applications using kFN join queries, including artificial intelligence, computational geometry, information retrieval, and pattern recognition. However, the solutions based on the Euclidean distance or nearest neighbor search are not suitable for our purpose due to the difference in the problem definition. Therefore, this paper proposes a cluster nested loop join (CNLJ) algorithm, which clusters query points (data points) into query clusters (data clusters) and reduces the number of kFN queries required to perform the kFN join. An empirical study was performed using real-life roadmaps to confirm the superiority and scalability of the CNLJ algorithm compared to the conventional solutions in various conditions.

1. Introduction

In this study, we investigate the efficient processing of k-farthest neighbor (kFN) join queries in spatial networks where the distance between two points is defined by the length of the shortest path connecting them. The kFN join combines each query point q in Q with the k data points in P that are farthest from the query point q, given a positive integer k, a set of query points Q, and a set of data points P. The kFN join query has real-life applications in recommender systems, where farthest neighbors can increase the variety of recommendations [1,2]. Farthest neighbor search is also an element in clustering applications [3], complete linkage clustering [4], and nonlinear dimensionality reduction algorithms [5]. Thus, being able to quickly process kFN join queries is an important practical concern for many applications [6,7,8,9,10,11,12,13,14].
Figure 1 shows an example of the kFN join between a set Q of query points and a set P of data points in a spatial network, where it is assumed that $k = 1$, $Q = \{q_1, q_2, q_3\}$, and $P = \{p_1, p_2, p_3, p_4\}$ are given. In this paper, the kFN join is denoted as $Q \ltimes_{kFN} P$. In this example, the data points farthest from $q_1$, $q_2$, and $q_3$ are $p_2$, $p_2$, and $p_3$, respectively, which can be represented by $Q \ltimes_{kFN} P = \{\langle q_1, p_2\rangle, \langle q_2, p_2\rangle, \langle q_3, p_3\rangle\}$. Conversely, the query points farthest from $p_1$, $p_2$, $p_3$, and $p_4$ are $q_1$, $q_1$, $q_3$, and $q_1$, respectively, which can be represented by $P \ltimes_{kFN} Q = \{\langle p_1, q_1\rangle, \langle p_2, q_1\rangle, \langle p_3, q_3\rangle, \langle p_4, q_1\rangle\}$. This simple example proves that the kFN join is not commutative, i.e., $Q \ltimes_{kFN} P \neq P \ltimes_{kFN} Q$; this study considers $Q \ltimes_{kFN} P$. The facility location problem, which determines a competitive location for a new facility such as a garbage incinerator, crematorium, chemical plant, supermarket, or police station, is an important real-life application of kFN join queries. In particular, determining the optimal facility location is still an open problem [15,16], and efficiently evaluating kFN join queries is remarkably useful for it. Assume that query points $q_1$ through $q_3$ represent unpleasant facilities such as garbage incinerators and chemical plants, whereas data points $p_1$ through $p_4$ represent available rental apartments. A kFN join between a set Q of unpleasant facilities and a set P of available rental apartments could then be phrased as "find ordered pairs of an unpleasant facility q and an available rental apartment p such that p is farther from q than the other available rental apartments." Naturally, $p_2$ or $p_3$ may be the competitive apartment in terms of the distance to the unpleasant facilities.
The kFN join query must repeatedly compute the distances between pairs of query and data points, which leads to a long query processing time. A simple solution to the kFN join query between a query set Q and a dataset P repeatedly scans all data points in P for each query point in Q to compute the distance of each pair $\langle q, p\rangle$. This simple solution is unacceptable in most cases because it retrieves candidate data points for each query point separately; it may, however, be adequate when the query points are uniformly distributed throughout the region. Despite their importance, kFN join queries have not received adequate attention for spatial networks. This paper proposes a cluster nested loop join (CNLJ) algorithm that efficiently processes kFN join queries in spatial networks. Specifically, query points (data points) are clustered into query clusters (data clusters) using the spatial network connectivity. The CNLJ algorithm exploits shared computation for query clusters to avoid unnecessary computations of the distances between query and data points. The CNLJ algorithm has several advantages over the traditional solution: (1) it clusters query points (data points) using the spatial network connectivity for shared computation, (2) it quickly retrieves candidate data points at once for the clustered query points, and (3) it does not retrieve candidate data points for each query point separately. To the best of our knowledge, this is the first attempt to study kFN join queries for spatial networks.
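For concreteness, the following is a minimal sketch of the baseline nested loop join described above. The point representation, the dist callback (standing in for a shortest path computation such as Dijkstra's algorithm), and all names are illustrative assumptions, not the paper's implementation.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

using Point = int;  // a point is identified by an integer id in the network

// Baseline (nonclustering) kFN join: for each query point, scan every data
// point and keep the k largest distances, i.e., |Q| independent kFN queries.
// The caller supplies dist, assumed to return the shortest path distance.
std::vector<std::pair<Point, std::vector<Point>>>
baseline_kfn_join(int k, const std::vector<Point>& Q, const std::vector<Point>& P,
                  const std::function<double(Point, Point)>& dist) {
    std::vector<std::pair<Point, std::vector<Point>>> result;
    for (Point q : Q) {
        std::vector<std::pair<double, Point>> cand;      // (dist(q, p), p)
        cand.reserve(P.size());
        for (Point p : P) cand.emplace_back(dist(q, p), p);
        int kk = std::min<int>(k, static_cast<int>(cand.size()));
        // move the k largest distances to the front, in decreasing order
        std::partial_sort(cand.begin(), cand.begin() + kk, cand.end(),
                          std::greater<std::pair<double, Point>>());
        std::vector<Point> kfn;
        for (int i = 0; i < kk; ++i) kfn.push_back(cand[i].second);
        result.emplace_back(q, std::move(kfn));
    }
    return result;
}
```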
The primary contributions of this study are listed as follows:
  • This paper presents a cluster nested loop join algorithm for quickly evaluating spatial network kFN join queries. The CNLJ algorithm clusters query points before retrieving candidate data points for clustered query points all at once. As a result, it does not retrieve candidate data points for each query point multiple times.
  • The CNLJ algorithm’s correctness is demonstrated through mathematical reasoning. In addition, a theoretical analysis is provided to clarify the benefits and drawbacks of the CNLJ algorithm concerning query point spatial compactness.
  • An empirical study with various setups was conducted to demonstrate the superiority and scalability of the CNLJ algorithm. The CNLJ algorithm outperforms the conventional join algorithms by up to 50.8 times according to the results.
The remainder of this paper is organized as follows: Section 2 reviews related research and provides some background knowledge. Section 3 describes the clustering of query points (data points) and the computing of the maximum and minimum distances between a border point and a data cluster. Section 4 presents the CNLJ algorithm for rapidly evaluating kFN join queries in spatial networks. Section 5 presents the results of experiments using the CNLJ and conventional join algorithms with different setups. Finally, the conclusions of this study are discussed in Section 6.

2. Background

Section 2.1 reviews related work, and Section 2.2 defines the terms and notation used in this study.

2.1. Related Work

Many studies have considered spatial queries based on the farthest neighbor (FN) search [6,7,8,9,10,11,13,14,17,18,19,20]. Korn and Muthukrishnan [21] pioneered the concept of the reverse farthest neighbor (RFN) query to obtain the weak influence set. Given a set of data points P and a query point q, the RFN query retrieves the set of data points $p \in P$ such that q is their farthest neighbor among all points in $P \cup \{q\}$. This is the monochromatic RFN (MRFN) query [8,9,13,14,19]. Another version of the RFN query is the bichromatic reverse farthest neighbor (BRFN) query [10,13,14,22]. Given a set of data points P, a set of query points Q, and a query point q in Q, the BRFN query retrieves the set of data points p in P such that q is the farthest neighbor of p among all query points in Q. Many studies have addressed RFN queries in the Euclidean space [8,9,14,19,22] and in spatial networks [10,13]. Yao et al. [14] proposed the progressive farthest cell and convex hull farthest cell algorithms to answer RFN queries using an R-tree [23,24]. A solution to answer reverse kFN queries in the Euclidean space was presented for arbitrary values of k [22]. Liu et al. [19] proposed the concept of the group RkFN query in the obstacle space and presented a query optimization algorithm based on the Voronoi graph. Tran et al. [10] proposed a solution for RFN and RkFN queries in road networks by using Voronoi-diagram-related attributes and Dijkstra's algorithm. Xu et al. [13] presented efficient algorithms based on landmarks and hierarchical partitioning to process monochromatic and bichromatic RFN queries in spatial networks. The approximate version of the problem, known as the c-approximate farthest neighbor (c-AFN) search, has been actively studied because designing an efficient method for the exact FN search in high-dimensional space is difficult [6,17,18,25]. Huang et al. [18,25] introduced the concept of a reverse locality-sensitive hashing (LSH) family, developed reverse query-aware LSH functions, and proposed two hashing schemes for the high-dimensional c-AFN search over external memory. Liu et al. [17] developed an approximate algorithm with a theoretical guarantee for the high-dimensional c-AFN search over external memory. Curtin et al. [6] proposed an algorithm with an absolute approximation guarantee for the FN search in high-dimensional space and presented an information-theoretical measure of hardness to estimate the difficulty of the FN search problem. Farthest dominated location (FDL) queries were proposed in [26]: given a set of data points P with spatial and nonspatial attributes, a set L of candidate locations, and a design competence vector Ψ for L, an FDL query retrieves the location $s \in L$ that maximizes the distance to its nearest dominating object in P. Gao et al. [7] studied aggregate k-farthest neighbor (AkFN) queries, which are defined by aggregation functions such as min, max, and sum, and presented the MB and BF algorithms based on the R-tree [23,24]: given a set of data points P and a set of query points Q, an AkFN query retrieves the k data points in P with the largest aggregate distances to all query points in Q. Effective solutions to AkFN queries in spatial networks were also proposed [11].
Due to the differences in the properties of the shortest path distance and the Euclidean distance, existing solutions for the Euclidean space cannot be used directly to answer kFN join queries in spatial networks. Similarly, existing solutions for nearest neighbor search [27,28,29] cannot readily be adapted to farthest neighbor search problems because farness and nearness have different distance properties. Although the group computation of spatial queries has received considerable attention [19,27,30,31,32,33,34], group computation has not been applied to kFN join queries in spatial networks. Efficiently processing kFN join queries in spatial networks therefore requires new, sophisticated algorithms, for several reasons: first, the kFN join is a costly operation by definition; second, farthest neighbor search is more difficult than nearest neighbor search; finally, designing index structures that effectively support the FN search in spatial networks is difficult. Table 1 compares our problem scenario to existing studies in terms of the space domain, query type, and data type.

2.2. Notation and Formal Problem Description

Query and data points are placed in a spatial network G, and these points represent points of interest (POIs), as shown in Figure 1. Given two points q and p, $dist(q, p)$ is the length of the shortest path between q and p in G. Table 2 summarizes the symbols used in this study.
Definition 1.
kFN search [6,7,8,9,10,11,13,14]. Given a positive integer k, a query point q, and a set P of data points, the kFN search returns a set of k data points, denoted as $\Omega(q)$, such that $dist(q, p^+) \geq dist(q, p^-)$ holds for $p^+ \in \Omega(q)$ and $p^- \in P \setminus \Omega(q)$.
Definition 2.
kFN join. Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join query, denoted as $Q \ltimes_{kFN} P$, returns ordered pairs of each query point q in Q and the set of k data points farthest from q. For simplicity, $Q \ltimes_{kFN} P$ is abbreviated to $Q \ltimes P$, which is formally defined by $Q \ltimes P = \{\langle q, \Omega(q)\rangle \mid q \in Q\}$. Note that the kFN join is not commutative, i.e., $Q \ltimes P \neq P \ltimes Q$.
Definition 3.
Spatial network [32,33,36,37,38]. A weighted undirected graph $G = \langle V, E, W \rangle$ is used to represent a spatial network, where V, E, and W represent the vertex set, edge set, and edge distance matrix, respectively. Each edge has a non-negative weight that indicates the network distance.
Definition 4.
Intersection, intermediate, and terminal vertices. Vertices can be divided into three categories based on their degree: (1) If the degree of a vertex is larger than or equal to 3, the vertex is referred to as an intersection vertex. (2) If the degree is 2, the vertex is an intermediate vertex. (3) If the degree is 1, the vertex is a terminal vertex.
Definition 5.
Vertex sequence, query segment, and data segment. A vertex sequence $\overline{v_l v_{l+1} \cdots v_m}$ denotes a path between two vertices $v_l$ and $v_m$ such that $v_l$ and $v_m$ are each either an intersection vertex or a terminal vertex, and the other vertices in the path, $v_{l+1}, \ldots, v_{m-1}$, are intermediate vertices. A query segment $\overline{q_i q_{i+1} \cdots q_j}$ denotes a line segment connecting query points $q_i, q_{i+1}, \ldots, q_j$, and a data segment $\overline{p_l p_{l+1} \cdots p_m}$ denotes a line segment connecting data points $p_l, p_{l+1}, \ldots, p_m$. For simplicity, $\overline{q_i q_{i+1} \cdots q_j}$ and $\overline{p_l p_{l+1} \cdots p_m}$ are abbreviated to $\overline{q_i q_j}$ and $\overline{p_l p_m}$, respectively, to reduce confusion.
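To make Definitions 3 and 4 concrete, the following sketch shows one possible adjacency list representation of a weighted undirected graph $G = \langle V, E, W \rangle$ together with the degree-based vertex classification; the type and function names are assumptions for illustration, not part of the paper.

```cpp
#include <vector>

// A weighted undirected spatial network G = <V, E, W> stored as adjacency lists.
struct Graph {
    struct Edge { int to; double w; };       // w: non-negative network distance
    std::vector<std::vector<Edge>> adj;      // adj[v]: edges incident to vertex v

    explicit Graph(int n) : adj(n) {}
    void add_edge(int u, int v, double w) {  // undirected: insert both directions
        adj[u].push_back({v, w});
        adj[v].push_back({u, w});
    }
};

enum class VertexKind { Intersection, Intermediate, Terminal };

// Definition 4: classify a vertex by its degree.
VertexKind classify(const Graph& g, int v) {
    const std::size_t deg = g.adj[v].size();
    if (deg >= 3) return VertexKind::Intersection;
    if (deg == 2) return VertexKind::Intermediate;
    return VertexKind::Terminal;             // degree 1 (or an isolated vertex)
}
```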

3. Clustering Points and Computing Distances

In Section 3.1, we group query points (data points) by using the spatial network connection. We calculate the maximum and minimum distances between a border point and a data cluster in Section 3.2.

3.1. Clustering Query and Data Points Using Spatial Network Connection

Figure 2 illustrates an example of the kFN join $Q \ltimes P$, where $k = 2$, $Q = \{q_1, q_2, q_3, q_4\}$, and $P = \{p_1, p_2, \ldots, p_6\}$ are given. The example kFN join query requires that each query point q in Q find the two data points farthest from q.
Figure 3 shows an example of the two-step clustering method that groups nearby query points into query clusters. In the first step, query points in a vertex sequence are connected to form a query segment. In Figure 3a, query points $q_1$ and $q_2$ in the vertex sequence $\overline{q_1 q_2 v_2}$ are connected to become $\overline{q_1 q_2}$. Thus, three query segments $\overline{q_1 q_2}$, $q_3$, and $q_4$ are generated, as shown in Figure 3a. In the second step, adjacent query segments are connected through intersection vertices to form a query cluster. In Figure 3b, the intersection vertex $q_1$ connects the two query segments $\overline{q_1 q_2}$ and $q_4$. Similarly, $q_3$ and $q_4$ are linked by the intersection vertex $v_1$. Finally, $\overline{q_1 q_2}$ and $q_3$ are linked by the intersection vertex $v_2$. As a result, the three query segments $\overline{q_1 q_2}$, $q_3$, and $q_4$ are linked to form the query cluster $\{\overline{q_1 q_2 v_2}, \overline{q_1 q_4 v_1}, \overline{v_1 q_3 v_2}\}$. Note that a query cluster is a set of query segments. Accordingly, the set of query points $Q = \{q_1, q_2, q_3, q_4\}$ is converted into the set of query clusters $\overline{Q} = \{\{\overline{q_1 q_2 v_2}, \overline{q_1 q_4 v_1}, \overline{v_1 q_3 v_2}\}\}$. Let us define a border point of a query cluster $\overline{QC}$: when the query cluster $\overline{QC}$ and its nonquery complement $G - \overline{QC}$ meet at a point, that point is referred to as a border point of $\overline{QC}$. In this example, three border points $q_1$, $v_1$, and $v_2$ are found for $\overline{QC} = \{\overline{q_1 q_2 v_2}, \overline{q_1 q_4 v_1}, \overline{v_1 q_3 v_2}\}$. Thus, the set of border points of $\overline{QC}$ is $B(\overline{QC}) = \{q_1, v_1, v_2\}$.
Figure 4 shows an example of the two-step clustering method that groups neighboring data points into data clusters. Notably, the query and data points are clustered using the same two-step method. In the first step, data points $p_1$, $p_2$, and $p_3$ in the vertex sequence $\overline{v_1 p_2 v_3}$ are connected to become the data segment $\overline{p_1 p_2 p_3}$. Similarly, data points $p_4$ and $p_5$ in the vertex sequence $\overline{p_5 p_4 q_1}$ are linked to form the data segment $\overline{p_4 p_5}$. As a result, three data segments $\overline{p_1 p_2 p_3}$, $\overline{p_4 p_5}$, and $p_6$ are generated, as illustrated in Figure 4a. In the second step, the two data segments $\overline{p_4 p_5}$ and $p_6$ are joined by the intersection vertex $p_5$ to form the data cluster $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$. As a result, the set of data points $P = \{p_1, p_2, \ldots, p_6\}$ is transformed into the set of data clusters $\overline{P} = \{\{\overline{p_1 p_2 p_3}\}, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}\}$.
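A sketch of the two-step clustering under simplifying assumptions (points and segment endpoints are given as vertex ids, vertex sequences are provided as id lists, and a union–find structure merges segments that meet at a shared endpoint) might look as follows. It conveys the idea only; the paper's actual implementation may differ.

```cpp
#include <numeric>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Union-find structure used to merge query segments into query clusters.
struct DSU {
    std::vector<int> parent;
    explicit DSU(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Step 1: inside each vertex sequence, maximal runs of query points form query
// segments (represented here by their first and last point). Step 2: segments
// that meet at the same (intersection) vertex are merged into one cluster.
std::vector<std::vector<int>> two_step_clustering(
    const std::vector<std::vector<int>>& sequences,   // vertex sequences of G
    const std::unordered_set<int>& points) {          // query (or data) points
    std::vector<std::pair<int, int>> segments;        // (first, last) of each segment
    for (const auto& seq : sequences) {
        int start = -1, last = -1;
        for (int v : seq) {
            if (points.count(v)) { if (start < 0) start = v; last = v; }
            else if (start >= 0) { segments.push_back({start, last}); start = -1; }
        }
        if (start >= 0) segments.push_back({start, last});
    }
    DSU dsu(static_cast<int>(segments.size()));
    std::unordered_map<int, int> seen;                // endpoint vertex -> segment index
    for (int i = 0; i < static_cast<int>(segments.size()); ++i)
        for (int v : {segments[i].first, segments[i].second}) {
            auto [it, fresh] = seen.emplace(v, i);
            if (!fresh) dsu.unite(i, it->second);     // two segments meet at vertex v
        }
    std::unordered_map<int, std::vector<int>> groups; // cluster root -> segment indices
    for (int i = 0; i < static_cast<int>(segments.size()); ++i)
        groups[dsu.find(i)].push_back(i);
    std::vector<std::vector<int>> clusters;
    for (auto& kv : groups) clusters.push_back(std::move(kv.second));
    return clusters;                                  // each cluster: a set of segments
}
```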

3.2. Computing Maximum and Minimum Distances from a Border Point to a Data Cluster

The maximum and minimum distances between a border point $b_q$ and a data cluster $\overline{PC}$ are computed in this section. The minimum and maximum distances between $b_q$ and $\overline{PC}$ are formally defined by $mindist(b_q, \overline{PC}) = \min\{dist(b_q, p) \mid p \in \overline{PC}\}$ and $maxdist(b_q, \overline{PC}) = \max\{dist(b_q, p) \mid p \in \overline{PC}\}$, respectively. The minimum distance between $b_q$ and $\overline{PC}$ can easily be calculated by $mindist(b_q, \overline{PC}) = \min\{dist(b_q, b_p) \mid b_p \in B(\overline{PC})\}$, where $b_p$ is a border point of the data cluster $\overline{PC}$. The maximum distance between $b_q$ and $\overline{PC}$ can be represented by $maxdist(b_q, \overline{PC}) = \max\{maxdist(b_q, \overline{p_l p_m}) \mid \overline{p_l p_m} \in \overline{PC}\}$, where $maxdist(b_q, \overline{p_l p_m})$ is the maximum distance between $b_q$ and a data segment $\overline{p_l p_m}$ in $\overline{PC}$.
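Under these formulas, a small sketch of the two bounds is given below. It assumes that, for a fixed border point $b_q$, each data segment is summarized by the shortest path distances d1 and d2 from $b_q$ to its two endpoints and by its length; the segment-level maximum follows the min envelope argument worked out with Figure 5 in the next paragraphs. All names are illustrative.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// A data segment viewed from a fixed border point b_q: d1 and d2 are the
// shortest path distances from b_q to the segment's two endpoints, and len is
// the segment's length along the network.
struct SegmentView { double d1, d2, len; };

// dist(b_q, p) for a point p at offset x inside the segment is
// min(d1 + x, d2 + (len - x)); by the triangle inequality |d1 - d2| <= len,
// the two branches cross inside the segment, so the maximum over x is
// (d1 + d2 + len) / 2, and the minimum is attained at an endpoint.
double maxdist_segment(const SegmentView& s) { return (s.d1 + s.d2 + s.len) / 2.0; }
double mindist_segment(const SegmentView& s) { return std::min(s.d1, s.d2); }

// Cluster-level bounds: maxdist is the largest segment maxdist, and mindist is
// the smallest endpoint distance (standing in here for the distances to the
// border points of the data cluster).
double maxdist_cluster(const std::vector<SegmentView>& cluster) {
    double best = 0.0;
    for (const auto& s : cluster) best = std::max(best, maxdist_segment(s));
    return best;
}
double mindist_cluster(const std::vector<SegmentView>& cluster) {
    double best = std::numeric_limits<double>::infinity();
    for (const auto& s : cluster) best = std::min(best, mindist_segment(s));
    return best;
}
```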
An example is used to illustrate how to compute the maximum and minimum distances between a border point $b_q$ and a data cluster $\overline{PC}$. Recall that the example kFN join query has three border points and two data clusters, i.e., $B(\overline{QC}) = \{q_1, v_1, v_2\}$ and $\overline{P} = \{\{\overline{p_1 p_2 p_3}\}, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}\}$. In this section, the maximum and minimum distances between $b_q$ and $\overline{PC}$ are computed for $b_q \in \{q_1, v_1, v_2\}$ and $\overline{PC} \in \{\{\overline{p_1 p_2 p_3}\}, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}\}$. The computations of $maxdist(q_1, \{\overline{p_1 p_2 p_3}\})$, $maxdist(q_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$, $maxdist(v_1, \{\overline{p_1 p_2 p_3}\})$, $maxdist(v_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$, $maxdist(v_2, \{\overline{p_1 p_2 p_3}\})$, and $maxdist(v_2, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$ are illustrated in Figures 5, 6, 7, 8, 9, and 10, respectively.
Figure 5 illustrates the computation of $maxdist(q_1, \{\overline{p_1 p_2 p_3}\})$. First, the distances from $q_1$ to the endpoints $p_1$ and $p_3$ of $\overline{p_1 p_2 p_3}$ evaluate to $dist(q_1, p_1) = 24$ and $dist(q_1, p_3) = 27$, respectively. Consider a point p in $\overline{p_1 p_2 p_3}$. Because p lies in $\overline{p_1 p_2 p_3}$, whose length is $len(\overline{p_1 p_2 p_3}) = 5$, the distance between $q_1$ and p is computed by $dist(q_1, p) = \min\{dist(q_1, p_1) + len(\overline{p_1 p}),\ dist(q_1, p_3) + len(\overline{p_3 p})\} = \min\{24 + len(\overline{p_1 p}),\ 27 + len(\overline{p_3 p})\}$. Let $x = len(\overline{p_1 p})$ for $0 \le x \le 5$. Then, $len(\overline{p_3 p}) = 5 - x$ because $len(\overline{p_1 p}) + len(\overline{p_3 p}) = 5$. We can rewrite $dist(q_1, p) = \min\{24 + x,\ 27 + (5 - x)\}$ for $0 \le x \le 5$. As shown in Figure 5, the maximum and minimum distances between $q_1$ and $\{\overline{p_1 p_2 p_3}\}$ are $maxdist(q_1, \{\overline{p_1 p_2 p_3}\}) = 28$ and $mindist(q_1, \{\overline{p_1 p_2 p_3}\}) = 24$, respectively. For convenience, the star symbol (★) in Figure 5 marks $maxdist(q_1, \{\overline{p_1 p_2 p_3}\}) = 28$.
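The maximum can also be obtained analytically from the min envelope; a short worked derivation with the values of Figure 5:

\[
dist(q_1, p) = \min\{\, 24 + x,\; 32 - x \,\}, \qquad 0 \le x \le 5 .
\]

The two branches cross where $24 + x = 32 - x$, i.e., at $x = 4$, giving

\[
maxdist(q_1, \{\overline{p_1 p_2 p_3}\}) = 24 + 4 = 28 ,
\]

while the minimum of the envelope is attained at the endpoint $x = 0$, giving $mindist(q_1, \{\overline{p_1 p_2 p_3}\}) = 24$.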
The maximum distance between the border point $q_1$ and the data cluster $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ is represented by $maxdist(q_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = \max\{maxdist(q_1, \overline{p_4 p_5}),\ maxdist(q_1, \overline{p_5 p_6})\}$. The computations of $maxdist(q_1, \overline{p_4 p_5})$ and $maxdist(q_1, \overline{p_5 p_6})$ are illustrated in Figure 6a,b, respectively. The distances from $q_1$ to the endpoints $p_4$ and $p_5$ are $dist(q_1, p_4) = 5$ and $dist(q_1, p_5) = 8$, respectively. The maximum and minimum distances from $q_1$ to $\overline{p_4 p_5}$ are shown in Figure 6a as $maxdist(q_1, \overline{p_4 p_5}) = 8$ and $mindist(q_1, \overline{p_4 p_5}) = 5$, respectively. The distances from $q_1$ to the endpoints $p_5$ and $p_6$ of $\overline{p_5 p_6}$ are $dist(q_1, p_5) = 8$ and $dist(q_1, p_6) = 11$, respectively. The maximum and minimum distances from $q_1$ to $\overline{p_5 p_6}$ are calculated to be $maxdist(q_1, \overline{p_5 p_6}) = 11$ and $mindist(q_1, \overline{p_5 p_6}) = 8$, respectively, as shown in Figure 6b. Therefore, the maximum and minimum distances between $q_1$ and $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ are $maxdist(q_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 11$ and $mindist(q_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 5$, respectively.
Figure 7 illustrates the computation of $maxdist(v_1, \{\overline{p_1 p_2 p_3}\})$ and $mindist(v_1, \{\overline{p_1 p_2 p_3}\})$. The distances from $v_1$ to the endpoints $p_1$ and $p_3$ of $\overline{p_1 p_2 p_3}$ are $dist(v_1, p_1) = 19$ and $dist(v_1, p_3) = 24$, respectively. Thus, the maximum and minimum distances between $v_1$ and $\{\overline{p_1 p_2 p_3}\}$ are $maxdist(v_1, \{\overline{p_1 p_2 p_3}\}) = 24$ and $mindist(v_1, \{\overline{p_1 p_2 p_3}\}) = 19$, respectively.
Figure 8 illustrates the computation of $maxdist(v_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$ and $mindist(v_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$. The maximum distance between $v_1$ and $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ is computed by $maxdist(v_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = \max\{maxdist(v_1, \overline{p_4 p_5}),\ maxdist(v_1, \overline{p_5 p_6})\}$. The computations of $maxdist(v_1, \overline{p_4 p_5})$ and $maxdist(v_1, \overline{p_5 p_6})$ are illustrated in Figure 8a,b, respectively. The distances from $v_1$ to the endpoints $p_4$ and $p_5$ of $\overline{p_4 p_5}$ are $dist(v_1, p_4) = 10$ and $dist(v_1, p_5) = 9$, respectively. Thus, the maximum and minimum distances between $v_1$ and $\overline{p_4 p_5}$ are $maxdist(v_1, \overline{p_4 p_5}) = 11$ and $mindist(v_1, \overline{p_4 p_5}) = 9$, respectively, as shown in Figure 8a, where the star symbol (★) marks $maxdist(v_1, \overline{p_4 p_5}) = 11$. The distances from $v_1$ to the endpoints $p_5$ and $p_6$ of $\overline{p_5 p_6}$ are $dist(v_1, p_5) = 9$ and $dist(v_1, p_6) = 12$, respectively. As shown in Figure 8b, the maximum and minimum distances between $v_1$ and $\overline{p_5 p_6}$ are $maxdist(v_1, \overline{p_5 p_6}) = 12$ and $mindist(v_1, \overline{p_5 p_6}) = 9$, respectively. Thus, the maximum and minimum distances between $v_1$ and $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ are $maxdist(v_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 12$ and $mindist(v_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 9$, respectively.
Figure 9 illustrates the computation of $maxdist(v_2, \{\overline{p_1 p_2 p_3}\})$ and $mindist(v_2, \{\overline{p_1 p_2 p_3}\})$. The distances from $v_2$ to the endpoints $p_1$ and $p_3$ of $\overline{p_1 p_2 p_3}$ are $dist(v_2, p_1) = dist(v_2, p_3) = 23$. Thus, the maximum and minimum distances between $v_2$ and $\{\overline{p_1 p_2 p_3}\}$ are $maxdist(v_2, \{\overline{p_1 p_2 p_3}\}) = 25.5$ and $mindist(v_2, \{\overline{p_1 p_2 p_3}\}) = 23$, respectively. Note that the star symbol (★) in Figure 9 marks $maxdist(v_2, \{\overline{p_1 p_2 p_3}\}) = 25.5$.
Figure 10 illustrates the computation of $maxdist(v_2, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$ and $mindist(v_2, \{\overline{p_4 p_5}, \overline{p_5 p_6}\})$. The maximum distance between $v_2$ and $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ is computed by $maxdist(v_2, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = \max\{maxdist(v_2, \overline{p_4 p_5}),\ maxdist(v_2, \overline{p_5 p_6})\}$. The computations of $maxdist(v_2, \overline{p_4 p_5})$ and $maxdist(v_2, \overline{p_5 p_6})$ are illustrated in Figure 10a,b, respectively. The distances from $v_2$ to the endpoints $p_4$ and $p_5$ of $\overline{p_4 p_5}$ are $dist(v_2, p_4) = 8$ and $dist(v_2, p_5) = 5$, respectively. Thus, the maximum and minimum distances between $v_2$ and $\overline{p_4 p_5}$ are $maxdist(v_2, \overline{p_4 p_5}) = 8$ and $mindist(v_2, \overline{p_4 p_5}) = 5$, respectively, as shown in Figure 10a. The distances from $v_2$ to the endpoints $p_5$ and $p_6$ of $\overline{p_5 p_6}$ are $dist(v_2, p_5) = 5$ and $dist(v_2, p_6) = 8$, respectively. Thus, the maximum and minimum distances between $v_2$ and $\overline{p_5 p_6}$ are $maxdist(v_2, \overline{p_5 p_6}) = 8$ and $mindist(v_2, \overline{p_5 p_6}) = 5$, respectively, as shown in Figure 10b. Thus, the maximum and minimum distances between $v_2$ and $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ are $maxdist(v_2, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 8$ and $mindist(v_2, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 5$, respectively.
Table 3 summarizes the maximum and minimum distances between the border points in $B(\overline{QC})$ and the data clusters in $\overline{P}$.

4. Cluster Nested Loop Join Algorithm for Spatial Networks

Section 4.1 describes the CNLJ algorithm. Section 4.2 shows how kFN queries are evaluated at the border points of query clusters. Finally, Section 4.3 evaluates the example kFN join query.

4.1. Cluster Nested Loop Join Algorithm

The CNLJ algorithm is described in Algorithm 1, which involves two steps. The two-step clustering method (lines 2–4), which is described in Section 3.1, is used to group nearby query points (data points) into query clusters (data clusters) in the first step. In the second step, the kFN join is performed for each query cluster in $\overline{Q}$ (lines 5–8). Finally, the kFN join result $\Omega(Q)$ is returned to the query user when the kFN join is complete for every query cluster in $\overline{Q}$ (line 9).
Algorithm 1 CNLJ(k, Q, P)
Input: k: number of FNs for q, Q: set of query points, and P: set of data points
Output: $\Omega(Q)$: set of ordered pairs of each query point q in Q and the set of k FNs for q, i.e., $\Omega(Q) = \{\langle q, \Omega(q)\rangle \mid q \in Q\}$
1: $\Omega(Q) \leftarrow \emptyset$ // the result set $\Omega(Q)$ is initialized to the empty set
2: // Step 1: query and data points are clustered, as presented in Section 3.1
3: $\overline{Q} \leftarrow$ two_step_clustering(Q) // query points are grouped into query clusters
4: $\overline{P} \leftarrow$ two_step_clustering(P) // data points are grouped into data clusters
5: // Step 2: the kFN join is performed for each query cluster in $\overline{Q}$, as presented in Algorithm 2
6: for each query cluster $\overline{QC} \in \overline{Q}$ do
7:   $\Omega(\overline{QC}) \leftarrow$ kFN_join(k, $\overline{QC}$, $\overline{P}$) // $\Omega(\overline{QC}) = \{\langle q, \Omega(q)\rangle \mid q \in \overline{QC}\}$
8:   $\Omega(Q) \leftarrow \Omega(Q) \cup \Omega(\overline{QC})$ // the kFN join result for $\overline{QC}$ is added to $\Omega(Q)$
9: return $\Omega(Q)$ // $\Omega(Q)$ is returned once the kFN join for every query cluster in $\overline{Q}$ is complete
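A C++ skeleton mirroring Algorithm 1 might look as follows; the container choices and the helper signatures (two_step_clustering for Section 3.1 and kfn_join for Algorithm 2, both supplied by the caller here) are assumptions for illustration.

```cpp
#include <functional>
#include <unordered_map>
#include <vector>

using Point = int;
using Cluster = std::vector<Point>;                               // one query/data cluster
using JoinResult = std::unordered_map<Point, std::vector<Point>>; // q -> its k FNs

// Algorithm 1 as a driver: cluster both point sets once, then perform the
// kFN join one query cluster at a time.
JoinResult cnlj(
    int k, const std::vector<Point>& Q, const std::vector<Point>& P,
    const std::function<std::vector<Cluster>(const std::vector<Point>&)>& two_step_clustering,
    const std::function<JoinResult(int, const Cluster&, const std::vector<Cluster>&)>& kfn_join) {
    JoinResult omega;                                  // result set, initially empty
    std::vector<Cluster> Qc = two_step_clustering(Q);  // step 1: query clusters
    std::vector<Cluster> Pc = two_step_clustering(P);  //         data clusters
    for (const Cluster& qc : Qc) {                     // step 2: join per query cluster
        JoinResult part = kfn_join(k, qc, Pc);         // Omega(QC) = {<q, Omega(q)>}
        omega.insert(part.begin(), part.end());        // merge into Omega(Q)
    }
    return omega;
}
```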
Algorithm 2 describes the kFN join algorithm for a query cluster $\overline{QC}$. First, kFN queries are evaluated at the border points of $\overline{QC}$ to collect the candidate data points for the query points in $\overline{QC}$ (lines 4–7), as described in Algorithm 3. Then, each query point q in $\overline{QC}$ retrieves the k FNs for q among the candidate data points in $\Omega(B(\overline{QC}))$ (lines 8–11), as detailed in Algorithm 4. Finally, the kFN join result $\Omega(\overline{QC})$ for the query points in $\overline{QC}$ is returned after each query point q in $\overline{QC}$ has retrieved its k FNs from the candidate data points (line 12).
Algorithm 2 kFN_join(k, $\overline{QC}$, $\overline{P}$)
Input: k: number of FNs for q, $\overline{QC}$: query cluster, and $\overline{P}$: set of data clusters
Output: $\Omega(\overline{QC})$: set of ordered pairs of each query point q in $\overline{QC}$ and the set of k FNs for q, i.e., $\Omega(\overline{QC}) = \{\langle q, \Omega(q)\rangle \mid q \in \overline{QC}\}$
1: $\Omega(\overline{QC}) \leftarrow \emptyset$ // the result set for query points in $\overline{QC}$ is initialized to the empty set
2: $\Omega(B(\overline{QC})) \leftarrow \emptyset$ // note that $\Omega(B(\overline{QC})) = \{\langle b_q, \Omega(b_q)\rangle \mid b_q \in B(\overline{QC})\}$
3: $l \leftarrow \max\{dist(b_{q_i}, b_{q_j}) \mid b_{q_i}, b_{q_j} \in B(\overline{QC})\}$ // l is the maximum distance between border points of $\overline{QC}$
4: // Step 1: a kFN query is evaluated at each border point $b_q$ of $\overline{QC}$
5: for each border point $b_q \in B(\overline{QC})$ do
6:   $\Omega(b_q) \leftarrow$ find_candidates(k, l, $b_q$, $\overline{P}$) // a kFN query is evaluated at $b_q$, as detailed in Algorithm 3
7:   $\Omega(B(\overline{QC})) \leftarrow \Omega(B(\overline{QC})) \cup \Omega(b_q)$ // $\Omega(B(\overline{QC}))$ collects candidate data points for query points in $\overline{QC}$
8: // Step 2: each query point q retrieves its k FNs among the candidate data points in $\Omega(B(\overline{QC}))$
9: for each query point $q \in \overline{QC}$ do
10:   $\Omega(q) \leftarrow$ retrieve_kFN(k, q, $\Omega(B(\overline{QC}))$) // $\Omega(B(\overline{QC}))$ is the set of candidate data points for q
11:   $\Omega(\overline{QC}) \leftarrow \Omega(\overline{QC}) \cup \{\langle q, \Omega(q)\rangle\}$
12: return $\Omega(\overline{QC})$ // $\Omega(\overline{QC})$ is returned once the kFN search is performed for each query point in $\overline{QC}$
Algorithm 3 describes the kFN query processing algorithm that finds candidate data points at a border point $b_q$ of a query cluster. Note that the kFN query result for $b_q$ includes candidate data points for the query points in $\overline{QC}$. The set of k FNs for $b_q$, $\Omega(b_q)$, is initialized to the empty set (line 1). The third argument l indicates the maximum distance between border points of $\overline{QC}$, i.e., $l = \max\{dist(b_{q_i}, b_{q_j}) \mid b_{q_i}, b_{q_j} \in B(\overline{QC})\}$. The sentinel distance, which determines whether a data point p is a candidate point for $\overline{QC}$, is initialized to $sntl\_dist \leftarrow 0$. The maximum and minimum distances from $b_q$ to the data clusters in $\overline{P}$ are computed, as described in Section 3.2. The data clusters are then sorted in decreasing order of their maximum distance to $b_q$ and explored sequentially. If the maximum distance from $b_q$ to the data cluster $\overline{PC}$ to be explored next is smaller than the sentinel distance, i.e., $maxdist(b_q, \overline{PC}) < sntl\_dist$, the remaining unexplored data clusters do not need to be considered, because their data points cannot be candidate data points for any query point in $\overline{QC}$. Otherwise (i.e., $maxdist(b_q, \overline{PC}) \geq sntl\_dist$), each data point p in $\overline{PC}$ is examined to determine whether p is a candidate point for the query points in $\overline{QC}$. For this, $dist(b_q, p)$ is computed. If $b_q$ is inside $\overline{PC}$, the distance from $b_q$ to p is simply computed using a graph search algorithm such as Dijkstra's algorithm [39]. Otherwise (i.e., if $b_q$ is outside $\overline{PC}$), the distance evaluates to $dist(b_q, p) = \min\{dist(b_q, b_p) + dist(b_p, p) \mid b_p \in B(\overline{PC})\}$, where $b_q$ is a border point of $\overline{QC}$ and $b_p$ is a border point of $\overline{PC}$. This is because, if $b_q$ is outside $\overline{PC}$, the shortest path from $b_q$ to p must pass through a border point $b_p$ of $\overline{PC}$, i.e., $b_q \rightarrow b_p \rightarrow p$. If $dist(b_q, p) \geq sntl\_dist$, then p is added to $\Omega(b_q)$ as a candidate data point for $\overline{QC}$. Because the sentinel distance grows as the search proceeds, redundant data points may remain in $\Omega(b_q)$ and should be removed. Thus, each data point p in $\Omega(b_q)$ is examined to verify that it still qualifies as a candidate data point, i.e., $dist(b_q, p) \geq sntl\_dist$; if it does not, it is removed from $\Omega(b_q)$. Finally, the kFN query result for $b_q$, $\Omega(b_q)$, is returned when the maximum distance from $b_q$ to the next data cluster $\overline{PC}$ is smaller than the sentinel distance (lines 10–12) or when every data cluster has been examined.
Algorithm 3 find_candidates(k, l, $b_q$, $\overline{P}$)
Input: k: number of FNs for q, l: maximum distance between border points of $\overline{QC}$, $b_q$: border point of $\overline{QC}$, and $\overline{P}$: set of data clusters
Output: $\Omega(b_q)$: set of k FNs for $b_q$
1: $\Omega(b_q) \leftarrow \emptyset$ // the set of k FNs for the border point $b_q$ is initialized to the empty set
2: $sntl\_dist \leftarrow 0$ // the sentinel distance $sntl\_dist$ is initialized to 0
3: // the maximum and minimum distances from $b_q$ to the data clusters in $\overline{P}$ are computed as explained in Section 3.2
4: for each data cluster $\overline{PC} \in \overline{P}$ do
5:   compute $maxdist(b_q, \overline{PC})$ and $mindist(b_q, \overline{PC})$
6: // the data clusters in $\overline{P}$ are sorted in decreasing order of their maximum distance to $b_q$
7: $\overline{P} \leftarrow$ sort_data_clusters($\overline{P}$) // $\overline{P}$ now contains the sorted data clusters for $b_q$
8: // data clusters are explored sequentially
9: for each sorted data cluster $\overline{PC} \in \overline{P}$ do
10:   if $maxdist(b_q, \overline{PC}) < sntl\_dist$ then
11:     // note that $sntl\_dist$ is updated in line 24
12:     go to line 26 // the remaining data clusters do not need to be explored
13:   // each data point p in $\overline{PC}$ is sequentially explored to find the k FNs for $b_q$
14:   for each data point $p \in \overline{PC}$ do
15:     // $dist(b_q, p)$ is computed for the two cases $b_q \in \overline{PC}$ and $b_q \notin \overline{PC}$
16:     if $b_q$ is inside $\overline{PC}$ then
17:       $dist(b_q, p)$ is computed using a graph search algorithm such as Dijkstra's algorithm [39]
18:     else
19:       $dist(b_q, p) \leftarrow \min\{dist(b_q, b_p) + dist(b_p, p) \mid b_p \in B(\overline{PC})\}$ // the path from $b_q$ to p is $b_q \rightarrow b_p \rightarrow p$
20:     // p is added to $\Omega(b_q)$ if $dist(b_q, p) \geq sntl\_dist$
21:     if $dist(b_q, p) \geq sntl\_dist$ then
22:       // $\Omega(b_q)$ collects candidate data points for query points in $\overline{QC}$
23:       $\Omega(b_q) \leftarrow \Omega(b_q) \cup \{p\}$ // p is added to $\Omega(b_q)$
24:       $sntl\_dist \leftarrow dist(b_q, p_{kth}) - l$ // $p_{kth}$ is the current kth FN of $b_q$
25: // redundant data points are removed from $\Omega(b_q)$ because they can be kFNs of no query point in $\overline{QC}$
26: for each data point $p \in \Omega(b_q)$ do
27:   if $dist(b_q, p) < sntl\_dist$ then
28:     $\Omega(b_q) \leftarrow \Omega(b_q) \setminus \{p\}$ // p cannot be a candidate data point for $\overline{QC}$ and is removed from $\Omega(b_q)$
29: return $\Omega(b_q)$ // $\Omega(b_q)$ is returned after candidate data points are collected for query points in $\overline{QC}$
Algorithm 4 describes how a query point q in $\overline{QC}$ retrieves the k FNs for q among the candidate data points in $\Omega(B(\overline{QC}))$. First, $\Omega(q)$ is initialized to the empty set. The distance between q and a candidate data point p is computed for the two cases $p \in \overline{QC}$ and $p \notin \overline{QC}$. If p is inside $\overline{QC}$, i.e., $p \in \overline{QC}$, the distance from q to p is simply computed using a graph search algorithm [39]. Otherwise (i.e., $p \notin \overline{QC}$), the distance evaluates to $dist(q, p) = \min\{dist(q, b_q) + dist(b_q, p) \mid b_q \in B(\overline{QC})\}$, because the shortest path from q to p must pass through a border point $b_q$ of $\overline{QC}$, i.e., $q \rightarrow b_q \rightarrow p$. When $dist(q, p)$ has been computed, two conditions are checked to determine whether the data point p is added to $\Omega(q)$: if the cardinality of $\Omega(q)$ is smaller than k, i.e., $|\Omega(q)| < k$, then p is simply added to $\Omega(q)$; otherwise, if p is farther from q than the current kth FN $p_{kth}$ of q, i.e., $dist(q, p) > dist(q, p_{kth})$, then p is added to $\Omega(q)$ and $p_{kth}$ is removed from $\Omega(q)$. Finally, when every candidate data point has been explored, the kFN query result for q, $\Omega(q)$, is returned.
Algorithm 4 retrieve_kFN(k, q, $\Omega(B(\overline{QC}))$)
Input: k: number of FNs for q, q: query point in $\overline{QC}$, and $\Omega(B(\overline{QC}))$: set of candidate data points for q
Output: $\Omega(q)$: set of k FNs for q
1: $\Omega(q) \leftarrow \emptyset$ // $\Omega(q)$ is initialized to the empty set
2: // $\Omega(B(\overline{QC}))$ is the set of candidate data points for q
3: for each candidate data point $p \in \Omega(B(\overline{QC}))$ do
4:   if p is inside $\overline{QC}$ then
5:     $dist(q, p)$ is computed using a graph search algorithm such as Dijkstra's algorithm [39]
6:   else
7:     $dist(q, p) \leftarrow \min\{dist(q, b_q) + dist(b_q, p) \mid b_q \in B(\overline{QC})\}$ // note that $dist(b_q, p)$ was computed in Algorithm 3
8:   // p is added to $\Omega(q)$ if it satisfies either of the two conditions below
9:   if $|\Omega(q)| < k$ then
10:     $\Omega(q) \leftarrow \Omega(q) \cup \{p\}$
11:   else if $|\Omega(q)| = k$ and $dist(q, p) > dist(q, p_{kth})$ then
12:     // note that $p_{kth}$ is the current kth FN of q
13:     $\Omega(q) \leftarrow \Omega(q) \cup \{p\} \setminus \{p_{kth}\}$
14: return $\Omega(q)$
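Algorithm 4 amounts to keeping the k candidates with the largest distances to q. A compact C++ rendering using a min-heap keyed on distance (so the current kth FN sits at the root and is evicted first) is sketched below; the dist_q callback stands in for lines 4–7, and every name is an assumption.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Point = int;

// Keep the k candidates farthest from q. The caller supplies dist_q, which
// computes dist(q, p) either by graph search (p inside the query cluster) or
// as the minimum over the cluster's border points (lines 4-7 of Algorithm 4).
std::vector<Point> retrieve_kfn(int k, Point q, const std::vector<Point>& candidates,
                                const std::function<double(Point, Point)>& dist_q) {
    using Entry = std::pair<double, Point>;           // (dist(q, p), p)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (Point p : candidates) {
        double d = dist_q(q, p);
        if (static_cast<int>(heap.size()) < k) {      // |Omega(q)| < k: insert p
            heap.push({d, p});
        } else if (!heap.empty() && d > heap.top().first) {  // p farther than p_kth
            heap.pop();                               // remove the current kth FN
            heap.push({d, p});
        }
    }
    std::vector<Point> kfn;                           // extracted in increasing distance
    while (!heap.empty()) { kfn.push_back(heap.top().second); heap.pop(); }
    return kfn;
}
```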
Lemma 1 proves that each query point q in a query cluster $\overline{QC}$ can retrieve the k FNs for q among the candidate data points in $\Omega(B(\overline{QC}))$.
Lemma 1.
Each query point q in a query cluster $\overline{QC}$ can retrieve the k FNs for q among the candidate data points in $\Omega(B(\overline{QC}))$.
Proof. 
Lemma 1 is proved by contradiction. Assume that there is a qualified data point p such that $p \in \Omega(q)$ and $p \notin \Omega(B(\overline{QC}))$. The qualified data point p is farther from q than the kth FN $p_{kth}$ of a border point $b_q$ of $\overline{QC}$, which means that $dist(q, p) > dist(q, p_{kth})$. According to Algorithm 3, it holds that $dist(q, b_q) \leq l$ and $dist(b_q, p_{kth}) - dist(b_q, p) > l$, where $l = \max\{dist(b_{q_i}, b_{q_j}) \mid b_{q_i}, b_{q_j} \in B(\overline{QC})\}$. Thus, the distance from q to p via $b_q$ is smaller than the distance from $b_q$ to $p_{kth}$, i.e., $dist(q, b_q) + dist(b_q, p) < dist(b_q, p_{kth})$, because $dist(q, b_q) \leq l$ and $dist(b_q, p_{kth}) > dist(b_q, p) + l$ are given. Clearly, $dist(q, b_q) + dist(b_q, p) < dist(b_q, p_{kth})$ implies that $dist(q, p) < dist(q, p_{kth})$, which contradicts the assumption that $dist(q, p) > dist(q, p_{kth})$. Therefore, each query point q in a query cluster $\overline{QC}$ can retrieve the k FNs for q among the candidate data points in $\Omega(B(\overline{QC}))$. □
The CNLJ and nonclustering join algorithms for spatial networks have different time complexities, as shown in Table 4. Notably, the CNLJ algorithm is orthogonal to the kFN query processing algorithm, which can easily be incorporated into the CNLJ algorithm; for simplicity, the simple solution for finding the k FNs of a single query point is used in this analysis. The time complexity of processing one kFN query is $O(|E| + |V| \log |V| + |P| \log |P|)$. The CNLJ algorithm evaluates at most $M \cdot |\overline{Q}|$ kFN queries, where M is the maximum number of border points of a query cluster, i.e., $M = \max\{|B(\overline{QC})| \mid \overline{QC} \in \overline{Q}\}$. The nonclustering join algorithm evaluates $|Q|$ kFN queries, because a kFN query must be evaluated for each query point separately. Thus, treating M as a small constant, the time complexities of the CNLJ and nonclustering join algorithms are $O(|\overline{Q}| \cdot (|E| + |V| \log |V| + |P| \log |P|))$ and $O(|Q| \cdot (|E| + |V| \log |V| + |P| \log |P|))$, respectively. These theoretical results imply that the CNLJ algorithm runs faster than the nonclustering join algorithm, particularly when $|\overline{Q}| \ll |Q|$, i.e., when the query points are densely clustered, and that it exhibits performance similar to the nonclustering join algorithm when $|\overline{Q}| \approx |Q|$, i.e., when the query points are not clustered.

4.2. Evaluating kFN Queries at Border Points

The CNLJ algorithm evaluates kFN queries at the border points of the query clusters. For the example kFN join query, the CNLJ algorithm evaluates kFN queries at the border points $q_1$, $v_1$, and $v_2$ rather than at the query points $q_1$, $q_2$, $q_3$, and $q_4$. First, the kFN query is evaluated at the border point $q_1$. The maximum and minimum distances between $q_1$ and each data cluster in $\overline{P}$ are computed, and the data clusters are sorted in descending order of their maximum distance to $q_1$. As shown in Figure 11, the two data clusters $\{\overline{p_1 p_2 p_3}\}$ and $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$ are arranged by their maximum distance to $q_1$ as $\overline{P} = \{\{\overline{p_1 p_2 p_3}\}, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}\}$, because $maxdist(q_1, \{\overline{p_1 p_2 p_3}\}) = 28$ and $maxdist(q_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 11$, as described in Table 3. The border point $q_1$ thus investigates $\{\overline{p_1 p_2 p_3}\}$ first. After exploring $\{\overline{p_1 p_2 p_3}\}$, $q_1$ selects $p_2$ and $p_3$ as its two FNs, because $dist(q_1, p_1) = 24$, $dist(q_1, p_2) = 25$, and $dist(q_1, p_3) = 27$ are computed, as shown in Figure 5. The sentinel distance for $q_1$ is $sntl\_dist = 20$, because the maximum distance l between the border points of $\overline{QC}$ is $l = dist(q_1, v_1) = 5$, whereas the distance from $q_1$ to its second FN $p_2$ is $dist(q_1, p_2) = 25$. The set of candidate data points for the query points in $\overline{QC}$ is $\Omega(q_1) = \{p_1, p_2, p_3\}$, because $dist(q_1, p_1) \geq sntl\_dist$, $dist(q_1, p_2) \geq sntl\_dist$, and $dist(q_1, p_3) \geq sntl\_dist$. Clearly, $q_1$ does not examine the other data cluster $\{\overline{p_4 p_5}, \overline{p_5 p_6}\}$, because $sntl\_dist$ is larger than $maxdist(q_1, \{\overline{p_4 p_5}, \overline{p_5 p_6}\}) = 11$, as shown in Table 3.
The kFN queries at the other border points $v_1$ and $v_2$ are evaluated similarly. The two FNs of $v_1$ are $p_2$ and $p_3$, as illustrated in Figures 7 and 8. Thus, the set of candidate data points at $v_1$ is $\Omega(v_1) = \{p_1, p_2, p_3\}$, because the sentinel distance for $v_1$ is $sntl\_dist(v_1) = 15$. Similarly, the two FNs of $v_2$ are $p_1$ and $p_2$, as illustrated in Figures 9 and 10. Thus, the set of candidate data points at $v_2$ is $\Omega(v_2) = \{p_1, p_2, p_3\}$, because the sentinel distance for $v_2$ is $sntl\_dist(v_2) = 18$. Table 5 summarizes the sets of candidate data points for the border points $q_1$, $v_1$, and $v_2$ and their sentinel distances.

4.3. Evaluating an Example kFN Join Query

The CNLJ algorithm retrieves the k FNs for each query point in $\overline{QC}$ among the candidate data points in $\Omega(B(\overline{QC}))$. The example kFN join query requires two FNs for each query point, i.e., $k = 2$, and the set of candidate data points is $\Omega(B(\overline{QC})) = \{p_1, p_2, p_3\}$. Each of $q_1$, $q_2$, $q_3$, and $q_4$ retrieves its two FNs among the candidate data points $p_1$, $p_2$, and $p_3$, one query point at a time. Let us find the two FNs for $q_1$. The distances from $q_1$ to $p_1$, $p_2$, and $p_3$ are computed using the fact that the shortest path from $q_1$ to a candidate data point passes through a border point $b_q$. Accordingly, the length of the shortest path from $q_1$ to $p_1$ is $dist(q_1, p_1) = \min\{dist(q_1, b_q) + dist(b_q, p_1) \mid b_q \in \{q_1, v_1, v_2\}\} = \min\{dist(q_1, q_1) + dist(q_1, p_1),\ dist(q_1, v_1) + dist(v_1, p_1),\ dist(q_1, v_2) + dist(v_2, p_1)\} = \min\{24, 24, 27\} = 24$. Similarly, the distances from $q_1$ to $p_2$ and $p_3$ evaluate to $dist(q_1, p_2) = \min\{dist(q_1, b_q) + dist(b_q, p_2) \mid b_q \in \{q_1, v_1, v_2\}\} = 25$ and $dist(q_1, p_3) = \min\{dist(q_1, b_q) + dist(b_q, p_3) \mid b_q \in \{q_1, v_1, v_2\}\} = 27$. Thus, $p_2$ and $p_3$ are the two FNs for $q_1$, whose result set is $\Omega(q_1) = \{p_2, p_3\}$. Next, the two FNs for $q_2$ are retrieved among the candidate data points $p_1$, $p_2$, and $p_3$. As shown in Table 6, the distances from $q_2$ to $p_1$, $p_2$, and $p_3$ evaluate to $dist(q_2, p_1) = 25$, $dist(q_2, p_2) = 26$, and $dist(q_2, p_3) = 25$, respectively. Thus, $p_1$ and $p_2$ are the two FNs for $q_2$, whose result set is $\Omega(q_2) = \{p_1, p_2\}$. The two FNs for $q_3$ and $q_4$ are retrieved among the candidate data points in the same manner. Table 6 lists the distance from each query point $q \in \{q_1, q_2, q_3, q_4\}$ to each candidate data point $p \in \{p_1, p_2, p_3\}$ and the two FNs retrieved for each query point. Finally, the kFN join result is the union of the kFN query results for the query points in Q: $\Omega(Q) = \Omega(q_1) \cup \Omega(q_2) \cup \Omega(q_3) \cup \Omega(q_4) = \{\langle q_1, \{p_2, p_3\}\rangle, \langle q_2, \{p_1, p_2\}\rangle, \langle q_3, \{p_2, p_3\}\rangle, \langle q_4, \{p_2, p_3\}\rangle\}$.

5. Performance Evaluation

The CNLJ algorithm and its competitors are compared empirically in this section under a variety of conditions. Section 5.1 describes the experimental conditions, and Section 5.2 reports the results of the experiment.

5.1. Experimental Settings

Table 7 describes the two real-world roadmaps [40] used in the experiments. These roadmaps have different sizes and are parts of the road network of the United States. For convenience, the data universe was normalized to the unit square. The query and data points were generated to mimic the highly skewed distributions of POIs in the real world. First, the centroids $c_1, c_2, \ldots, c_m$ were chosen randomly inside the data universe, where m indicates the total number of centroids and varies between 1 and 10. The query and data points around each centroid follow a normal distribution whose mean is the centroid and whose standard deviation is $\sigma = 10^{-2}$. Table 8 shows the experimental parameter settings. In each experiment, a single parameter was varied within its range while the other parameters were kept at their default values (shown in bold).
The baseline algorithm, a nonclustering join algorithm that sequentially computes the k FNs of each query point in Q, was used as a benchmark for evaluating the CNLJ algorithm. We implemented and evaluated two versions of the proposed solution, CNLJ_NV and CNLJ_OPT. The naive version, CNLJ_NV, groups query points into query segments, as illustrated in Figure 3a; thus, CNLJ_NV evaluates at most two kFN queries per query segment. The optimized version, CNLJ_OPT, groups query points into query clusters using the two-step clustering method, as illustrated in Figure 3b. The source code for the empirical evaluation in this study is available on GitHub at https://github.com/Hyung-Ju-Cho/ (accessed on 8 February 2021). All join algorithms were implemented in C++ in the Microsoft Visual Studio 2019 development environment, and common subroutines were reused across the algorithms for similar tasks. The experiments were conducted on a desktop computer running Windows 10 with 32 GB of RAM and an 8-core processor (i9-9900) at 3.1 GHz. As in several existing studies of online map services [36,41], this empirical study assumes that the indexing structures of all algorithms reside in main memory so that kFN join queries can be evaluated quickly. The average time required to answer kFN join queries was measured over repeated experiments. Finally, the network distance between two points was computed quickly using the TNR method [42], which is easy to implement and demonstrates performance comparable to other shortest distance algorithms [38,41,43,44,45].
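As a hedged illustration of the workload generator described above (centroids uniform in the unit square; points normally distributed around a randomly chosen centroid with σ = 10⁻²), a sketch follows; mapping the generated coordinates onto road network vertices is omitted, and all names are assumptions.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct XY { double x, y; };

// Generate n points around m random centroids in the unit square, mimicking
// the skewed POI distributions used in the experiments (sigma = 0.01).
std::vector<XY> generate_points(int n, int m, double sigma = 0.01, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, m - 1);
    std::vector<XY> centroids(m);
    for (auto& c : centroids) c = {uni(gen), uni(gen)};  // centroids chosen uniformly
    std::vector<XY> pts(n);
    for (auto& p : pts) {
        const XY& c = centroids[pick(gen)];              // each point picks a centroid
        std::normal_distribution<double> nx(c.x, sigma), ny(c.y, sigma);
        p = {std::clamp(nx(gen), 0.0, 1.0),              // keep points inside the
             std::clamp(ny(gen), 0.0, 1.0)};             // normalized data universe
    }
    return pts;
}
```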

5.2. Experimental Results

Figure 12 compares the proposed CNLJ_OPT and CNLJ_NV algorithms and the baseline algorithm on the NA roadmap. Each chart shows the kFN join query processing time and the number of kFN queries required to evaluate the kFN join query; in Figures 12–14, the numbers of kFN queries required by the CNLJ_OPT, CNLJ_NV, and baseline algorithms are shown in parentheses. Note that the CNLJ_OPT algorithm evaluates kFN queries at the border points of query clusters, the CNLJ_NV algorithm evaluates kFN queries at the endpoints of query segments, and the baseline algorithm evaluates kFN queries at the query points themselves. As a result, the baseline algorithm evaluates as many kFN queries as there are query points, $|Q|$. Figure 12a shows the query processing times of the three algorithms when the number of query points varies between 1000 and 5000, i.e., $1000 \le |Q| \le 5000$. For all values of $|Q|$, the CNLJ_OPT algorithm is faster than the CNLJ_NV and baseline algorithms. When $|Q| = 5000$, the CNLJ_OPT, CNLJ_NV, and baseline algorithms evaluate 281, 471, and 5000 kFN queries, respectively, and the CNLJ_OPT algorithm is accordingly 1.2 and 36.7 times faster than the CNLJ_NV and baseline algorithms, respectively. Figure 12b shows the query processing times when the number of data points varies from 1000 to 5000, i.e., $1000 \le |P| \le 5000$. Regardless of the $|P|$ value, the CNLJ_OPT, CNLJ_NV, and baseline algorithms evaluate 58, 96, and 1000 kFN queries, respectively. When $|P| = 3000$, the CNLJ_OPT algorithm outperforms the CNLJ_NV and baseline algorithms by 1.2 and 15.6 times, respectively. Figure 12c shows the query processing times when the number of required FNs varies between 1 and 16, i.e., $1 \le k \le 16$. For all values of k, the CNLJ_OPT algorithm outperforms the CNLJ_NV and baseline algorithms by 1.2 and 13.4 times, respectively. The query processing times of all three algorithms are unaffected by the k value, because the kFN query evaluation computes the distances from a query point to the data clusters regardless of the k value and then sorts the data clusters by these distances. Figure 12d shows the query processing times when the number of centroids for the query points in Q varies between 1 and 10, i.e., $1 \le |C_Q| \le 10$. As the $|C_Q|$ value increases, the difference between the query processing times of the algorithms decreases: the CNLJ_OPT algorithm is 13.3, 1.3, 1.6, 1.7, and 1.1 times faster than the baseline algorithm when $|C_Q|$ = 1, 3, 5, 7, and 10, respectively. The reason is that, as the $|C_Q|$ value increases, the query points become widely dispersed and the number of query clusters increases, which slows down the CNLJ_OPT algorithm. Figure 12e shows the query processing times when the number of centroids for the data points in P varies between 1 and 10, i.e., $1 \le |C_P| \le 10$. The kFN query processing time increases with the $|C_P|$ value, because as the $|C_P|$ value increases, the data points become widely dispersed and the number of data clusters to be examined by each kFN query also increases. To summarize, the CNLJ_OPT algorithm outperforms the CNLJ_NV and baseline algorithms in all cases, confirming that it benefits from clustering nearby query points and retrieving candidate data points for them all at once.
Figure 13 compares the performance of the CNLJ_OPT, CNLJ_NV, and baseline algorithms on the SJ roadmap. The experimental results on the SJ roadmap exhibit performance patterns similar to those on the NA roadmap. Figure 13a shows the query processing times when $1000 \le |Q| \le 5000$; the CNLJ_OPT algorithm is 1.2 and 6.0 times faster than the CNLJ_NV and baseline algorithms, respectively, when $|Q| = 5000$. Figure 13b shows the query processing times when $1000 \le |P| \le 5000$; the CNLJ_OPT algorithm is 1.2 and 4.5 times faster than the CNLJ_NV and baseline algorithms, respectively, when $|P| = 4000$, and the query processing times of all algorithms increase with the $|P|$ value. Figure 13c shows the query processing times when $1 \le k \le 16$; the CNLJ_OPT algorithm is 1.2 and 3.5 times faster than the CNLJ_NV and baseline algorithms, respectively, and the query processing times are nearly constant regardless of the k value. Figure 13d shows the query processing times when $1 \le |C_Q| \le 10$; the CNLJ_OPT algorithm is 3.5, 2.4, 1.8, 1.4, and 1.4 times faster than the baseline algorithm when $|C_Q|$ = 1, 3, 5, 7, and 10, respectively. This result shows that the distribution of the query points affects the query processing time of the CNLJ_OPT algorithm: when the query points are widely dispersed, the number of query clusters grows and the query processing time of the CNLJ_OPT algorithm increases. Figure 13e shows the query processing times when $1 \le |C_P| \le 10$; the CNLJ_OPT algorithm is 2.8, 4.0, 3.5, 2.9, and 3.0 times faster than the baseline algorithm when $|C_P|$ = 1, 3, 5, 7, and 10, respectively.
Figure 14 compares the performance of the CNLJ_OPT, CNLJ_NV, and baseline algorithms while the numbers of query and data points vary between 1000 and 10,000, i.e., $1000 \le |Q| \le 10{,}000$ and $1000 \le |P| \le 10{,}000$, to verify the scalability of the CNLJ_OPT algorithm. As shown in Figure 14a,c, the CNLJ_OPT algorithm runs faster than the CNLJ_NV and baseline algorithms for all values of $|Q|$, and the performance difference between them typically increases with $|Q|$. Specifically, when $|Q| = 10{,}000$, the CNLJ_OPT algorithm runs 36.6 and 5.3 times faster than the baseline algorithm on the NA and SJ roadmaps, respectively. As shown in Figure 14b,d, the CNLJ_OPT algorithm runs faster than the CNLJ_NV and baseline algorithms for all values of $|P|$. Specifically, when $|P| = 10{,}000$, the CNLJ_OPT algorithm runs 6.4 and 3.0 times faster than the baseline algorithm on the NA and SJ roadmaps, respectively. These experimental results confirm that the CNLJ_OPT algorithm scales better with both $|Q|$ and $|P|$ than the CNLJ_NV and baseline algorithms.

6. Discussion and Conclusions

Given a positive integer k, a set of query points Q, and a set of data points P, the kFN join query pairs each query point in Q with its k FNs in P. The kFN join query has various real-life applications, including recommender systems and computational geometry [6,7,8,9,10,11,12,13,14]. In particular, efficient processing of kFN join queries can aid in selecting a facility location that is farthest away from unpleasant facilities such as garbage incinerators, crematoriums, and chemical plants. In this study, a cluster nested loop join (CNLJ) algorithm was developed to efficiently answer kFN join queries in spatial networks; to the best of our knowledge, this is the first attempt to study kFN join queries in spatial networks. The CNLJ algorithm converts query points (data points) into query clusters (data clusters). It then retrieves candidate data points for the clustered query points all at once, eliminating the need to search for candidate data points for each query point separately. The query processing times of the CNLJ algorithm and the conventional join algorithms were compared empirically using real-life roadmaps under various conditions. The experimental results demonstrated that the CNLJ algorithm runs up to 50.8 times faster than the conventional join algorithms and scales better with the numbers of both data and query points. However, the CNLJ algorithm shows performance similar to the conventional join algorithms when the query points are uniformly distributed over the region. In future work, we intend to apply the proposed solution to various fields. First, when the dataset does not fit in main memory, we will build index structures on external memory. Second, we will conduct an empirical study that simulates real-life scenarios using real datasets. Third, we will improve the CNLJ algorithm for the efficient processing of kFN joins over query points that are uniformly scattered over the region.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2020R1I1A3052713).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author thanks the anonymous reviewers for their very useful comments and suggestions.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Said, A.; Kille, B.; Jain, B.J.; Albayrak, S. Increasing diversity through furthest neighbor-based recommendation. In Proceedings of the International Workshop on Diversity in Document Retrieval, Seattle, WA, USA, 12 February 2012; pp. 1–4.
2. Said, A.; Fields, B.; Jain, B.J.; Albayrak, S. User-centric evaluation of a k-furthest neighbor collaborative filtering recommender algorithm. In Proceedings of the International Conference on Computer Supported Cooperative Work and Social Computing, San Antonio, TX, USA, 23–27 February 2013; pp. 1399–1408.
3. Veenman, C.J.; Reinders, M.J.T.; Backer, E. A maximum variance cluster algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1273–1280.
4. Defays, D. An efficient algorithm for a complete link method. Comput. J. 1977, 20, 364–366.
5. Vasiloglou, N.; Gray, A.G.; Anderson, D.V. Scalable semidefinite manifold learning. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, Cancun, Mexico, 16–19 October 2008; pp. 368–373.
6. Curtin, R.R.; Echauz, J.; Gardner, A.B. Exploiting the structure of furthest neighbor search for fast approximate results. Inf. Syst. 2019, 80, 124–135.
7. Gao, Y.; Shou, L.; Chen, K.; Chen, G. Aggregate farthest-neighbor queries over spatial data. In Proceedings of the International Conference on Database Systems for Advanced Applications, Hong Kong, China, 22–25 April 2011; pp. 149–163.
8. Liu, J.; Chen, H.; Furuse, K.; Kitagawa, H. An efficient algorithm for arbitrary reverse furthest neighbor queries. In Proceedings of the Asia-Pacific Web Conference on Web Technologies and Applications, Kunming, China, 11–13 April 2012; pp. 60–72.
9. Liu, W.; Yuan, Y. New ideas for FN/RFN queries based nearest Voronoi diagram. In Proceedings of the International Conference on Bio-Inspired Computing: Theories and Applications, Huangshan, China, 12–14 July 2013; pp. 917–927.
10. Tran, Q.T.; Taniar, D.; Safar, M. Reverse k nearest neighbor and reverse farthest neighbor search on spatial networks. Trans. Large-Scale Data-Knowl.-Cent. Syst. 2009, 1, 353–372.
11. Wang, H.; Zheng, K.; Su, H.; Wang, J.; Sadiq, S.W.; Zhou, X. Efficient aggregate farthest neighbour query processing on road networks. In Proceedings of the Australasian Database Conference on Databases Theory and Applications, Brisbane, Australia, 14–16 July 2014; pp. 13–25.
12. Xiao, Y.; Liu, B.; Hao, Z.; Cao, L. A k-farthest-neighbor-based approach for support vector data description. Appl. Intell. 2014, 41, 196–211.
13. Xu, X.-J.; Bao, J.-S.; Yao, B.; Zhou, J.-Y.; Tang, F.-L.; Guo, M.-Y.; Xu, J.-Q. Reverse furthest neighbors query in road networks. J. Comput. Sci. Technol. 2017, 32, 155–167.
14. Yao, B.; Li, F.; Kumar, P. Reverse furthest neighbors in spatial databases. In Proceedings of the International Conference on Data Engineering, Shanghai, China, 29 March–2 April 2009; pp. 664–675.
15. Dutta, B.; Karmakar, A.; Roy, S. Optimal facility location problem on polyhedral terrains using descending paths. Theor. Comput. Sci. 2020, 847, 68–75.
16. Gao, X.; Park, C.; Chen, X.; Xie, E.; Huang, G.; Zhang, D. Globally optimal facility locations for continuous-space facility location problems. Appl. Sci. 2021, 11, 7321.
17. Liu, W.; Wang, H.; Zhang, Y.; Qin, L.; Zhang, W. I/O efficient algorithm for c-approximate furthest neighbor search in high-dimensional space. In Proceedings of the International Conference on Database Systems for Advanced Applications, Jeju, Korea, 24–27 September 2020; pp. 221–236.
18. Huang, Q.; Feng, J.; Fang, Q.; Ng, W. Two efficient hashing schemes for high-dimensional furthest neighbor search. IEEE Trans. Knowl. Data Eng. 2017, 29, 2772–2785.
19. Liu, Y.; Gong, X.; Kong, D.; Hao, T.; Yan, X. A Voronoi-based group reverse k farthest neighbor query method in the obstacle space. IEEE Access 2020, 8, 50659–50673.
20. Pagh, R.; Silvestri, F.; Sivertsen, J.; Skala, M. Approximate furthest neighbor in high dimensions. In Proceedings of the International Conference on Similarity Search and Applications, Glasgow, UK, 12–14 October 2015; pp. 3–14.
21. Korn, F.; Muthukrishnan, S. Influence sets based on reverse nearest neighbor queries. In Proceedings of the International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000; pp. 201–212.
22. Wang, S.; Cheema, M.A.; Lin, X.; Zhang, Y.; Liu, D. Efficiently computing reverse k furthest neighbors. In Proceedings of the International Conference on Data Engineering, Helsinki, Finland, 16–20 May 2016; pp. 1110–1121.
23. Beckmann, N.; Kriegel, H.-P.; Schneider, R.; Seeger, B. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the International Conference on Management of Data, Atlantic City, NJ, USA, 23–25 May 1990; pp. 322–331.
24. Guttman, A. R-trees: A dynamic index structure for spatial searching. In Proceedings of the International Conference on Management of Data, Boston, MA, USA, 18–21 June 1984; pp. 47–57.
25. Huang, Q.; Feng, J.; Fang, Q. Reverse query-aware locality-sensitive hashing for high-dimensional furthest neighbor search. In Proceedings of the International Conference on Data Engineering, San Diego, CA, USA, 19–22 April 2017; pp. 167–170.
26. Lu, H.; Yiu, M.L. On computing farthest dominated locations. IEEE Trans. Knowl. Data Eng. 2011, 23, 928–941.
27. Cho, H.-J. Efficient shared execution processing of k-nearest neighbor joins in road networks. Mob. Inf. Syst. 2018, 2018, 55–66.
28. He, D.; Wang, S.; Zhou, X.; Cheng, R. GLAD: A grid and labeling framework with scheduling for conflict-aware kNN queries. IEEE Trans. Knowl. Data Eng. 2021, 33, 1554–1566.
29. Yang, R.; Niu, B. Continuous k nearest neighbor queries over large-scale spatial-textual data streams. ISPRS Int. J. Geo-Inf. 2020, 9, 694.
30. Cho, H.-J.; Attique, M. Group processing of multiple k-farthest neighbor queries in road networks. IEEE Access 2020, 8, 110959–110973.
31. Reza, R.M.; Ali, M.E.; Hashem, T. Group processing of simultaneous shortest path queries in road networks. In Proceedings of the International Conference on Mobile Data Management, Pittsburgh, PA, USA, 15–18 June 2015; pp. 128–133.
32. Zhang, M.; Li, L.; Hua, W.; Zhou, X. Efficient batch processing of shortest path queries in road networks. In Proceedings of the International Conference on Mobile Data Management, Hong Kong, China, 10–13 June 2019; pp. 100–105.
33. Zhang, M.; Li, L.; Hua, W.; Zhou, X. Batch processing of shortest path queries in road networks. In Proceedings of the Australasian Database Conference on Databases Theory and Applications, Sydney, Australia, 29 January–1 February 2019; pp. 3–16.
34. Reza, R.M.; Ali, M.E.; Cheema, M.A. The optimal route and stops for a group of users in a road network. In Proceedings of the International Conference on Advances in Geographic Information Systems, Redondo Beach, CA, USA, 7–10 November 2017; pp. 1–10.
35. Kim, T.; Cho, H.-J.; Hong, H.J.; Nam, H.; Cho, H.; Do, G.Y.; Jeon, P. Efficient processing of k-farthest neighbor queries for road networks. J. Korea Soc. Comput. Inf. 2019, 24, 79–89.
36. Abeywickrama, T.; Cheema, M.A.; Taniar, D. k-nearest neighbors on road networks: A journey in experimentation and in-memory implementation. In Proceedings of the International Conference on Very Large Data Bases, New Delhi, India, 5–9 September 2016; pp. 492–503.
37. Lee, K.C.K.; Lee, W.-C.; Zheng, B.; Tian, Y. ROAD: A new spatial object search framework for road networks. IEEE Trans. Knowl. Data Eng. 2012, 24, 547–560.
38. Zhong, R.; Li, G.; Tan, K.-L.; Zhou, L.; Gong, Z. G-tree: An efficient and scalable index for spatial search on road networks. IEEE Trans. Knowl. Data Eng. 2015, 27, 2175–2189.
39. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms, 3rd ed.; MIT Press and McGraw-Hill: Cambridge, MA, USA, 2009; pp. 643–683.
40. Real Datasets for Spatial Databases. Available online: https://www.cs.utah.edu/~lifeifei/SpatialDataset.htm (accessed on 4 October 2021).
41. Wu, L.; Xiao, X.; Deng, D.; Cong, G.; Zhu, A.D.; Zhou, S. Shortest path and distance queries on road networks: An experimental evaluation. In Proceedings of the International Conference on Very Large Data Bases, Istanbul, Turkey, 27–31 August 2012; pp. 406–417.
42. Bast, H.; Funke, S.; Matijevic, D. Ultrafast shortest-path queries via transit nodes. In Proceedings of the International Workshop on Shortest Path Problem, Piscataway, NJ, USA, 13–14 November 2006; pp. 175–192.
43. Geisberger, R.; Sanders, P.; Schultes, D.; Delling, D. Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In Proceedings of the International Workshop on Experimental Algorithms, Cape Cod, MA, USA, 30 May–2 June 2008; pp. 319–333.
44. Li, Z.; Chen, L.; Wang, Y. G*-tree: An efficient spatial index on road networks. In Proceedings of the International Conference on Data Engineering, Macao, China, 8–11 April 2019; pp. 268–279.
45. Samet, H.; Sankaranarayanan, J.; Alborzi, H. Scalable network distance browsing in spatial databases. In Proceedings of the International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 43–54.
Figure 1. Example of the kFN join of Q with P, where $Q = \{q_1, q_2, q_3\}$ and $P = \{p_1, p_2, p_3, p_4\}$.
Figure 2. Example of the kFN join of Q with P in a spatial network.
Figure 3. Two-step clustering method to group nearby query points into query clusters: (a) converting query points into query segments; (b) converting query segments into query clusters.
Figure 4. Two-step clustering method to group nearby data points into a data cluster: (a) converting data points into data segments; (b) converting data segments into data clusters.
Figure 5. $maxdist(q_1, \{\overline{p_1p_2p_3}\}) = 28$ and $mindist(q_1, \{\overline{p_1p_2p_3}\}) = 24$.
Figure 6. $maxdist(q_1, \{\overline{p_4p_5}, \overline{p_5p_6}\}) = 11$ and $mindist(q_1, \{\overline{p_4p_5}, \overline{p_5p_6}\}) = 5$: (a) $maxdist(q_1, \overline{p_4p_5}) = 8$ and $mindist(q_1, \overline{p_4p_5}) = 5$; (b) $maxdist(q_1, \overline{p_5p_6}) = 11$ and $mindist(q_1, \overline{p_5p_6}) = 8$.
Figure 7. $maxdist(v_1, \{\overline{p_1p_2p_3}\}) = 24$ and $mindist(v_1, \{\overline{p_1p_2p_3}\}) = 19$.
Figure 8. $maxdist(v_1, \{\overline{p_4p_5}, \overline{p_5p_6}\}) = 12$ and $mindist(v_1, \{\overline{p_4p_5}, \overline{p_5p_6}\}) = 9$: (a) $maxdist(v_1, \overline{p_4p_5}) = 11$ and $mindist(v_1, \overline{p_4p_5}) = 9$; (b) $maxdist(v_1, \overline{p_5p_6}) = 12$ and $mindist(v_1, \overline{p_5p_6}) = 9$.
Figure 9. $maxdist(v_2, \{\overline{p_1p_2p_3}\}) = 25.5$ and $mindist(v_2, \{\overline{p_1p_2p_3}\}) = 23$.
Figure 10. $maxdist(v_2, \{\overline{p_4p_5}, \overline{p_5p_6}\}) = 8$ and $mindist(v_2, \{\overline{p_4p_5}, \overline{p_5p_6}\}) = 5$: (a) $maxdist(v_2, \overline{p_4p_5}) = 8$ and $mindist(v_2, \overline{p_4p_5}) = 5$; (b) $maxdist(v_2, \overline{p_5p_6}) = 8$ and $mindist(v_2, \overline{p_5p_6}) = 5$.
Figure 11. Arranging data clusters in decreasing order of their maximum distance to $q_1$.
Figure 12. Comparison of kFN join query processing times for the NA roadmap: (a) $10^3 \leq |Q| \leq 5 \times 10^3$; (b) $10^3 \leq |P| \leq 5 \times 10^3$; (c) $1 \leq k \leq 16$; (d) $1 \leq |C_Q| \leq 10$; (e) $1 \leq |C_P| \leq 10$.
Figure 13. Comparison of kFN join query processing times for the SJ roadmap: (a) $10^3 \leq |Q| \leq 5 \times 10^3$; (b) $10^3 \leq |P| \leq 5 \times 10^3$; (c) $1 \leq k \leq 16$; (d) $1 \leq |C_Q| \leq 10$; (e) $1 \leq |C_P| \leq 10$.
Figure 14. Scalability test: (a) $10^3 \leq |Q| \leq 10^4$ for NA; (b) $10^3 \leq |P| \leq 10^4$ for NA; (c) $10^3 \leq |Q| \leq 10^4$ for SJ; (d) $10^3 \leq |P| \leq 10^4$ for SJ.
Table 1. Classification of related work.
References | Space Domain | Query Type | Data Type
[8,9,14,19] | Euclidean space | RkFN search | Monochromatic
[14,22] | Euclidean space | RkFN search | Bichromatic
[6,9,17,18,20,25] | Euclidean space | kFN search | —
[7] | Euclidean space | AkFN search | —
[26] | Euclidean space | FDL search | —
[13] | Spatial network | RkFN search | Monochromatic
[10,13] | Spatial network | RkFN search | Bichromatic
[35] | Spatial network | kFN search | —
[11] | Spatial network | AkFN search | —
This study | Spatial network | kFN join | —
Table 2. Symbols used in this paper and their meanings.
Symbol | Definition
$k$ | Number of requested FNs
$Q$ and $q$ | A set $Q$ of query points and a query point $q$ in $Q$, respectively
$P$ and $p$ | A set $P$ of data points and a data point $p$ in $P$, respectively
$\overline{v_l v_{l+1} \cdots v_m}$ | Vertex sequence where $v_l$ and $v_m$ are either an intersection vertex or a terminal vertex, and the other vertices $v_{l+1}, \ldots, v_{m-1}$ are intermediate vertices
$\overline{q_i q_{i+1} \cdots q_j}$ | Query segment connecting query points $q_i, q_{i+1}, \ldots, q_j$ in a vertex sequence (abbreviated $\overline{q_i q_j}$)
$\overline{p_l p_{l+1} \cdots p_m}$ | Data segment connecting data points $p_l, p_{l+1}, \ldots, p_m$ in a vertex sequence (abbreviated $\overline{p_l p_m}$)
$\overline{Q_C}$ and $\overline{P_C}$ | Set of query segments and set of data segments, respectively
$\overline{Q}$ and $\overline{P}$ | Set of query clusters and set of data clusters, respectively
$B(\overline{Q_C})$ and $B(\overline{P_C})$ | Sets of border points of $\overline{Q_C}$ and $\overline{P_C}$, respectively
$b_q$ and $b_p$ | Border points of $\overline{Q_C}$ and $\overline{P_C}$, respectively
$\Omega(q)$ | Set of the $k$ data points farthest from a query point $q$
$dist(q, p)$ | Length of the shortest path connecting points $q$ and $p$
$len(\overline{qp})$ | Length of the segment $\overline{qp}$
Table 3. Maximum and minimum distances between border points and data clusters.
$b_q$ | $\{\overline{p_1p_2p_3}\}$ | $\{\overline{p_4p_5}, \overline{p_5p_6}\}$
$q_1$ | $maxdist = 28$, $mindist = 24$ | $maxdist = 11$, $mindist = 5$
$v_1$ | $maxdist = 24$, $mindist = 19$ | $maxdist = 12$, $mindist = 9$
$v_2$ | $maxdist = 25.5$, $mindist = 23$ | $maxdist = 8$, $mindist = 5$
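The bounds in Table 3 can be reproduced from the network distances between a border point and the two end data points of a segment. The sketch below is a reconstruction, not the paper's code: it assumes the standard network-distance bounds for a segment (the minimum is attained at the nearer endpoint, the maximum at the interior point where the routes through the two endpoints meet) and infers the segment length len(p1p3) = 5, since that value makes every row of Table 3 agree with the border-point distances listed in Table 5.

```python
def segment_bounds(d_l, d_m, seg_len):
    """Distance bounds from a border point to a data segment whose two
    end data points are reached at network distances d_l and d_m."""
    mindist = min(d_l, d_m)              # nearer endpoint
    maxdist = (d_l + d_m + seg_len) / 2  # interior point where the two routes meet
    return mindist, maxdist

def cluster_bounds(per_segment_bounds):
    """A data cluster takes the extremes over its member segments,
    as illustrated in Figures 6, 8, and 10."""
    mins, maxs = zip(*per_segment_bounds)
    return min(mins), max(maxs)

# Reproducing the first column of Table 3 with the Table 5 distances,
# assuming len(p1p3) = 5 (inferred; it makes all three rows match):
print(segment_bounds(24, 27, 5))  # q1: (24, 28.0)
print(segment_bounds(19, 24, 5))  # v1: (19, 24.0)
print(segment_bounds(23, 23, 5))  # v2: (23, 25.5)
# Second column for q1, combining the per-segment bounds of Figure 6:
print(cluster_bounds([(5, 8), (8, 11)]))  # (5, 11)
```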
Table 4. Comparison of time complexities of the CNLJ and nonclustering join algorithms.
 | CNLJ Algorithm | Nonclustering Join Algorithm
Number of kFN queries to be evaluated | $M \cdot |\overline{Q}|$ | $|Q|$
Time complexity to evaluate one kFN search | $O(E + V \log V + P \log P)$ | $O(E + V \log V + P \log P)$
Time complexity to evaluate the kFN join | $O(|\overline{Q}| \cdot (E + V \log V + P \log P))$ | $O(|Q| \cdot (E + V \log V + P \log P))$
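To make the magnitude of this difference concrete: in the scalability test with |Q| = 10,000 query points, a nonclustering join must evaluate 10,000 separate kFN queries, whereas the CNLJ algorithm evaluates only M · |Q̄| of them. With, say, |Q̄| = 10 query clusters and M = 3 border points per cluster (the running example of Table 5 evaluates kFN queries at three border points, q1, v1, and v2), that amounts to 30 kFN searches in total, a reduction of more than two orders of magnitude in search invocations.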
Table 5. Results of kFN queries at $q_1$, $v_1$, and $v_2$ and their sentinel distances.
$b_q$ | $dist(b_q, p)$ | $sntl\_dist(b_q)$ | $\Omega(b_q)$
$q_1$ | $dist(q_1, p_1) = 24$, $dist(q_1, p_2) = 25$, $dist(q_1, p_3) = 27$ | $sntl\_dist(q_1) = 20$ | $\Omega(q_1) = \{p_1, p_2, p_3\}$
$v_1$ | $dist(v_1, p_1) = 19$, $dist(v_1, p_2) = 20$, $dist(v_1, p_3) = 24$ | $sntl\_dist(v_1) = 15$ | $\Omega(v_1) = \{p_1, p_2, p_3\}$
$v_2$ | $dist(v_2, p_1) = 23$, $dist(v_2, p_2) = 24$, $dist(v_2, p_3) = 23$ | $sntl\_dist(v_2) = 18$ | $\Omega(v_2) = \{p_1, p_2, p_3\}$
Table 6. Retrieval of two FNs for query points among candidate data points.
$q$ | $dist(q, p)$ | $\Omega(q)$
$q_1$ | $dist(q_1, p_1) = 24$, $dist(q_1, p_2) = 25$, $dist(q_1, p_3) = 27$ | $\Omega(q_1) = \{p_2, p_3\}$
$q_2$ | $dist(q_2, p_1) = 25$, $dist(q_2, p_2) = 26$, $dist(q_2, p_3) = 25$ | $\Omega(q_2) = \{p_1, p_2\}$ or $\Omega(q_2) = \{p_2, p_3\}$
$q_3$ | $dist(q_3, p_1) = 21$, $dist(q_3, p_2) = 22$, $dist(q_3, p_3) = 25$ | $\Omega(q_3) = \{p_2, p_3\}$
$q_4$ | $dist(q_4, p_1) = 21$, $dist(q_4, p_2) = 22$, $dist(q_4, p_3) = 26$ | $\Omega(q_4) = \{p_2, p_3\}$
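The refinement step behind Table 6 is an ordinary top-k selection over the candidate distances. A minimal sketch, with the distances for q2 hard-coded from the table (heapq.nlargest stands in for whatever selection routine an implementation would use):

```python
import heapq

# Candidate distances for q2, hard-coded from Table 6 (k = 2).
dists = {"p1": 25, "p2": 26, "p3": 25}

top2 = heapq.nlargest(2, dists, key=dists.get)
print(top2)  # ['p2', 'p1']; since dist(q2, p1) = dist(q2, p3) = 25,
             # {p2, p3} is an equally valid answer, as Table 6 notes.
```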
Table 7. Real-world roadmaps [40].
Name | Description | Vertices | Edges | Vertex Sequences
NA | Highways in North America | 175,813 | 179,179 | 12,416
SJ | City streets in San Joaquin, California | 18,263 | 23,874 | 20,040
Table 8. Experimental parameter settings.
Parameter | Range
Number of query points ($|Q|$) | 1, 2, 3, 4, 5, 7, 10 ($\times 10^3$)
Number of data points ($|P|$) | 1, 2, 3, 4, 5, 7, 10 ($\times 10^3$)
Number of FNs required ($k$) | 1, 2, 4, 8, 16
Distribution of query and data points | Centroid distribution
Number of centroids for query points in $Q$ ($|C_Q|$) | 1, 3, 5, 7, 10
Number of centroids for data points in $P$ ($|C_P|$) | 1, 3, 5, 7, 10
Standard deviation for the normal distribution ($\sigma$) | $10^{-2}$
Roadmap | NA, SJ
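For reproducibility, the centroid distribution of Table 8 can be approximated by drawing each point from a normal distribution around a uniformly chosen centroid. The generator below is a sketch under stated assumptions (coordinates normalized to the unit square, σ = 10⁻² as in the table, and points not yet snapped onto the road network), not the exact procedure used in the experiments:

```python
import random

def centroid_points(n, num_centroids, sigma=0.01, seed=42):
    """Sketch of a 'centroid distribution' generator (one plausible
    reading of Table 8): coordinates are assumed normalized to [0, 1],
    and each point is a Gaussian perturbation of a uniformly chosen
    centroid. Real experiments would also map points onto the network.
    """
    rng = random.Random(seed)
    centroids = [(rng.random(), rng.random()) for _ in range(num_centroids)]
    points = []
    for _ in range(n):
        cx, cy = rng.choice(centroids)
        x = min(max(rng.gauss(cx, sigma), 0.0), 1.0)  # clamp to the unit square
        y = min(max(rng.gauss(cy, sigma), 0.0), 1.0)
        points.append((x, y))
    return points

# e.g., |Q| = 5000 query points around |C_Q| = 5 centroids
Q = centroid_points(5000, 5)
```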
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
