Distributed Processing of Location-Based Aggregate Queries Using MapReduce

Abstract: Location-based aggregate queries, consisting of the shortest average distance query (SAvgDQ), the shortest minimal distance query (SMinDQ), the shortest maximal distance query (SMaxDQ), and the shortest sum distance query (SSumDQ), are new types of location-based queries. Such queries provide the user with useful object information by considering both the spatial closeness of objects to the query object and the neighboring relationship between objects. Because a large number of location-based aggregate queries need to be evaluated concurrently, a centralized processing system would suffer a heavy query load, eventually leading to poor performance. In this paper, we therefore focus on developing a distributed processing technique, based on the MapReduce platform, to answer multiple location-based aggregate queries. We first design a grid structure that manages object information while taking the storage balance into account, and then develop a distributed processing algorithm, namely the MapReduce-based aggregate query algorithm (MRAggQ algorithm), to efficiently process the location-based aggregate queries in a distributed manner. Extensive experiments using synthetic and real datasets are conducted to demonstrate the scalability and efficiency of the proposed algorithm.


Introduction
With the rapid advances in ubiquitous and mobile computing, processing location-based queries on spatial objects [1][2][3][4][5][6] has become essential for various applications, such as traffic control systems, location-aware advertisements, and mobile information systems. Currently, most conventional location-based queries focus exclusively on a single type of object (e.g., the nearest neighbor query finds the closest restaurant or hotel to the user). In other words, the different types of objects (termed the heterogeneous objects) are considered independently in processing the location-based queries, which means that the neighboring relationship between the heterogeneous objects is completely ignored. Let us consider a scenario where the user wants to stay in a hotel, have lunch in a restaurant, and go to the movies. Here, the hotel, the restaurant, and the theater are the heterogeneous objects. If the nearest neighbor queries are processed independently for the heterogeneous objects, the user learns his/her closest hotel, restaurant, and theater, which, however, may actually be far away from each other. Therefore, in addition to the spatial closeness of the heterogeneous objects to the query point, the neighboring relationship between the heterogeneous objects should also play an important role in determining the query result.
In our previous work [7], we presented the location-based aggregate queries, which provide information about the heterogeneous objects by taking into account both their neighboring relationship and their spatial closeness to the query point. In order to preserve the neighboring relationship between the heterogeneous objects, the location-based aggregate queries aim at finding heterogeneous objects that are close to each other by constraining their mutual distance to be within a user-defined distance d. A set of objects satisfying the constraint of distance d is termed a heterogeneous neighboring object set (or HNO set). To maintain the spatial closeness of the heterogeneous objects to the query point, four types of location-based aggregate queries are presented to provide information about HNO sets according to specific user requirements. They are the shortest average-distance query (SAvgDQ), the shortest minimal-distance query (SMinDQ), the shortest maximal-distance query (SMaxDQ), and the shortest sum-distance query (SSumDQ), which are described as follows.
• Consider the n types of objects, O_1, O_2, ..., O_n, and assume that there are m HNO sets, each containing one object of every type. Given a query point q, each query returns the HNO set with the shortest aggregate distance to q, where the aggregate distance is the average distance of the objects in the set for the SAvgDQ, the distance of the object closest to q for the SMinDQ, the distance of the object furthest from q for the SMaxDQ, and the traveling distance from q through the objects of the set for the SSumDQ.
Let us use Figure 1 to illustrate how to process the four types of location-based aggregate queries (i.e., the SAvgDQ, the SMinDQ, the SMaxDQ, and the SSumDQ). As shown in Figure 1a, there are three types of data objects in the space: the hotels h_1 to h_5, the restaurants r_1 to r_5, and the theaters t_1 to t_5. Assume that the user-defined distance d is set to 2 (that is, the distance between any two objects should be less than or equal to 2), which leads to three HNO sets, {h_1, r_3, t_1}, {h_2, r_1, t_3}, and {h_3, r_2, t_2} (shown as the gray areas). Take the query point q_1 in Figure 1b, which issues the SAvgDQ, as an example. For each HNO set, the distance between each object in the HNO set and the query point q_1 is first computed, and then the HNO set with the shortest average distance to q_1 is the result set of the SAvgDQ (i.e., the set {h_2, r_1, t_3}). Meanwhile, the SMinDQ and the SMaxDQ issued by the query points q_2 and q_3, respectively, also need to be evaluated. When the SMinDQ is considered, the distances of the objects closest to q_2 in {h_1, r_3, t_1}, {h_2, r_1, t_3}, and {h_3, r_2, t_2}, respectively, are compared to each other, and then the HNO set containing q_2's nearest neighbor (i.e., {h_3, r_2, t_2}) is returned as the result set. In contrast to the SMinDQ, the SMaxDQ takes the furthest object in each HNO set into account. For the query point q_3, its furthest objects in the three HNO sets are t_1, t_3, and t_2, respectively. Among them, object t_1 has the shortest distance to q_3, and hence the SMaxDQ retrieves the set {h_1, r_3, t_1} because it contains t_1. Consider the SSumDQ issued from the query point q_4, which is processed simultaneously by the system. The shortest traveling path for each of the three HNO sets {h_1, r_3, t_1}, {h_2, r_1, t_3}, and {h_3, r_2, t_2} has to be determined so as to find the HNO set resulting in the shortest traveling distance from q_4. Finally, the set {h_1, r_3, t_1} is the SSumDQ result because it yields the shortest path.
The processing techniques developed in [7] focus only on efficiently processing a single location-based aggregate query (corresponding to SAvgDQ, SMinDQ, SMaxDQ, or SSumDQ). However, in highly dynamic environments, where users can obtain object information through portable devices (e.g., laptops, 3G mobile phones, and tablet PCs), multiple location-based aggregate queries may be issued by the users from anywhere at any time. (For instance, in Figure 1, the SAvgDQ, the SMinDQ, the SMaxDQ, and the SSumDQ are issued from different query points at the same time.) This means that, when a large number of location-based aggregate queries are processed concurrently, the time spent on evaluating them sequentially increases dramatically. Even worse, by the time a location-based aggregate query terminates, the query result may already be outdated. As a result, it is necessary to design distributed processing techniques to rapidly evaluate multiple location-based aggregate queries.
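The four aggregate distances illustrated with Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [7]: the function names are hypothetical, Euclidean distance is assumed, and the SSumDQ distance is assumed to be the length of the shortest path that starts at q and visits every object of the HNO set once (found here by brute force, which is acceptable only for small n). The coordinates in the usage lines are illustrative.

```python
from itertools import permutations
from math import dist  # Euclidean distance, Python 3.8+


def aggregate_distance(q, hno_set, kind):
    """Aggregate distance from query point q to an HNO set (list of (x, y))."""
    dists = [dist(q, p) for p in hno_set]
    if kind == "avg":   # SAvgDQ
        return sum(dists) / len(dists)
    if kind == "min":   # SMinDQ: object closest to q
        return min(dists)
    if kind == "max":   # SMaxDQ: object furthest from q
        return max(dists)
    if kind == "sum":   # SSumDQ (assumption: shortest open path starting at q)
        best = float("inf")
        for order in permutations(hno_set):
            length = dist(q, order[0])
            length += sum(dist(a, b) for a, b in zip(order, order[1:]))
            best = min(best, length)
        return best
    raise ValueError(f"unknown query type: {kind}")


def best_hno_set(q, hno_sets, kind):
    """Return the HNO set with the shortest aggregate distance to q."""
    return min(hno_sets, key=lambda s: aggregate_distance(q, s, kind))


# Illustrative usage with two HNO sets of three objects each.
set_a = [(17, 10), (17, 9), (16, 11)]
set_b = [(25, 6), (23, 6), (24, 4)]
q = (26, 4)
print(round(aggregate_distance(q, set_b, "avg"), 2))  # 2.61
print(best_hno_set(q, [set_a, set_b], "avg") == set_b)  # True
```

Evaluating such a function sequentially for every query point and every HNO set is the cost that the distributed approach is meant to spread over several machines.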
To achieve the objective of distributed processing of location-based aggregate queries, we adopt the widely used MapReduce platform [8], which processes multiple queries over large-scale datasets by involving a number of shared-nothing machines. For data storage, an existing distributed file system (DFS), such as the Google File System (GFS) or the Hadoop Distributed File System (HDFS), is usually used as the underlying storage system. Based on the partitioning strategy used in the DFS, data are divided into equal-sized chunks, which are distributed over the machines. For query processing, a MapReduce-based algorithm executes in several jobs, each of which has three phases: map, shuffle, and reduce. In the map phase, each participating machine prepares information to be delivered to other machines. The shuffle phase is responsible for the actual data transfer. In the reduce phase, each machine performs its calculation using its local storage. The current job finishes after the reduce phase. If the process has not been completed, another MapReduce job starts. Depending on the application, a MapReduce job may be executed once or multiple times.
In this paper, we focus on developing the MapReduce-based methods to efficiently answer multiple location-based aggregate queries (consisting of numerous SAvgDQ, SMinDQ, SMaxDQ, and SSumDQ issued concurrently from different query points) in a distributed manner. We first utilize a grid structure to manage the heterogeneous objects in the space by taking into account the storage balance, and information of the partitioned object data in each grid cell is stored in the DFS. Next, we propose a distributed processing algorithm, namely the MapReduce-based aggregate query algorithm (MRAggQ algorithm for short), which is composed of four phases: the Inner HNO set determining phase, the Outer HNO set determining phase, the Aggregate-distance computing phase, and the Result set generating phase, each of which executes a MapReduce job to finish the procedure. Finally, we conduct a comprehensive set of experiments over synthetic and real datasets, demonstrating the efficiency, the robustness, and the scalability of the proposed MRAggQ algorithm, in terms of the average running time in performing different workloads of location-based aggregate queries.
The rest of this paper is organized as follows. In Section 2, we review the previous work on processing various types of location-based queries in centralized and distributed environments. Section 3 describes the grid structure used for maintaining information of the heterogeneous objects. In Section 4, we present how the MRAggQ algorithm can be used to process multiple location-based aggregate queries efficiently. Section 5 shows extensive experiments on the performance of the proposed methods. In Section 6, we conclude the paper with directions for future work.

Related Works
Efficient processing of location-based queries has been an emerging research topic in recent years. Here, we first review the centralized methods for processing location-based queries on a single object type and on multiple types of objects (i.e., the heterogeneous objects). Then, we discuss the MapReduce programming technique and survey some works on processing location-based queries using MapReduce.

Centralized Processing Techniques for Location-Based Queries
Most of the conventional location-based queries on a single data type concentrate on discovering the spatial closeness of objects to the query object. The range query [9,10] is a well-known query, used to find the set of objects that are inside a spatial region specified by the user. If the spatial region is constructed according to the location of the query object q, another variation of the range query, the within query [11,12], is used to find the objects whose distances to q are less than or equal to a user-given distance d (i.e., finding the objects within the region centered at q with radius d). Recently, many efforts have been made on processing the range and within queries in different research domains, such as mobile information systems [3,13] and uncertain database systems [2,14]. The nearest neighbor query [15,16] is the most common type of location-based query, as it has important applications in the provision of location-based services. Many variations of the nearest neighbor query have been proposed for numerous applications. To address the issue of scalability, the KNN join query [17,18] is presented to find the K nearest neighbors for all objects in a query set. To express requests by groups of users, the aggregate nearest neighbor (ANN) query (a.k.a. group nearest neighbor query) is proposed by Papadias et al. [19]. Given a set of query objects Q and a set of objects O, the ANN query returns the object in O minimizing an aggregate distance function (e.g., sum or max) with respect to the objects in Q. A variation of the nearest neighbor query with an asymmetric property is the reverse nearest neighbor (RNN) query [1]. Given the query object q, the RNN query retrieves the set of objects whose nearest neighbor is q. The skyline query, also known as the maximal vector problem [20,21], was first studied in the area of computational geometry. Then, Borzsonyi et al. [22] introduced the skyline operator into database systems. If an object is not dominated by any other object in terms of multiple attributes, then it is a skyline point. By taking into account the object locations, the spatial skyline query [4] is proposed, where the distance of objects plays an important role in determining the skyline points. Given a set of m query objects and a set of n data objects, each data object has m attributes, each of which refers to its distance to a query object. The spatial skyline query retrieves the skyline points that are not dominated in terms of the m attributes.
Some related work on processing location-based queries tries to keep the neighboring relationship between the heterogeneous objects. Given two types of data objects A and B, the K closest pair query [23] finds the K closest object pairs between A and B (that is, the K pairs (a, b), where a ∈ A and b ∈ B, with the smallest distance between them). Another type of location-based query on two data sources is the spatial join query [24], which maintains a set of object pairs (each pair has one item from each of the two data sources) satisfying a given spatial predicate (e.g., overlap or coverage). Papadias et al. [25] further extend the spatial join query to the multiway spatial join query, in which the spatial predicate is a function over m data sources (where m ≥ 2). Zhang et al. [26] present the KNG query to determine the query result based on (1) the minimum distance between the heterogeneous objects and the query object (referred to as the inter-group distance) and (2) the maximum distance among the heterogeneous objects (referred to as the inner-group distance). Given a spatial database with m types of data objects and a query object q, the KNG query returns the K groups (each of which consists of one object from each data type) with the minimum sum of the inner-group distance and the inter-group distance. However, because the KNG query considers the sum of the inner-group and inter-group distances, the objects in a group retrieved by the KNG query are likely to be close to the query object but far away from each other (i.e., the inter-group distance dominates the query result), or close to each other but far away from the query object (i.e., the inner-group distance dominates the result). To appropriately keep both the spatial closeness and the neighboring relationship of objects, in our previous work [7], the location-based aggregate queries are presented to obtain information of the HNO sets.

Distributed Processing Techniques for Location-Based Queries
As mentioned in Section 1, MapReduce is a popular programming framework, which can be used to support the distributed processing of location-based queries. A MapReduce algorithm proceeds in several jobs, each of which has the map, the shuffle, and the reduce phases. In the map phase, each participating machine generates a list of key-value pairs (k, v) from its local storage, where the key k is usually numeric and the value v corresponds to arbitrary information. According to the key k, each pair (k, v) is transmitted to another machine in the shuffle phase. More specifically, the shuffle phase distributes the key-value pairs across the machines following the rule that pairs with the same key are delivered to the same machine. In the reduce phase, each machine incorporates the key-value pairs received from the shuffle phase into its local storage, and performs its task using the local data. When the reduce phases of all machines are completed, the current MapReduce job terminates.
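To make the map, shuffle, and reduce phases concrete, the following single-process Python sketch imitates one MapReduce job in memory. It is an assumption-laden toy (the names run_job, mapper, and reducer are hypothetical, and no real Hadoop API is used); it only shows that pairs with the same key end up in the same reduce call.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job on a single machine."""
    # Map: every input record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: pairs with the same key are grouped together, which mimics
    # delivering them to the same machine.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: each group is processed independently of the others.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Illustrative usage: count the objects stored in each grid cell.
records = [(1, "r3"), (1, "s3"), (2, "r4")]
counts = run_job(
    records,
    mapper=lambda rec: [(rec[0], 1)],
    reducer=lambda cell, ones: [(cell, sum(ones))],
)
print(counts)  # [(1, 2), (2, 1)]
```

In a real deployment the groups would be processed on different machines; the grouping rule is the same.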
There has been considerable interest in supporting location-based queries over the MapReduce framework. Cary et al. [27] present techniques for building R-trees based on MapReduce, which, however, do not address the issues of processing location-based queries. Zhang et al. [28] show how location-based queries can be naturally expressed in the MapReduce framework, including the spatial selection queries, the spatial join queries, and the nearest neighbor queries. Ji et al. [29] propose a MapReduce-based approach, in which an inverted grid structure is built to index data objects, to answer the KNN queries. Furthermore, in [30], they extend their approach to process a variant of KNN queries, the RKNN query. Akdogan et al. [31] focus on processing various types of location-based queries (including RNN, MaxRNN, and KNN queries) by creating a Voronoi diagram for data objects based on the MapReduce programming model. In their method, each data object is represented as a pivot which is then used to partition the space. Yokoyama et al. [32] propose a method that decomposes the given space into cells and evaluates the AKNN queries using MapReduce in a distributed and parallel manner. Zhang et al. [33] present exact and approximate MapReduce-based algorithms to efficiently perform parallel KNN join queries on a large-scale dataset. To improve the performance of KNN join queries, Lu et al. [34] further design an effective mapping mechanism that exploits pruning rules for distance filtering to reduce both the shuffling and computational costs.
Recently, Eldawy et al. [35,36] focus on developing a MapReduce framework, SpatialHadoop, which is a comprehensive extension of Hadoop. SpatialHadoop provides an expressive high-level language for spatial objects, adapts a set of spatial index structures (e.g., grid structure, R-tree, and R+-tree) that are built into the HDFS, and supports the traditional location-based queries (including the range, KNN, and spatial join queries). Moreover, in [37], they address the issue of processing skewed datasets in SpatialHadoop by presenting a box counting function to detect the degree of skewness of a spatial dataset. SpatialHadoop is carefully designed for location-based queries in which the spatial closeness of a single type of objects to the query point is the main concern in determining the query result. However, it cannot be directly applied to answering the location-based aggregate queries because (1) the query result consists of heterogeneous objects, rather than a single type of objects, and (2) whether the heterogeneous objects satisfy the constraint of distance d (i.e., have a close neighboring relationship) should be taken into account.

Grid Structure
In our model, there are n types of data objects (i.e., the heterogeneous objects) in the space. As the location database contains large amounts of information that need to be maintained, a grid structure is used to manage such information by partitioning the space into multiple grid cells, each of which stores the data of the objects enclosed in it. In order to balance the storage load of the grid cells, the data space is partitioned into C × C equal-sized cells by considering a pre-defined parameter α. Initially, all the heterogeneous objects are grouped into a single cell (i.e., a 1 × 1 grid). Then, the number of objects enclosed in a cell is compared with the parameter α. Once the object number is greater than α, the data space covering all objects is repartitioned into 2 × 2 cells. Similarly, if there still exists a cell within which the object number exceeds α, then the data space is repartitioned into 3 × 3 cells. This partitioning process continues until each cell cell(c) satisfies the condition that the number of objects in cell(c) is less than or equal to α. By exploiting the parameter α, the storage overhead for maintaining information of objects can be distributed among the cells so that no cell stores more than α objects. Figure 2 shows an example of how the data space is divided by taking into account the storage load of each cell. As shown in Figure 2a, there are three types of data objects, R, S, and T, in the space, each of which has five objects with coordinates (x, y) (e.g., the coordinates of object r_1 are (3, 14)). Suppose that the pre-defined parameter α is set to 3. The data space would be divided into 3 × 3 cells, so as to guarantee that the number of objects in each cell does not exceed 3. The final grid cells, which are numbered from 0 to 8, are shown in Figure 2b.
In order to provide parallel processing of the heterogeneous objects using MapReduce, information of the grid structure is stored in a distributed storage system, which is the HDFS by default. The HDFS consists of multiple DataNodes for storing data and a NameNode for monitoring all DataNodes. In the HDFS, a file is broken into multiple equal-sized chunks and then the NameNode allocates the data chunks among the DataNodes for query processing. Returning to the example in Figure 2, the cells cell(0) to cell(8) are treated as the chunks and kept on the HDFS. Take the cell cell(1) as an example. As objects r_3 and s_3 are enclosed in cell(1), the chunk corresponding to cell(1) in the HDFS stores r_3 and s_3 with their coordinates (17, 10) and (17, 9), respectively. Note that the cells cell(0) and cell(6) need not be kept on the HDFS because there is no object in them. Figure 2c shows how the grid structure for the heterogeneous objects is stored on the HDFS.
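The repartitioning procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the paper's implementation: the function name build_grid and the max_c safeguard are hypothetical, cells are assumed to be numbered row by row starting from the bottom-left corner, and an object lying exactly on a cell border is assigned to the cell on its upper or right side, which may differ from the convention used in Figure 2. Empty cells are simply not stored, mirroring the remark about cell(0) and cell(6).

```python
def build_grid(objects, alpha, width, height, max_c=1024):
    """Find the smallest C such that no cell of a C x C grid holds more
    than alpha objects.

    objects: dict mapping an object name to its (x, y) coordinates.
    Returns (C, cells), where cells maps a cell id to the object names
    enclosed in that cell; empty cells are omitted.
    """
    for c in range(1, max_c + 1):
        wx, wy = width / c, height / c  # cell widths on the x- and y-axis
        cells = {}
        for name, (x, y) in objects.items():
            col = min(int(x // wx), c - 1)  # clamp objects on the far border
            row = min(int(y // wy), c - 1)
            cells.setdefault(row * c + col, []).append(name)
        if all(len(names) <= alpha for names in cells.values()):
            return c, cells
    raise ValueError("no grid with at most alpha objects per cell found")


# Illustrative usage with hypothetical objects in a 30 x 30 space.
objects = {"a": (1, 1), "b": (2, 2), "c": (11, 11), "d": (12, 12)}
print(build_grid(objects, alpha=2, width=30, height=30))
# (3, {0: ['a', 'b'], 4: ['c', 'd']})
```

Because the whole space is repartitioned whenever a single cell overflows, a small α combined with clustered data can lead to many cells, most of them empty; the max_c guard only prevents an endless loop when more than α objects share the same coordinates.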

MapReduce-Based Aggregate Query Algorithm
Given the n types of data objects, O_1, O_2, ..., O_n, a set of query points Q (where a query point q ∈ Q corresponds to a SAvgDQ, a SMinDQ, a SMaxDQ, or a SSumDQ), and the user-defined distance d, the main goal of the MapReduce-based aggregate query (MRAggQ) algorithm is to efficiently determine, for each query point q, the HNO set with the shortest distance in a distributed manner. Recall that a set of objects {o_1, o_2, ..., o_n} (where o_i ∈ O_i and i = 1, ..., n) can be included in the result set of the location-based aggregate queries only if the following two conditions hold: (1) the distance between any two objects in {o_1, o_2, ..., o_n} is less than or equal to d (as a necessary condition) and (2) {o_1, o_2, ..., o_n} has the shortest average, minimal, maximal, or sum distance to the query point. The MRAggQ algorithm is developed according to these two conditions. It consists of four phases, in which the first two phases are in charge of checking condition (1) and the last two phases are in charge of checking condition (2). In the following, we briefly describe the purposes of the four phases and then discuss the details separately. To provide an overview of the MRAggQ algorithm, a flowchart and a pseudo code for the four phases are also given in Figure 3 and Algorithm 1, respectively:
• The first phase, the Inner HNO set determining phase, aims at finding, for each cell cell(c), the sets of objects that are enclosed in cell(c) and are within the distance d from each other. Here, we term the object sets found in this phase the Inner HNO sets.
• The second phase, the Outer HNO set determining phase, focuses on finding the HNO sets that have not been discovered in the previous phase. This means that the objects constituting an HNO set determined in this phase are not fully enclosed in one cell. Instead, the objects are distributed over different cells. We term the HNO sets discovered in this phase the Outer HNO sets.
• The third phase, the Aggregate-distance computing phase, is responsible for computing the aggregate-distances of all HNO sets obtained from the previous two phases to each query point contained in the query set Q, according to the type of location-based aggregate queries (i.e., the aggregate-distance may be the average, the minimal, the maximal, or the sum distance).
• The last phase, the Result set generating phase, sorts the aggregate-distances of all HNO sets computed in the previous phase, so as to output the HNO set with the shortest aggregate-distance for each query point in Q.

Inner HNO Set Determining Phase
Given the n types of objects stored on the HDFS, the goal of the Inner HNO set determining phase is to determine in parallel, for each cell cell(c), the Inner HNO sets, each of which is composed of n objects of different types enclosed in cell(c). In this phase, a MapReduce job consisting of the map step, the shuffle step, and the reduce step is executed to finish the procedure. In the map step, the content of each cell is extracted from the HDFS as input in the form of < cell(c), {o_i, (x_i, y_i)} > (i.e., < key, value > pairs). Each pair < cell(c), {o_i, (x_i, y_i)} > generated by the map step is then transmitted to another machine in the shuffle step, where the recipient machine is determined solely by the value of cell(c). That is, if the pairs have a common key cell(c), all of them will arrive at an identical machine for processing in the reduce step. This is because, for the n pairs < cell(c), {o_i, (x_i, y_i)} > (where i = 1, ..., n) with the same key cell(c), a set composed of the n objects o_1, o_2, ..., o_n has a chance to be an Inner HNO set as all the objects are enclosed in the cell cell(c). In the reduce step, two processing tasks are carried out in each participating machine, by taking into account the key-value pairs received from the shuffle step.
• The first task is to compute the distance between any two objects o_i and o_j enclosed in cell(c), where 1 ≤ i, j ≤ n and i ≠ j, based on their coordinates (x_i, y_i) and (x_j, y_j). Consider a set of objects {o_1, o_2, ..., o_n} enclosed in the cell cell(c). If the computed distances of all object pairs are less than or equal to the distance d, then {o_1, o_2, ..., o_n} is an Inner HNO set of cell(c). Hence, a key-value pair in the form of < cell(c), {{o_1, (x_1, y_1)}, {o_2, (x_2, y_2)}, ..., {o_n, (x_n, y_n)}} > is returned as output.
• The second task, as a preliminary to the next phase, the Outer HNO set determining phase, focuses on marking the objects enclosed in cell(c) that may constitute an Outer HNO set with other objects enclosed in different cells. We term the objects determined by the second task the marked objects. An object o_i enclosed in cell(c) can be a marked object only if the circle centered at o_i with radius d is not fully contained in cell(c). Otherwise (i.e., the circle is enclosed by cell(c)), there exists no object that is enclosed in another cell cell(c') and whose distance to object o_i is less than or equal to d, and thus o_i cannot be contained in any Outer HNO set. Suppose that the data space is divided into C × C cells, where each equal-sized cell is represented as a rectangle with widths w_x and w_y on the x-axis and y-axis, respectively. An object o_i with coordinates (x_i, y_i) is a marked object in cell cell(c) if the following condition holds:
[Equation (1)]
Similar to the first task, a key-value pair with respect to each marked object o_i (i.e., < key_i, {o_i, (x_i, y_i)} >) is generated after executing the second task. The generated key is mainly used to guarantee that the n types of objects constituting an Outer HNO set can be processed in the same machine. Note that, if such objects were considered in different machines, some of the Outer HNO sets might be lost. In order to give each marked object o_i enclosed in the cell cell(c) a key key_i, we first merge C_x × C_y cells into a rectangle R bounding the cell cell(c), where the parameters C_x and C_y are estimated based on the following equation:
[Equation (2)]
Then, the key of the marked object o_i is set to the union of the ids of these cells. To convey the main idea behind Equation (2), we take the cell cell(4) in Figure 2b as an example, where the user-defined distance d = 2.5 and both the widths w_x and w_y of each cell are equal to 10. Based on Equation (2), a rectangle R consisting of 2 × 2 cells is constructed to enclose the cell cell(4) (here, R can be represented as cell(0, 1, 3, 4), cell(1, 2, 4, 5), cell(3, 4, 6, 7), or cell(4, 5, 7, 8)). Let us consider the rectangle R corresponding to cell(0, 1, 3, 4). As the minimal distance between cell(4) and each of the other three cells, cell(0), cell(1), and cell(3), is less than or equal to d, it is possible that an Outer HNO set is composed of one or more marked objects in cell(4) and the rest in the other three cells. As such, we should give all the marked objects enclosed in the rectangle R a common key, cell(0, 1, 3, 4), so as to process them in the same machine. In addition, the keys cell(1, 2, 4, 5), cell(3, 4, 6, 7), and cell(4, 5, 7, 8) are assigned to the marked objects enclosed in their corresponding rectangles R in the same way.
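The two reduce-step tasks of this phase can be sketched in Python as follows. This is a hedged illustration, not the paper's Algorithm 1, and all function names are hypothetical. Three assumptions are made and should be checked against the original equations: (i) the marking test is written as a comparison of the circle's bounding coordinates with the cell borders, using non-strict comparisons so that a circle touching the border is also marked; (ii) the rectangle size is taken as C_x = ⌈d / w_x⌉ + 1 and C_y = ⌈d / w_y⌉ + 1, chosen only because it reproduces the 2 × 2 rectangles of the cell(4) example; (iii) cells are numbered row by row from the bottom-left corner.

```python
from itertools import product
from math import ceil, dist


def inner_hno_sets(objects_by_type, d):
    """Task 1: objects_by_type holds one list of (name, (x, y)) per object
    type, all enclosed in the same cell. Returns every combination with one
    object per type whose pairwise distances are all <= d."""
    result = []
    for combo in product(*objects_by_type):
        if all(dist(a[1], b[1]) <= d
               for i, a in enumerate(combo) for b in combo[i + 1:]):
            result.append(tuple(name for name, _ in combo))
    return result


def is_marked(point, cell_box, d):
    """Task 2 (assumed form of the condition): the circle of radius d around
    the point reaches the border of cell_box = (x_lo, y_lo, x_hi, y_hi)."""
    x, y = point
    x_lo, y_lo, x_hi, y_hi = cell_box
    return x - d <= x_lo or x + d >= x_hi or y - d <= y_lo or y + d >= y_hi


def rectangle_keys(cell_id, c, d, wx, wy):
    """Keys of all C_x x C_y rectangles of a c x c grid that contain cell_id
    (assumed C_x = ceil(d / wx) + 1, C_y = ceil(d / wy) + 1)."""
    cx = min(ceil(d / wx) + 1, c)
    cy = min(ceil(d / wy) + 1, c)
    row, col = divmod(cell_id, c)
    keys = []
    for r0 in range(max(0, row - cy + 1), min(row, c - cy) + 1):
        for c0 in range(max(0, col - cx + 1), min(col, c - cx) + 1):
            keys.append(tuple(r * c + k
                              for r in range(r0, r0 + cy)
                              for k in range(c0, c0 + cx)))
    return keys


print(rectangle_keys(4, 3, 2.5, 10, 10))
# [(0, 1, 3, 4), (1, 2, 4, 5), (3, 4, 6, 7), (4, 5, 7, 8)]
```

Under these assumptions a marked object receives one key per rectangle that contains its cell, so an object in cell(1) would be emitted with the keys cell(0, 1, 3, 4) and cell(1, 2, 4, 5).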

Outer HNO Set Determining Phase
The Outer HNO set determining phase focuses on finding the HNO sets that have not yet been discovered (i.e., the Outer HNO sets), by exploiting information of the marked objects obtained from the previous phase. Similarly, a MapReduce job is applied in the Outer HNO set determining phase, where (1) the map step receives the result of the previous phase and emits the key-value pairs, (2) the shuffle step dispatches the pairs with the same key to an identical machine for checking whether Outer HNO sets exist, and (3) the reduce step computes the distances between the marked objects and compares them with the distance d. Having executed the Outer HNO set determining phase, each key-value pair in the form of < cell(c), {{o_1, (x_1, y_1)}, {o_2, (x_2, y_2)}, ..., {o_n, (x_n, y_n)}} > is returned as output, where c refers to either a single cell id (meaning that {{o_1, (x_1, y_1)}, {o_2, (x_2, y_2)}, ..., {o_n, (x_n, y_n)}} is an Inner HNO set) or multiple cell ids (that is, an Outer HNO set). Continuing the example in Figure 4, the key-value pairs corresponding to an Inner HNO set, < cell(2), {{r_4, (25, 6)}, {s_4, (23, 6)}, {t_4, (24, 4)}} >, and the marked objects, < cell(0, 1, 3, 4), {r_3, (17, 10)} > and so on, are emitted in the map step of the Outer HNO set determining phase, as shown in Figure 5. In the shuffle step, the marked objects with a common key are assigned to the same machine for computing the distance between any two marked objects based on their coordinates. For instance, the five marked objects r_3, s_3, t_1, t_2, and t_3 with the key cell(0, 1, 3, 4) will be considered in the same machine. In the reduce step, each participating machine computes the distances between the marked objects assigned by the shuffle step (note that only the distances between different types of marked objects are computed), and then outputs the Outer HNO sets. In this figure, the key-value pairs < cell(0, 1, 3, 4), {{r_3, (17, 10)}, {s_3, (17, 9)}, {t_3, (16, 11)}} > and < cell(1, 2, 4, 5), {{r_3, (17, 10)}, {s_3, (17, 9)}, {t_3, (16, 11)}} > are returned as they satisfy the constraint of distance d. As we can see, {{r_3, (17, 10)}, {s_3, (17, 9)}, {t_3, (16, 11)}} is a duplicate set and needs to be eliminated. The duplicate elimination is carried out in the last phase, the Result set generating phase.
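The reduce step of this phase can be sketched as below. The sketch is illustrative only: the function name and the record layout (name, type, coordinates, cell id) are hypothetical, and combinations whose objects all lie in one cell are skipped on the assumption that they were already reported as Inner HNO sets. The coordinates of r_3, s_3, and t_3 and the cell of r_3 and s_3 are taken from the running example, while placing t_3 in cell(4) is inferred from its coordinates and the coordinates of t_1 are invented for the illustration.

```python
from itertools import product
from math import dist


def outer_hno_sets(marked, types, d):
    """marked: list of (name, type, (x, y), cell_id) sharing one key.
    types: the n object types, one object of each is needed per HNO set.
    Returns the name tuples of all sets that span more than one cell and
    whose pairwise distances are all <= d."""
    by_type = {}
    for obj in marked:
        by_type.setdefault(obj[1], []).append(obj)
    found = []
    # If some type has no marked object, the product below is empty.
    for combo in product(*(by_type.get(t, []) for t in types)):
        spans_cells = len({obj[3] for obj in combo}) > 1
        close = all(dist(a[2], b[2]) <= d
                    for i, a in enumerate(combo) for b in combo[i + 1:])
        if spans_cells and close:
            found.append(tuple(obj[0] for obj in combo))
    return found


marked = [
    ("r3", "R", (17, 10), 1),
    ("s3", "S", (17, 9), 1),
    ("t3", "T", (16, 11), 4),
    ("t1", "T", (12, 18), 4),  # hypothetical coordinates
]
print(outer_hno_sets(marked, ["R", "S", "T"], d=2.5))
# [('r3', 's3', 't3')]
```

Since combinations are drawn across types only, distances between objects of the same type are never computed, in line with the note above.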

Aggregate-Distance Computing Phase
After executing the first two phases (i.e., the Inner HNO set determining phase and the Outer HNO set determining phase), all of the HNO sets in the space have been discovered in a distributed manner. The third phase, the Aggregate-distance computing phase, then computes in parallel the aggregate-distance of each HNO set according to the type of location-based aggregate query. Suppose that $Q$ is a set of $m$ query points, $q_1, q_2, \ldots, q_m$, at each of which a SAvgDQ, a SMinDQ, a SMaxDQ, or a SSumDQ is issued. A query table with respect to $Q$ needs to be broadcast to each machine so that the machine can compute the aggregate-distances between the HNO sets it processes and each query point in $Q$. Each tuple of the query table has two fields: the query id $q_i^j$ (where $j$ can be 1, 2, 3, or 4, indicating SAvgDQ, SMinDQ, SMaxDQ, and SSumDQ, respectively) and the coordinates $(x_{q_i}, y_{q_i})$. In the map step of the Aggregate-distance computing phase, in addition to the key-value pair $\langle \text{cell}(c), \{\{o_1, (x_1, y_1)\}, \{o_2, (x_2, y_2)\}, \ldots, \{o_n, (x_n, y_n)\}\} \rangle$ for each HNO set, a key-value pair $\langle \text{cell}(c), \{\{q_1^j, (x_{q_1}, y_{q_1}), v_1\}, \{q_2^j, (x_{q_2}, y_{q_2}), v_2\}, \ldots, \{q_m^j, (x_{q_m}, y_{q_m}), v_m\}\} \rangle$ with regard to the query points is also emitted, so that the query set $Q$ can be transmitted along with each HNO set to the same machine for query processing. After the shuffle step, the HNO set $\{o_1, o_2, \ldots, o_n\}$ and the query set $\{q_1, q_2, \ldots, q_m\}$ with the same key $\text{cell}(c)$ are grouped together. On each participating machine, the reduce step computes the aggregate-distance between each HNO set and each query point assigned by the shuffle step, where the aggregate-distance refers to the average, minimal, maximal, or sum distance according to the query type (i.e., the value of $j$). Finally, each key-value pair in the form of $\langle q_i^j, \{(o_1, o_2, \ldots, o_n), d_{agg}\} \rangle$ is returned as output, where $d_{agg}$ is the aggregate-distance between the HNO set $\{o_1, o_2, \ldots, o_n\}$ and the query point $q_i$.
As shown in Figure 6, continuing the example of Figure 5, the query table maintains four query points $q_1$ to $q_4$ with their coordinates and query types, in which $q_1^1$, $q_2^2$, $q_3^3$, and $q_4^4$ issue the SAvgDQ, the SMinDQ, the SMaxDQ, and the SSumDQ, respectively. In the map step, the key-value pairs $\langle \text{cell}(0, 1, 3, 4), \{\{r_3, (17, 10)\}, \{s_3, (17, 9)\}, \{t_3, (16, 11)\}\} \rangle$, $\langle \text{cell}(1, 2, 4, 5), \{\{r_3, (17, 10)\}, \{s_3, (17, 9)\}, \{t_3, (16, 11)\}\} \rangle$, and $\langle \text{cell}(2), \{\{r_4, (25, 6)\}, \{s_4, (23, 6)\}, \{t_4, (24, 4)\}\} \rangle$ obtained from the previous phase (i.e., the Outer HNO set determining phase) are emitted. For the sake of grouping the HNO sets and the query points, the key-value pairs, $\langle \text{cell}(0, 1, 3, 4), \{\{q_1^1, (26, 4)\}, \{q_2^2, (6, 17)\},$
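The reduce step of this phase can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `aggregate_distance` and `reduce_cell` are hypothetical, Euclidean distance is assumed (it reproduces the average distances 2.61 and 11.1 quoted for q_1 in the example), and the sum distance is assumed to be the length of the shortest route that starts at the query point and visits each object once, found here by brute force over all visiting orders.

```python
import math
from itertools import permutations

# Query-type codes j as used in the query table.
SAVG, SMIN, SMAX, SSUM = 1, 2, 3, 4


def aggregate_distance(query_type, q, hno_set):
    """Aggregate-distance between query point q = (x, y) and an HNO set.

    hno_set is a list of (object_id, (x, y)) pairs.
    Assumption: Euclidean distance stands in for the distance d(q, o).
    """
    points = [xy for _, xy in hno_set]
    dists = [math.dist(q, p) for p in points]
    if query_type == SAVG:
        return sum(dists) / len(dists)
    if query_type == SMIN:
        return min(dists)
    if query_type == SMAX:
        return max(dists)
    if query_type == SSUM:
        # Assumption: shortest route from q that visits every object once.
        best = math.inf
        for order in permutations(points):
            length, current = 0.0, q
            for p in order:
                length += math.dist(current, p)
                current = p
            best = min(best, length)
        return best
    raise ValueError(f"unknown query type: {query_type}")


def reduce_cell(hno_sets, queries):
    """Reduce step for one cell key (hypothetical helper).

    hno_sets: HNO sets grouped under the key.
    queries: (query_id, query_type, (x, y)) tuples from the query table.
    Yields ((query_id, query_type), (object_ids, d_agg)) pairs.
    """
    for hno in hno_sets:
        ids = tuple(oid for oid, _ in hno)
        for qid, qtype, xy in queries:
            yield (qid, qtype), (ids, aggregate_distance(qtype, xy, hno))


# Values taken from the running example: q_1 = (26, 4) issues a SAvgDQ.
set3 = [("r3", (17, 10)), ("s3", (17, 9)), ("t3", (16, 11))]
set4 = [("r4", (25, 6)), ("s4", (23, 6)), ("t4", (24, 4))]
print(round(aggregate_distance(SAVG, (26, 4), set4), 2))  # 2.61
print(round(aggregate_distance(SAVG, (26, 4), set3), 1))  # 11.1
```

The brute-force route search is factorial in the number of object types, which is tolerable only because n is small (the experiments use at most five types).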

Result Set Generating Phase
The goal of the last phase, the Result set generating phase, is to determine the HNO set with the shortest aggregate-distance for each query point in a distributed manner. Once a MapReduce job starts, the key-value pairs $\langle q_i^j, \{(o_1, o_2, \ldots, o_n), d_{agg}\} \rangle$ received from the previous phase are directly emitted in the map step. According to the key $q_i^j$, the HNO sets having the same $q_i^j$ are assigned to the same machine in the shuffle step, because their aggregate-distances to the query point $q_i$ need to be compared to determine the query result for $q_i$. For the machine receiving the key-value pairs with respect to $q_i$, the first task of the reduce step is to eliminate duplicate values of the form $\{(o_1, o_2, \ldots, o_n), d_{agg}\}$. The second task is to sort the HNO sets in ascending order of their aggregate-distance $d_{agg}$ and output the HNO set with the smallest $d_{agg}$ as the query result.
Figure 7 illustrates how the Result set generating phase is executed using the key-value pairs generated by the previous phase (shown in Figure 6). In the map step, all key-value pairs that use the query id as the key (e.g., $\langle q_1^1, \{(r_3, s_3, t_3), 11.1\} \rangle$) are emitted so that the pairs with the same key can be grouped together for processing after the shuffle step. For instance, the pairs $\langle q_1^1, \{(r_3, s_3, t_3), 11.1\} \rangle$ and $\langle q_1^1, \{(r_4, s_4, t_4), 2.61\} \rangle$ are assigned to the same machine because of their common key $q_1^1$. By executing the reduce step on each machine, the duplicates are first removed and then the HNO set with the shortest aggregate-distance for each query point is output. In this figure, the HNO set $\{r_4, s_4, t_4\}$ is the result for the query point $q_1$, and the HNO set $\{r_3, s_3, t_3\}$ is the result for the other three query points $q_2$, $q_3$, and $q_4$.
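The shuffle and reduce steps of this phase can be sketched in a few lines of Python. This is an illustrative single-process stand-in rather than the MapReduce job itself, and the name `select_results` is hypothetical. Duplicates are removed by keying each group on the tuple of object ids, and the minimum is taken directly instead of sorting, which yields the same HNO set as sorting in ascending order and taking the first entry.

```python
from collections import defaultdict


def select_results(pairs):
    """Pick the HNO set with the smallest aggregate-distance per query id.

    pairs: iterable of (query_id, (object_ids, d_agg)) key-value pairs,
    as produced by the Aggregate-distance computing phase.
    """
    # Shuffle: group values by query id. Keying the inner dict on the
    # object ids drops duplicate HNO sets reported by several cells.
    grouped = defaultdict(dict)
    for query_id, (object_ids, d_agg) in pairs:
        grouped[query_id][tuple(object_ids)] = d_agg

    # Reduce: keep the HNO set with the smallest d_agg for each query.
    return {
        query_id: min(candidates.items(), key=lambda item: item[1])
        for query_id, candidates in grouped.items()
    }


# Pairs for q_1 from the running example; the first one appears twice
# because the set {r3, s3, t3} was emitted under two cell keys.
pairs = [
    ("q1", (("r3", "s3", "t3"), 11.1)),
    ("q1", (("r3", "s3", "t3"), 11.1)),
    ("q1", (("r4", "s4", "t4"), 2.61)),
]
print(select_results(pairs))  # {'q1': (('r4', 's4', 't4'), 2.61)}
```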

Effect of Parameter α
The first set of experiments studies the effect of the number of objects enclosed in each cell (i.e., the parameter α) on the performance of processing the location-based aggregate queries, using the Uniform dataset and the Manchester dataset. In the experiments, we vary the value of α from 250 to 2000 and evaluate the average running time of the proposed MRAggQ algorithm. For both the Uniform dataset and the Manchester dataset, the average running time first decreases and then increases as α grows, as shown in Figure 8a,b, respectively. This is mainly because (1) for a smaller α (i.e., fewer objects in each cell), more cells need to be generated for storing object information, and thus each participating machine (i.e., the DataNode) spends more time processing the larger number of cells assigned by the NameNode, while (2) for a greater α (meaning that the number of cells decreases but the storage overhead of each cell increases), computing the distances between objects to determine the Inner and the Outer HNO sets with respect to each cell takes more processing time. As the parameter α dominates the performance of processing the location-based aggregate queries, we need to choose an appropriate value of α for partitioning the data space. For both the Uniform dataset and the Manchester dataset, the average running time increases noticeably after α = 1000. The experimental result shows that α = 1000 is a better choice than the other values, and hence it is used as the default value in all remaining experiments.

Effect of Number of Objects
The second set of experiments illustrates the performance of processing the location-based aggregate queries using the Uniform dataset (in which the number of objects varies from 1000 K to 5000 K) and the real dataset (including the Beijing, Manchester, Pittsburgh, and Charlotte files). As shown in Figure 9a, the average running time of the MRAggQ algorithm increases with the number of objects. The reason is that a larger number of objects results in more cells to be processed, so that the majority of the running time is spent on executing the Inner HNO set determining phase and the Outer HNO set determining phase. Nevertheless, because the location-based aggregate queries are processed in a distributed manner, the average running time for all cases remains below 0.25 s. As for the real dataset, shown in Figure 9b, the Beijing file contains fewer objects than the Manchester, Pittsburgh, and Charlotte files, but incurs the highest average running time. This is because the Beijing file has a denser object distribution than the other three files, which leads to more HNO sets to be considered in the Aggregate-distance computing phase and the Result set generating phase.

Effect of Number of Object Types
The third set of experiments investigates the impact of the number of object types (i.e., the value of n) on the performance of the MRAggQ algorithm. Figure 10a,b show the average running time of the MRAggQ algorithm for the Uniform dataset and the Manchester dataset, respectively, as n varies from 1 to 5. In the case where n = 1, implying that only a single type of object is considered, the processing cost required for determining whether the road distance between different types of objects exceeds the distance d can be completely avoided (that is, the first two phases of the MRAggQ algorithm do nothing). Moreover, the problem of processing the SAvgDQ, the SMinDQ, the SMaxDQ, and the SSumDQ is reduced to finding the nearest neighbor of the query object (i.e., the object with the shortest distance). When n is larger than 1, the Inner HNO set determining phase and the Outer HNO set determining phase need to be executed to find the HNO sets, as more than one type of object is processed. This is why the average running time of processing the location-based aggregate queries grows as the value of n increases. The experimental results also show that (1) the nearest neighbor query is a special case of location-based aggregate queries, where d = ∞ and n = 1, and (2) the proposed MRAggQ algorithm can be successfully applied to process the nearest neighbor query in a distributed manner.
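The reduction to the nearest neighbor query for n = 1 can be written out explicitly. The following is a sketch under the assumption that the average, minimal, and maximal distances are taken over the distances $d(q, o_i^j)$ of a set's members to the query point $q$, and that the sum distance is the length of the shortest route from $q$ through all members. For a single-object HNO set $\{o_1^j\}$, all four aggregates coincide:

```latex
\[
  \frac{1}{1}\sum_{i=1}^{1} d(q, o_i^j)
  \;=\; \min_{i=1} d(q, o_i^j)
  \;=\; \max_{i=1} d(q, o_i^j)
  \;=\; d(q, o_1^j),
\]
\[
  \text{so the query result is } \{o_1^{j^*}\}
  \text{ with } j^* = \arg\min_{j = 1 \sim m} d(q, o_1^j),
  \text{ i.e., the nearest neighbor of } q .
\]
```

With d = ∞ no object is excluded by the distance constraint, so every object forms a candidate set on its own and the minimization ranges over all objects of the single type.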

... $o_i^j \in O_i$, $i = 1 \sim n$, and $j = 1 \sim m$. Given a query point $q$, a set of objects $\{o_1^j, o_2^j, \ldots, o_n^j\}$ ... these $m$ HNO sets is determined, such that:
for the SAvgDQ, the average distance of $\{o_1^j, o_2^j, \ldots, o_n^j\}$ ... $\mid j = 1 \sim m\}$, where $d(q, o_i^j)$ refers to the distance between objects $o_i^j$ and $q$;
for the SMinDQ, the distance of an object $o_i^j \in \{o_1^j, o_2^j, \ldots, o_n^j\}$ to $q$ is equal to $\min\{\min\{d(q, o_i^j) \mid i = 1 \sim n\} \mid j = 1 \sim m\}$;
... is the shortest distance that, starting from $q$, visits each object in $\{o_1^j, o_2^j, \ldots, o_n^j\}$ exactly once.

Figure 1. Example of processing the location-based aggregate queries. (a) Heterogeneous objects; (b) Multiple queries.

Algorithm 1: The MRAggQ algorithm
Input: The n types of objects, and the set of m query points
Output: The result HNO set for each query point
/* The Inner HNO set determining phase */
finding the Inner HNO sets enclosed in cell(c);
determining the marked objects for cell(c);
/* The Outer HNO set determining phase */
finding the Outer HNO sets based on the marked objects;
combining the Inner and the Outer HNO sets;
/* The Aggregate-distance computing phase */
computing the average, min, max, or sum distances of the HNO sets to the m query points;
/* The Result set generating phase */
sorting the HNO sets according to their distances to each query point;
returning the HNO set with the shortest distance to each query point;

Figure 4. Illustration of the Inner HNO set determining phase.

Figure 6. Illustration of the Aggregate-distance computing phase.

Figure 9. Effect of the number of objects. (a) Uniform dataset; (b) Real dataset.

Figure 10. Effect of the number of object types. (a) Uniform dataset; (b) Manchester dataset.
Figure 5. Illustration of the Outer HNO set determining phase.