DISPAQ: Distributed Profitable-Area Query from Big Taxi Trip Data

One of the crucial problems for taxi drivers is how to efficiently locate passengers in order to increase profits. The rapid advancement and ubiquitous penetration of Internet of Things (IoT) technology into transportation industries enables us to provide taxi drivers with locations that have more potential passengers (more profitable areas) by analyzing and querying taxi trip data. In this paper, we propose a query processing system, called Distributed Profitable-Area Query (DISPAQ), which efficiently identifies profitable areas by exploiting the Apache Software Foundation's Spark framework and a MongoDB database. DISPAQ first maintains a profitable-area query index (PQ-index) by extracting area summaries and route summaries from raw taxi trip data. During query processing, it identifies candidate profitable areas by searching the PQ-index, and then exploits a Z-Skyline algorithm, an extension of skyline processing with a Z-order space filling curve, to quickly refine the candidate profitable areas. To improve the performance of distributed query processing, we also propose a local Z-Skyline optimization, which reduces the number of dominance tests by distributing killer profitable areas to each cluster node. Through extensive evaluation with real datasets, we demonstrate that our DISPAQ system provides a scalable and efficient solution for processing profitable-area queries over huge volumes of taxi trip data.


Introduction
Internet of Things (IoT) technology enables interconnections between large volumes of distributed and heterogeneous smart devices, allowing them to communicate seamlessly with users. Recently, IoT devices such as sensors, global positioning systems (GPSs), and cameras have become widely used in transportation industries. For example, several countries, such as the USA [1], Germany [2], Japan [3] and Korea [4], collect diverse data from taxis equipped with IoT devices. Data science includes the effective translation of data into novel insights, discoveries and solutions [5]. Big data analytics, as a major part of data science, enables us not only to provide intelligent services to customers, but also to improve the work efficiency and profitability of taxi drivers by analyzing the collected data.
Finding good taxi strategies for improving services and profits is one of the core applications in smart transportation [6]. Most existing approaches analyze collected GPS sensor data to extract taxi strategies, e.g., increasing traffic system efficiency [7], measuring graph-based efficiency of taxi services [8], understanding service strategies such as searching for passengers, passenger delivery, and service area preference [6], and finding good locations based on minimum cruising time [9][10][11], maximum profit [12], minimum cruising distance [10] and/or high passenger demand [13][14][15][16][17]. Broadly speaking, we believe that these approaches are all intended to find high-profit locations (occasionally, we use the terms "high-profit locations" and "profitable areas" interchangeably) for taxi drivers, although different methods have been proposed.
For passenger search-strategy improvements, a great deal of research has been done on finding profitable areas [9,[13][14][15][16][17][18][19][20]]. However, we observed that most existing solutions, which are based on clustering techniques [13,14,16,17] or statistical techniques such as autoregressive integrated moving average (ARIMA) [10], chi-square distribution [18], statistical learning [11], predictive distribution [15] and probability models [19], consider only one or two factors when finding profitable areas, although it is well known that various factors influence profitability. A profitability map approach [9] and a recommendation system approach [20] utilize multiple factors to find profitable areas. However, all existing approaches utilize a relatively small amount of taxi trip data, which fits into the memory of one machine.
A motivating example could intuitively illustrate the challenges for finding profitable areas.
Example 1. Consider a New York City taxi driver working in the areas shown in Figure 1. The taxi driver picks up a passenger in area A and delivers him to area B. After dropping the passenger in area B, he wants to find a new passenger, by either staying in area B or going to another area. For simplicity, we assume that the driver has four candidate profitable areas in which to search for new passengers: (1) move to area H near a subway station, (2) move to area I near a shopping district, (3) stay in area B, or (4) move to area G near a residential district.
In this example, we assume that three factors affect finding profitable areas: (1) profit, (2) cruising time, and (3) cruising distance. Figure 1b describes the example values for each candidate profitable area. If we consider profit factor only, then area I is the best location, since it has the highest profit. Area B could qualify as the best location when we consider both cruising time and cruising distance. If we consider three factors simultaneously, areas B, H, and I should be considered profitable areas. We see that all values for area G are worse than those of areas B, H, and I. Thus area G cannot be a profitable area. However, we cannot decide which one is better among areas B, H, and I. This is a typical scenario for the skyline query processing approach [21].
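The dominance reasoning in Example 1 can be sketched in a few lines of Python. The numeric factor values below are illustrative stand-ins for Figure 1b (not taken from the paper's data), chosen so that area G is dominated while B, H, and I are mutually incomparable; profit is maximized, while cruising time and distance are minimized.

```python
# Skyline (dominance) test over the candidate areas of Example 1.
# Values are illustrative assumptions, not the actual Figure 1b numbers.

def dominates(a, b):
    """True if area a is at least as good as b on every factor
    and strictly better on at least one factor."""
    at_least_as_good = (a["profit"] >= b["profit"]
                        and a["time"] <= b["time"]
                        and a["dist"] <= b["dist"])
    strictly_better = (a["profit"] > b["profit"]
                       or a["time"] < b["time"]
                       or a["dist"] < b["dist"])
    return at_least_as_good and strictly_better

def skyline(areas):
    """Keep every area not dominated by any other area."""
    return [a for a in areas
            if not any(dominates(b, a) for b in areas if b is not a)]

areas = [
    {"name": "B", "profit": 10, "time": 4, "dist": 1.0},  # best time/distance
    {"name": "G", "profit": 8,  "time": 9, "dist": 3.5},  # worse on everything
    {"name": "H", "profit": 12, "time": 6, "dist": 2.0},
    {"name": "I", "profit": 15, "time": 8, "dist": 3.0},  # highest profit
]

print(sorted(a["name"] for a in skyline(areas)))  # → ['B', 'H', 'I']
```

As in the example, G is dominated (by B) and is pruned, while B, H, and I remain incomparable and together form the skyline.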
In addition, we can provide better suggestions to taxi drivers if a profitable-area query system relies on their past experiences as recorded in taxi trip data. It is then necessary to process huge volumes of taxi trip data, because the amount of data increases quickly, especially with the numerous taxis that are active in a big metropolitan city. In order to build an efficient and scalable profitable-area query system, we need to address the following three challenges motivated by Example 1: (1) efficiently answer profitable-area queries, (2) find profitable areas by considering several factors simultaneously, and (3) deal with huge volumes of taxi trip data.
To address the above challenges, we propose a query processing system, called DISPAQ, which facilitates parallel processing by combining the Apache Software Foundation's Spark framework [22,23] and a MongoDB database [24]. First, to answer profitable-area queries efficiently, we devise a spatio-temporal data structure called a profitable-area query index (PQ-index). The PQ-index is a hash-based index that consists of two major components: (1) spatio-temporal hash keys and (2) extended route summaries. An extended route summary is a combination of area summaries and route summaries, where an area summary contains beneficial information about an area and a route summary manages expense information about a route. A query processor in DISPAQ utilizes the PQ-index to obtain candidate profitable areas. Second, we cast the problem of finding profitable areas with multiple factors as skyline query processing [21]. However, the pairwise point-to-point dominance test in skyline processing is time-consuming, so we exploit a Z-skyline method [25], which uses a Z-order space filling curve to cluster data points into blocks. The Z-skyline approach refines candidate profitable areas by checking for dominance, where a dominated area is one whose factor values are all worse than those of some other area. Third, to deal with large volumes of taxi trip data, we propose distributed processing to retrieve the final profitable areas. Thus, the construction of the PQ-index and the Z-skyline approach are implemented in a distributed manner using Spark and MongoDB. In addition, we devise an optimized shuffling of block-pruning data, which maximizes dominated-area elimination by sending killer areas to every node in the cluster.
This paper is an extended version of our previous publication [26]. We extend our previous work as follows: First, we provide complete formal definitions for a profitable-area query. Second, we present comprehensive algorithms for constructing the PQ-index and for distributed processing of profitable-area queries. In addition, we discuss proofs of theorems to validate the correctness of the algorithms and the complexity of the distributed algorithms. Finally, we perform experiments to demonstrate the efficiency of the DISPAQ system. We conduct an extensive performance evaluation with real taxi trip datasets from New York City and Chicago, USA.
Overall, the main contributions of this paper as a crucial part of data science can be summarized as follows:

• We propose a distributed profitable-area query processing system, called DISPAQ, for huge volumes of taxi trip data. The main goal of DISPAQ is to provide valuable profitable-area information to users, which is one of the main activities of data science.

• To quickly retrieve candidate profitable areas, DISPAQ organizes multiple factors about a profitable area into a spatio-temporal index called the PQ-index. We define and extract multiple factors from the raw taxi trip dataset collected from GPS sensors.

• DISPAQ executes an efficient Z-skyline algorithm to refine candidate profitable areas. The Z-skyline algorithm reduces unnecessary dominance tests and avoids pairwise dominance tests. The Z-skyline approach is implemented as a distributed algorithm to manage big taxi trip data.

• We propose an optimized method for distributed Z-Skyline query processing by sending killer areas to each node, which maximizes the filtering of dominated areas.

• We conduct extensive experiments on two large-scale real datasets from New York City and Chicago to determine the efficiency and effectiveness of DISPAQ. We compare our Z-Skyline query processing method with two basic skyline methods (block-nested looping and divide-and-conquer) in a distributed setting. The experimental results show that our approach outperforms the existing methods.
The remainder of this paper is organized as follows. Section 2 describes the related research work, and Section 3 provides the preliminaries of DISPAQ. Section 4 provides the design of PQ-index and the detailed steps for constructing it. In Section 5, we explain how to process a profitable-area query by exploiting the Z-skyline algorithm using a combination of Spark and MongoDB. Section 6 presents the performance evaluation results and a comparison of DISPAQ with its competitors. Finally, Section 7 concludes the paper.

Related Work
In this section, we briefly survey existing approaches and highlight their differences compared to our DISPAQ system. We broadly group the approaches into three categories based on functionality: (1) taxi passenger searching strategy, (2) taxi information data structure, and (3) skyline query.

Taxi Passenger Searching Strategies
As one of the crucial goals of taxi drivers is to carry as many passengers as possible, a variety of ways of finding highly profitable areas or recommending hot spots of taxi users have been suggested. Roughly, we can categorize the previous work into four categories: (1) clustering-based approaches, (2) statistical approaches, (3) specialized model approaches and (4) machine learning-based approaches. An extensive survey on mining taxi GPS traces can be found in [27].
In the first category, previous solutions extracted patterns of taxi drivers and predicted high-profit areas or routes as the result of passenger searching [13,14,16,28]. Lee et al. [13] utilized a K-means clustering method to extract hot spots from historical GPS taxi data. To discover the taxi demand distribution and predict hot spots, the iTaxi system studied the effects of three clustering methods: K-means clustering, agglomerative hierarchical clustering, and density-based spatial clustering of applications with noise (DBSCAN) [14]. Zhang et al. [28] proposed a novel spatio-temporal clustering algorithm to recommend the top-5 high-profit pickup areas. Recently, an improved DBSCAN algorithm [16] was proposed to recommend hot spot-based routes by analyzing short-dated GPS sensor data. However, the above research considered only passenger demand when recommending high-profit areas, and did not consider big data issues when dealing with large volumes of GPS sensor data.
In the second category, several research projects built prediction models for the passenger search problem [10,11,15,[29][30][31][32]]. An improved auto-regressive integrated moving average (ARIMA) scheme [10] forecasts high taxi-passenger-demand spots by using GPS traces. To predict the spatio-temporal distribution of taxi passenger demand, an online recommendation system based on time series forecasting techniques was proposed [29]. The same authors proposed a short-term time series prediction model for the number of services at a given taxi stand using streaming data [30]. T-Finder [11] is another recommendation system for both taxi drivers and passengers, which exploits a probabilistic model constructed from GPS trajectories of taxis. An incremental ARIMA model [15] predicts high passenger-demand spots by employing a learning model based on historical GPS data. Dong et al. [31] proposed a recommendation system that uses linear equations to compute the score of each road segment. To find the maximum-score cruising route, they also use a skyline computation to reduce the search space. However, they focused on recommending routes rather than profitable areas, and did not touch on big data issues. SCRAM [32] aims to provide recommendation fairness for a group of competing taxi drivers. It utilizes the expected driving cost (EDC) function with complex event probabilities. The above-mentioned methods mainly regard taxi trip data as time series and suggest recommendation systems based on time series prediction models. In contrast, DISPAQ focuses on profitable-area query processing, which requires efficiently managing huge volumes of taxi trip data. A distinctive feature of the DISPAQ system is that it returns a set of profitable areas, not just one profitable area.
In the third category, several specialized models [9,20] are utilized for determining the next cruising location. Powell et al. [9] defined a profitability score to construct a spatio-temporal profitability map.
Their system suggests profitable locations to reduce the cruising time of taxicabs by using a fixed, complex profitability formula. However, they do not tackle the issues that arise when dealing with huge volumes of taxi trips. Our DISPAQ system also uses the concept of a profitability map. However, DISPAQ first constructs a PQ-index from raw taxi trip data; a distributed construction algorithm is proposed to handle huge volumes of taxi trips. The profitability map of DISPAQ includes candidate profitable areas obtained by searching the PQ-index. By exploiting the PQ-index, DISPAQ can efficiently reduce the search space. Then, the distributed skyline query processing method is applied to the profitability map to refine the candidate profitable areas. Due to the skyline concept, DISPAQ returns a set of comparable profitable areas, not just one profitable area. Recently, two location-to-location graph models [20], an OFF-ON model and an ON-OFF model, were adopted to recommend the next cruising location by considering three factors. Although these two models consider three factors, they mainly rely on the transition probability from one location to another. When dealing with huge volumes of taxi trips, the two graph models cannot fit into memory, so performance degrades; this big data issue is not tackled by the method.
In the fourth category, several machine learning based approaches have been studied [33][34][35][36]. Time series analysis techniques based on non-homogeneous Poisson processes are utilized to predict short-term approximate local probability density functions of taxi stands [33]. DeepSD [35] exploits a novel deep neural network structure for short-term prediction of the gap between car-hailing supply and demand in a certain area. A reinforcement learning based system [36] is developed to learn from real trajectory logs of taxi drivers and to recommend profitable locations to the drivers. PRACE [34] is a deep learning based taxi recommender system for finding passengers. It treats the prediction task as a multi-classification problem rather than a regression problem. As mentioned in [37], deep learning technologies are good at predicting uncertain events. Since our research centers on a profitable-area query processing system, we mainly focus on efficient distributed algorithms that utilize the PQ-index and the skyline query concept. However, we believe that the above-mentioned deep learning methods could supplement our DISPAQ.
The goal of our DISPAQ system is similar to the aforementioned studies. However, our approach is different from the existing work in the following aspects: (1) We build a PQ-index for maintaining profitable area-related information. (2) We extend skyline query processing to retrieve profitable areas by considering multiple factors. (3) We devise a distributed algorithm for handling huge volumes of taxi trip data.

Taxi Information Data Structure
Another related topic of this paper is building efficient data structures for handling and analyzing taxi information. Generally speaking, tree-based indexes, hash-based indexes, or specialized data structures are exploited to efficiently maintain taxi information.
Several research attempts have been made to manage taxi information based on tree-based or hash-based indexes [38][39][40][41][42][43][44]. An adaptive quadtree [40] was used to store a trajectory dataset, and a combination of a BPR-Quadtree and a minhash index [41] was built for storing historical trajectory data. A kd-tree was utilized to provide passengers with the expected fare and trip duration [39] or to visualize New York City taxi trips by treating each taxi trip as a point in a k-dimensional space [38]. A light-weight spatial index based on geohash [42] was constructed to answer basic spatial queries such as containing, containedIn, intersects and withinDistance; the authors implemented the geohash index on San Francisco taxi traces. T-Share [43] is a taxi ride-sharing service that uses a spatio-temporal grid index to store an ordered taxi list in a location based on distance and arrival time. Huang et al. [44] suggested a kinetic tree to dynamically match real-time trip requests to servers in a road network to allow real-time ridesharing. A GPU-based index [45], a generalization of the kd-tree, was proposed to support complex spatio-temporal queries over large historical data. These complex spatio-temporal queries are basically select-from-where style queries that efficiently utilize the GPU-based index. However, the core operation of profitable-area queries is the dominance test, which requires not the generalized GPU-based index but the specialized PQ-index proposed in this paper.
In the second category, several specialized data structures have been devised to efficiently manage taxi information [46][47][48][49][50]. Nanocube [46] is an in-memory data cube structure for easily generating visual encodings such as heatmaps, histograms, and parallel coordinate plots from spatio-temporal datasets, including taxi trips. However, it was designed only to answer queries from interactive visualization systems, so it does not support profitable-area queries. A frequent trajectory graph [47] was invented to handle trajectory information for finding areas of high taxi-passenger demand. The querying and extracting timeline information system [48] builds a timeline query index (TQ-index) to manage traffic information according to a timeline model. A time-evolving origin-destination (O-D) matrix [49] deals with a continuous stream of GPS traces and maintains accurate statistics of interest. The O-D matrix focuses on monitoring the evolution of urban dynamics from GPS traces, whereas DISPAQ was designed to provide a distributed profitable-area query system. SigTrac [50] extracts traffic matrices from traffic sensor data and exploits a singular value decomposition (SVD) technique to process only traffic similarity queries.
Our DISPAQ system constructs a specialized index structure called a PQ-index. The PQ-index consists of extended route summaries, which are combinations of area and route summaries from raw taxi trip data to efficiently answer profitable-area queries. In addition, different from the above approaches, DISPAQ could build and utilize the PQ-index in a distributed way for handling huge amounts of taxi trip data from GPS sensors.

Distributed Skyline Query Processing
Since DISPAQ extends a skyline query processing algorithm to support profitable area queries, we briefly explain research efforts in distributed skyline query processing.
Several researchers have proposed processing skyline queries in a distributed way [51][52][53]. Afrati et al. [52] investigated parallel skyline processing based on a massively parallel (MP) model that requires the data to be perfectly load-balanced. A novel, enhanced distributed dynamic skyline (EDDS) technique [51] was proposed and implemented for wireless sensor networks. Zhou et al. [53] investigated probabilistic skyline queries over uncertain data in distributed environments. These researchers proposed solutions based on their own models, whereas DISPAQ utilizes the distributed processing functionalities of Spark [22,23] to answer profitable-area queries.
Some researchers have focused on computing skyline queries using the MapReduce framework [54][55][56][57][58]. Generally, MapReduce-based skyline processing consists of two parts: (1) computing local skylines and (2) finding global skylines. Since centrally finding global skylines from local skylines would bottleneck the whole process, various partitioning techniques were proposed. Zhang et al. implemented MapReduce-based block-nested looping (MR-BNL), MapReduce-based sort-filter skyline (MR-SFS), and MapReduce-based bitmap (MR-Bitmap) approaches [54]. MR-BNL and MR-SFS showed better performance in most cases, although they do not work well for high-dimensional data due to point-to-point dominance tests. An MR-Angle approach [55] used grid partitioning of the data space to reduce the processing time when selecting optimal skyline objects. A SKY-MR method [57] built a sky-quadtree and a risky-quadtree to effectively prune non-skylines and non-reverse skylines. This pruning method also helps load-balance computations. Mullesgaard et al. designed a grid-partitioning technique to divide the data space into partitions and represented each partition as a bitstring [56]. The bitstring helps prune partitions that cannot contain skyline tuples. Recently, Koh et al. [58] proposed dominator reduction rules for limiting the number of dominance tests, and a data sample-selection algorithm for optimizing the local skyline process.
Our DISPAQ system is different from the aforementioned approaches in the following aspects. First, we focus on retrieving profitable areas based on distributed skyline query processing. Second, we obtain candidate profitable areas by exploiting the PQ-index, which limits the search space. Third, we utilize a Z-Skyline algorithm to refine candidate profitable areas. Due to its monotonic ordering property and automatic clustering property, Z-skyline avoids unnecessary dominance tests and pairwise dominance tests among profitable areas.

Preliminaries
In this section, we first present the notations frequently used in the paper. Then, we provide a basic definition of taxi trip data and explain the overall architecture of the DISPAQ system.

Notations
For reference, Table 1 shows our frequently used notation. Each definition will be reintroduced when first used in the paper.

Taxi Trip Data
Recently, taxis in many urban cities have been equipped with GPS sensors for recording trip information. Thus, we conducted our study on two real-world datasets collected in New York City and the City of Chicago. Table 2 shows a snippet of the New York City taxi trip dataset. Each row in the dataset describes a distinct taxi trip, including locations, time stamps and taxi fare information.
The formal definition of a taxi trip is as follows.
Definition 1. (Taxi trip) Each taxi trip T is denoted as an 8-tuple (t_p, t_d, l_p, l_d, d, fa, tia, toa), where t_p and l_p refer to the pickup time/location at the beginning of a trip, t_d and l_d are the drop-off time/location at the end of the trip, d is the distance of the trip, fa is the fare amount, tia is the tip amount, and toa is the tolls amount.
Since large numbers of taxi trips contain wide variations of GPS coordinates, we utilize geohash [59] to divide geographic regions into a hierarchical structure. Geohash separates areas into grid cells using a Z-order curve, which enables us to divide and merge areas (regions) elastically. Thus, we can easily compute aggregate values in a region due to the characteristics of geohash, and reduce the computation time for obtaining aggregated values in the region. We formally define an area based on geohash as follows.

Definition 2.
(Area) An area is regarded as a group of exact locations, defined as ar = geohash(l_lat, l_long, len), where l_lat and l_long are the latitude and longitude of a location, and len is the length of the geohash code for the area.
Note that a location is an exact point for a taxi trip and an area means a region that might include several taxi trips.
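To make the geohash and Z-order ideas concrete, the following is a minimal sketch of the standard geohash encoding, which interleaves longitude and latitude bisection bits (a Z-order curve) and emits base-32 characters. This is an illustration of the well-known algorithm, not DISPAQ's implementation.

```python
# Minimal geohash encoder: interleave longitude/latitude bits (Z-order)
# and emit 5 bits per base-32 character. Longer codes refine shorter ones,
# so nearby locations share a common prefix.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, length):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, bits, ch = [], 0, 0
    even = True                      # even bit positions encode longitude
    while len(code) < length:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:               # upper half -> bit 1
            ch |= 1
            rng[0] = mid
        else:                        # lower half -> bit 0
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:                # 5 bits complete one base-32 character
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)

# Pickup location of trip T_1 in Example 2 (Times Square vicinity):
print(geohash(40.75492, -73.98278, 7))
```

A shorter code is always a prefix of a longer one for the same point, which is exactly the property that lets DISPAQ divide and merge areas elastically.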
As we can see, each taxi trip has two locations: pickup and drop-off. Generally, the actual route of a taxi trip requires many GPS coordinates from the pickup area to the drop-off area. However, in this paper, we define a route, rt, of each taxi trip as the pair (pickup area, drop-off area). Since an area is denoted as a geohash code, route rt is also represented by two geohash codes. The aforementioned definitions are explained in Example 2.
Example 2. The first row (the first taxi trip T_1) in Table 2 has (−73.98278, 40.75492) as the pickup location and (−73.18142, 40.68773) as the drop-off location. Assume that the length of the geohash code is 7. This means each area covers an approximately 150 m × 150 m region. If we apply the geohash algorithm, then the geohash values of the pickup location and drop-off location are represented as dr5ryp and dr5ryn, respectively. Thus, the route of T_1 is a pair of origin and destination areas (dr5ryp, dr5ryn). To simplify areas and routes in later examples, we abbreviate area ar_B as B and route rt_{dr5ryp−dr5ryn} as B − G. This notation and visualization are illustrated in Figure 1.

Figure 2 shows the high-level architecture of DISPAQ. The key components of DISPAQ are the PQ-index constructor, the query processor, the Hadoop Distributed File System (HDFS) and the MongoDB document store. The PQ-index constructor transforms raw taxi trips into aggregated values, then builds a PQ-index based on area information and route information. The query processor executes a profitable-area query with the current location and time from a user. The results are returned to the user in two steps: (1) a profitability map computation phase and (2) a refinement phase for pruning candidate profitable areas. DISPAQ exploits the parallel processing of Spark and a MongoDB NoSQL document store: HDFS stores the raw taxi trip data, and MongoDB stores and serves the PQ-index. One of the key characteristics of DISPAQ is distributed processing of profitable-area queries by combining Spark and the MongoDB document store.

Figure 3 depicts the three main physical components of DISPAQ: (1) the client, (2) a commodity server acting as the Spark master and (3) commodity servers acting as worker nodes. Profitable-area query processing mainly relies on Spark. The client runs a Spark driver application, which receives a profitable-area query from a user and sets a Spark configuration.
The Spark driver manages the job flow, schedules tasks, and remains available the entire time the application is running. When the configuration is completed, the configuration information is sent to one of the commodity servers, which includes a Spark cluster manager and works as the master node. The other commodity servers, working as slave nodes, have executors, which are responsible for executing work in the form of tasks, as well as for storing any data. Specifically, these executors construct the PQ-index and also execute profitable-area queries. MongoDB stores the PQ-index across commodity servers (shards). Thus, one of the commodity servers becomes the MongoDB master (mongos) and a config server at the same time. The other servers are MongoDB shards, each storing a subset of the PQ-index. With this high-level overview of the system, we now explain the process of PQ-index construction and the processing of profitable-area queries.

Constructing a Profitable Area Query Index
This section presents our profitable area query index (PQ-index) and explains how to build a PQ-index from raw taxi trip data.

Components of the PQ-Index
Our DISPAQ system executes a profitable-area query in two steps: (1) collecting candidate profitable areas into a profitability map and (2) refining the candidate areas via extended skyline query processing. Since the values of raw taxi trip data change dynamically depending on the current time and location, it is difficult to obtain a profitability map immediately without checking all possible candidates. The intuition behind the PQ-index is to pre-compute all possible combinations of candidate areas before executing a user query.
A PQ-index is a hash-based spatio-temporal index structure that maintains aggregated taxi trip information for retrieving candidate profitable areas efficiently. The PQ-index consists of three major components: (1) a spatio-temporal hash key, which helps to quickly identify aggregated taxi trip information; (2) an area summary, which contains calculated profits from an (origin) area at a particular time; and (3) extended route summaries, which are combinations of route summaries and (destination) area summaries for managing the computed profits of routes in an area. The profits are calculated by considering the average benefits and expenses of routes from the area. Figure 4 depicts a logical (conceptual) design of the PQ-index. As explained, a spatio-temporal hash key has two main elements connected by two pointers: (1) an orange box connecting to an origin area summary and (2) a green box connecting to extended route summaries. We now describe each of these PQ-index components in detail.
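Before detailing each component, one PQ-index entry might be laid out as a MongoDB-style document along the following lines. All field names and values here are illustrative assumptions for exposition, not DISPAQ's actual schema.

```python
# Hypothetical sketch of a single PQ-index entry as a MongoDB document.
# Field names are assumptions; the structure mirrors the three components
# described in the text: hash key, area summary, extended route summaries.
pq_entry = {
    # (time period, area code) spatio-temporal hash key
    "_id": {"time_period": "09:51-10:00", "area": "dr5ryp"},
    # benefit side: calculated profits from the (origin) area
    "area_summary": {
        "avg_fare": 12.4,
        "pickup_prob": [0.12, 0.08, 0.10],  # one value per time point in the period
        "passenger_demand": 0.03,
    },
    # one extended route summary per route leaving the area
    "extended_route_summaries": [
        {
            "route": ("dr5ryp", "dr5ryn"),  # (pickup area, drop-off area)
            "route_summary": {"avg_time": 9.5, "avg_dist": 2.1},  # expense side
            "dest_area_summary": {"avg_fare": 10.1, "passenger_demand": 0.02},
        },
    ],
}

print(pq_entry["_id"])
```

Keying the document on the (time period, area code) pair lets a lookup for "area dr5ryp around 09:57" fetch all pre-aggregated benefit and expense figures in a single hash probe.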

Spatio-Temporal Hash-Key Definition
The PQ-index uses a spatio-temporal hash key, a pair (time period, area code): the area code records the geohash code of a location, and a fixed time interval is used as the time period. Since a profitable area needs two input parameters, as explained in Definition 7, we choose the pair (time period, area code) as the hash key of the PQ-index. For each spatio-temporal hash key, the PQ-index stores the computed profits of routes and an area in an extended route summary.
An area code and a time period are used as the major input parameters for summarization because the aggregated values differ from one area to another at different times. As explained in Definition 2, we use a geohash code to denote a specific group of locations, since areas are static. On the other hand, a time period is a dynamic feature, which should be determined after analyzing raw taxi trip data. Figure 5 depicts the distributions of New York taxi trip data in one specific area on Fridays during September 2015. When the size of a time period is set to one minute, as shown in Figure 5a, the total number of time periods (buckets) is 1440 (= 60/h × 24 h). The average number of trips per time period (x̄_{n_1}) is 2.6, and the maximum number of taxi trips in a time period is only 12. When the size of a time period is 30 min, as shown in Figure 5d, the total number of time periods is 48 (= 2/h × 24 h), and the average number of trips per time period (x̄_{n_30}) is 79.2. Another consideration when deciding the size of a time period is the average travel time from the taxi trip dataset sample, x̄_tt. The average taxi driver finishes a trip in 14 min, according to the New York taxi trip dataset. Thus, we should set a reasonable value for the size of a time period. Equation (1) determines the size of a time period (= time interval) by simultaneously considering two features: the average taxi trip frequency during the time period and the average travel time for the taxi trip dataset. In other words, the size is set to the minimum value of i that satisfies two conditions: (1) the average number of trips (x̄_{n_i}) should be larger than the product of the number of candidate areas, n_PA, and a frequency n_freq; and (2) i should be smaller than the average travel time of taxi trips (x̄_tt). The time period is a basic unit of DISPAQ for retrieving profitable areas.
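The selection rule just described can be sketched as follows. The function name, candidate sizes, and parameter values below are illustrative assumptions used to instantiate the rule, not DISPAQ's actual implementation.

```python
# Pick the smallest interval size i (in minutes) such that the average
# number of trips per bucket exceeds n_PA * n_freq while i stays below
# the average travel time, per the rule described in the text.

def period_size(total_trips, candidate_sizes, n_pa, n_freq, avg_travel_time):
    """total_trips: number of trips in the (one-day) sample."""
    for i in sorted(candidate_sizes):
        buckets = (24 * 60) // i              # buckets in a day for size i
        avg_trips = total_trips / buckets     # x̄_{n_i}: mean trips per bucket
        if avg_trips > n_pa * n_freq and i < avg_travel_time:
            return i
    return None

# Illustrative sample: 3744 trips in a day (matching the 1-min average of
# 2.6 trips per bucket mentioned above); assumed n_PA = 5, n_freq = 5,
# and a 14-min average travel time.
print(period_size(3744, [1, 5, 10, 30], 5, 5, 14))  # → 10
```

With these assumed parameters, 1- and 5-minute buckets are too sparse (averages 2.6 and 13.0), 30 minutes exceeds the travel-time bound, and 10 minutes is the first size satisfying both conditions.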
For example, if a taxi driver specifies a query at 09:57 in New York City, it belongs to time period [09:51, 10:00]. Then, DISPAQ provides several profitable areas using 10-minute intervals, which can be computed from the current time.

Area Summary
Since an extended route summary is a combination of an area summary and a route summary, we shall provide detailed explanations for these summaries. We begin with an intuitive observation. Taxi drivers plan their own routes after dropping off a passenger. They would like to select an area that guarantees high average fares and high passenger demand with a short waiting time. Their decisions for making high profits depend on area and time. A driver may know some candidate areas from previous experience at the current location and time. They might then estimate taxi-passenger demand in those candidate areas. Finally, they decide on one area, relying on past experience to maximize profit.
To mimic a taxi driver's decision process, a PQ-index needs two pieces of summary information. An area summary maintains all candidate areas that are computed from raw taxi trip data. To quickly identify candidate profitable areas, we pre-compute aggregated values for all combinations of (area, time) pairs. The PQ-index also utilizes the pair (area, time) as a spatio-temporal hash key.
Based on the above observation, we formally define an area summary as follows. Given the formal definition for an area summary, we shall explain how to calculate the elements of the area summary. Note that a pair comprising area ar and a particular time tp is a spatio-temporal hash key for locating elements of an area summary.
Equation (2) explains how to compute the average fare from a taxi trip dataset. First, we compute a total sum of fares by summing up fare amount f a and tip amount tia from each taxi trip. Then, we divide this sum by the total number of taxi trips that start from area ar in time period tp.
The second element of an area summary is a list of pickup probabilities at each time point during the time period. We obtain the pickup probability for each time point t_i in time period tp as shown in Equation (3). Figure 6 illustrates how to compute pickup probabilities. Assume that time period tp is an interval from t_1 to t_n. During tp, several trips may start from area ar. For example, two taxi trips T_1 and T_4 start at time point t_1, and taxi trip T_2 begins at time point t_2. We store the number of taxi trips for each time point t_i in n_{ar,t_i}. The total number of trips during the time period, n_{ar,tp}, is the sum of all n_{ar,t_i}. Each t_i ∈ tp can be the beginning time of a trip, so we calculate the probability of each time point t_i by dividing n_{ar,t_i} by n_{ar,tp}. Passenger demand is a probability defined in Equation (4): we divide the number of trips that started from ar in time period tp by the number of trips from all areas in time period tp:

AS_{ar,tp}.pd = n_{ar,tp} / n_{tp}    (4)

Example 4. Figure 7 illustrates how to compute an area summary from a snippet of taxi trips. These taxi trips are the same dataset as in Table 2.
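Equations (2)-(4) above can be sketched in a few lines; the dict fields fare, tip, and pickup_time are an assumed schema, not the paper's actual data layout:

```python
from collections import Counter

def area_summary(trips, n_trips_all_areas):
    # trips: the trips starting in area `ar` during time period `tp`;
    # n_trips_all_areas: trips from all areas in the same time period.
    n = len(trips)
    mu_f = sum(t["fare"] + t["tip"] for t in trips) / n          # Eq. (2)
    counts = Counter(t["pickup_time"] for t in trips)
    pickup_probs = {ti: c / n for ti, c in counts.items()}       # Eq. (3)
    pd = n / n_trips_all_areas                                   # Eq. (4)
    return {"mu_f": mu_f, "pickup_probs": pickup_probs, "pd": pd}
```

With three trips whose fares plus tips are 12, 8, and 10, the average fare is 10.0; if 10 trips started from all areas in the period, the passenger demand is 0.3.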

Route Summary Calculation
Since the taxi trip dataset includes millions of routes, there exist several routes that have the same pickup area and drop-off area. These repeated routes can be summarized to provide valuable information when deciding on profitable areas. This leads us to the following definition for a route summary. Based on Definition 5, we calculate the elements of a route summary from repeated taxi trips. Note that route rt and time period tp play the key role in identifying a route summary. The average distance is calculated with Equation (5): we compute the total sum of trip distances over the repeated routes and divide it by the number of routes (n_{rt,tp}).
The average travel time can be computed with Equation (6). For each taxi trip i, we first compute travel time by subtracting pickup time t p i from drop-off time t d i . The total travel time is the summation of the travel time from each route rt during time period tp. Then, we divide the total travel time by the total number of routes (n rt,tp ) to obtain the average travel time.
The average expense is computed with Equation (7). Since the taxi trip datasets we used do not include fuel fees, we use a simple model in which the fuel fee is proportional to the distance. In Equation (7), f_fuel is the cost of gas per kilometer. We sum the fuel fees and toll fees (toa_i) of each route rt during time period tp, then divide the total by the number of routes (n_{rt,tp}).
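Equations (5)-(7) can be sketched together; the fields distance, pickup, dropoff, and toll are an assumed schema, with times in minutes since midnight:

```python
def route_summary(trips, fuel_cost_per_km):
    # trips: the repeated trips of one route `rt` in one time period `tp`.
    n = len(trips)
    mu_d = sum(t["distance"] for t in trips) / n                     # Eq. (5)
    mu_tt = sum(t["dropoff"] - t["pickup"] for t in trips) / n       # Eq. (6)
    mu_c = sum(fuel_cost_per_km * t["distance"] + t["toll"]
               for t in trips) / n                                   # Eq. (7)
    return mu_d, mu_tt, mu_c
```

For two repeated trips of distances 10 and 14 km with travel times 15 and 13 min, the summary is (12.0, 14.0) for distance and travel time, and the expense follows from the fuel rate and tolls.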

Extended Route Summary
If an area summary and a route summary are managed and stored separately, we need to access these summaries in two steps to retrieve candidate profitable areas, as depicted in Figure 9. When a user provides a current area and a current time to our system, DISPAQ first checks route summaries that start from the user-specified area. Next, it estimates an expected arrival time and a candidate area from each route summary. Then, it searches area summaries to obtain the benefits and expenses of the candidate area by using the pair (candidate area, expected arrival time) as a spatio-temporal key.
Figure 9. Two steps for constructing a profitable area map.
To fetch candidate area information in one step, we propose an extended route summary, which is a combination of an area summary and a route summary. Abstractly, a route summary is augmented with the area summaries that are retrieved with the pair (drop-off area of a route, expected arrival time period). Figure 10 presents an example of an extended route summary. A spatio-temporal hash key has two elements, denoted as a green box and an orange box. The green box is a pointer to a set of extended route summaries. Each extended route summary has two pointers to the minimum and maximum time periods, and a destination area summary is connected to each time period. The orange box is a pointer to an area summary that contains the aggregated taxi trip information of the origin area. Each dashed rectangle represents an extended route summary, i.e., a combination of a route summary and area summaries. Formally, we define an extended route summary as follows.

Definition 6. (Extended Route Summary)
An extended route summary ERS_{rt,tp} is a 5-tuple (RS_{rt,tp}, tp_min, AS_{ar_d,tp_min}, tp_max, AS_{ar_d,tp_max}), where rt is a route from origin area ar_o to destination area ar_d and tp is a time period. For the given route rt and time period tp, we calculate and maintain the following attributes as an extended route summary: (1) RS_{rt,tp} is a route summary; (2) tp_min is the time period of the first partition of the expected arrival times; (3) AS_{ar_d,tp_min} is the area summary of destination area ar_d for time period tp_min; (4) tp_max is the time period of the second partition of the expected arrival times; (5) AS_{ar_d,tp_max} is the area summary of destination area ar_d for time period tp_max.
To augment a route summary with area summaries, we first need to compute the expected arrival times and decide the time period(s) for them. Since each route summary is associated with time period tp containing time points [t_1, t_2, ..., t_n], we compute the expected arrival time by adding the average travel time µ_tt of the route summary to each time point t_i ∈ tp. The expected arrival time is used as the time period for the augmented (destination) area summary. Since we use a time period with a specific length computed with Equation (1), we must consider non-split and split cases when adding an area summary to a route summary. The non-split case happens when the range of expected arrival times is fully included within a single time period; since there is only one time period, we simply add an area summary to the route summary for that period. Otherwise, we split and map the range of expected arrival times into two time periods: the first is denoted tp_min and the second tp_max. For each time period, we connect the area summary to the route summary. For example, if the expected arrival times are computed as 10:11, 10:12, ···, 10:20, the time period of the expected arrival times is [10:11, 10:20]. This time period is fully included in a time period used in DISPAQ, so we do not need to split it; the period [10:11, 10:20] is used for the (destination) area summary.
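The split / non-split decision can be sketched as follows, with times in minutes since midnight and periods aligned as [k·size, (k+1)·size). This alignment is our simplification of the paper's [10:01, 10:10]-style periods:

```python
def arrival_time_mapping(tp_start, size, mu_tt):
    # Expected arrival times: each time point in the period plus the
    # route's average travel time mu_tt (all values in minutes).
    arrivals = [t + mu_tt for t in range(tp_start, tp_start + size)]
    tp_min = (arrivals[0] // size) * size    # period of earliest arrival
    tp_max = (arrivals[-1] // size) * size   # period of latest arrival
    if tp_min == tp_max:
        return tp_min, None                  # non-split: one period suffices
    return tp_min, tp_max                    # split: map to two periods
```

With 10-minute periods, a travel time of exactly 10 min maps all arrivals into the single next period (non-split), whereas a 14-minute travel time straddles two periods (split).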
However, an exceptional case sometimes arises where the area summary of a destination area is empty. This happens if no taxi trip starts from the destination area during the time period of the expected arrival times; we remove such a destination area from the candidate profitable areas due to lack of information. The left pointer of the first hash key is used to visit an area summary, represented as a yellow box. Area B has an average fare of $18.40, a passenger-demand probability of 0.2 during 10:01-10:10 on Friday, and a set of pickup probabilities (10:05, 0.3), (10:06, 0.2), etc. By following the right pointer of the first hash key, we can obtain a set of extended route summaries. Repeated routes are aggregated into route summaries, represented as green rectangles. Candidate profitable areas can be retrieved effectively by accessing the area summaries connected to the route summaries.

Distributed PQ-Index Construction
In this subsection, we shall explain how to construct a PQ-index from raw taxi trip data that corresponds to the definitions in Section 4.1.
To handle huge volumes of taxi trips efficiently, we devised a distributed PQ-index construction. Figure 13 illustrates the overview of PQ-index construction in our DISPAQ system, which is implemented on top of Spark. DISPAQ starts the construction process when a driver application in a client sends a command to a cluster manager (master) of Spark ( 1 ). The Spark cluster manager sends a configuration to all commodity servers that will function as worker nodes ( 2 ) or a MongoDB master ( 3 ). Note that in this configuration we use one commodity server simultaneously as the Spark master and the MongoDB master; these masters can also be installed on different commodity servers. Worker nodes process all of the distributed PQ-index construction steps, whereas the MongoDB master prepares shard servers to store the PQ-index as the final result of the worker node jobs ( 4 ). Again, the same commodity servers used by Spark serve as MongoDB shards (nodes). Algorithm 1 explains the detailed steps executed on the commodity servers of Spark. A worker node executor reads huge volumes of taxi trip data stored in HDFS and extracts taxi trip information TI (Line 1, denoted as 5 ). Note that circled numbers refer to Figure 13. During the extraction process, the executor initializes the summary data structures and removes unused attributes. Then, the executor groups taxi trip information based on the pair (pickup area, time period) for the area summary (Line 4) and on the pair (route, time period) for the route summary (Line 5). After grouping, the executor computes all possible combinations of the area summary and route summary (Lines 8 and 9, denoted as 6 and 7 ). The extended route summary is built by connecting the area summary and route summary (Line 10 and 8 ). Then, the distributed PQ-index construction is completed by merging the area summary and the extended route summary that share an identical key (Line 11 and 9 ).
Finally, the executor sends a constructed PQ-index to MongoDB shards (Line 12 and 10 ).
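Algorithm 1's grouping and merging steps can be emulated on a single machine; in the actual Spark job, the dict-based grouping below would be replaced by groupByKey transformations over RDDs. The field names pickup_area, dropoff_area, and tp are illustrative, and summaries are kept as raw trip groups for brevity:

```python
from collections import defaultdict

def build_pq_index(taxi_trips):
    # Group trip information by (pickup area, time period) for area
    # summaries and by (route, time period) for route summaries
    # (Lines 4-5 of Algorithm 1).
    area_groups = defaultdict(list)
    route_groups = defaultdict(list)
    for t in taxi_trips:
        area_groups[(t["pickup_area"], t["tp"])].append(t)
        route_groups[((t["pickup_area"], t["dropoff_area"]), t["tp"])].append(t)
    # Merge on the spatio-temporal hash key (origin area, time period),
    # mirroring the merge step before writing to MongoDB shards.
    pq_index = defaultdict(lambda: {"area_summary": None, "routes": {}})
    for (area, tp), trips in area_groups.items():
        pq_index[(area, tp)]["area_summary"] = trips
    for ((origin, dest), tp), trips in route_groups.items():
        pq_index[(origin, tp)]["routes"][(origin, dest)] = trips
    return dict(pq_index)
```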

Algorithm 2: Build an Area Summary
Input: AG: a tuple (key, L), where key is a pair (area, time period) and L is a list of taxi information
Output: ASP: a pair (spatio-temporal hash key, an area summary AS)
1  Initialize AS as an area summary;
   // calculate area summary values
2  AS_key ← AG_key;
3  AS_key.µ_f is calculated from the group AG_{key,L};   // Equation (2)
4  AS_key.List is computed from the group AG_{key,L};    // Equation (3)
5  AS_key.pd is calculated from the group AG_{key,L};    // Equation (4)
6  ASP ← pair(AG_key, AS);
7  return ASP;

Algorithm 3 presents the steps for building a route summary. It implements Equations (5)-(7) to calculate the elements of a route summary (Lines 3-5). The algorithm not only calculates the elements of a route summary but also computes the time intervals by considering the split and non-split cases explained in Figure 11 (Line 8). It then returns RSP for easier construction of the extended route summaries.

Algorithm 3: Build a Route Summary
Input: RG: a tuple (key, L), where key is the pair (route, time period) and L is a list of taxi information
Output: RSP: a tuple (a pair (ar, tp), area, first time period, second time period, a route summary)
1  Initialize RS as a route summary;
   // compute elements of a route summary
2  RS_key ← RG_key;
3  RS_key.µ_d is calculated from the group RG_{key,L};   // Equation (5)
4  RS_key.µ_tt is computed from RG_{key,L};              // Equation (6)
5  RS_key.µ_c is calculated from RG_{key,L};             // Equation (7)
6  a destination area ar_d ← RG_key.getDestArea();
7  an origin area ar_o ← RG_key.getOriginArea();
   // compute two time intervals: tp_min and tp_max
8  ArrivalTimeMapping(tp_min, tp_max, RS_key);
   // make an RSP with time intervals for the extension
9  a spatio-temporal hash key hkey ← a pair (ar_o, RG_key.tp);
10 RSP ← a tuple (hkey, ar_d, tp_min, tp_max, RS);
11 return RSP;

Algorithm 4 illustrates the process for building an extended route summary, as explained in Section 4.1.4. We augment a given route summary with two area summaries based on the destination area of the route and the expected arrival time period (Lines 3-6).

Complexity Analysis of PQ-Index Construction
In this subsection, we analyze the complexity of the distributed PQ-index construction method by providing a serial execution cost and then a distributed execution cost. We use a cost model similar to that used for finding the k-most promising products (k-MPP) [60].
To construct a PQ-index on a single commodity server, DISPAQ executes the steps explained in Section 4.2. First, it extracts taxi trip information TI from the raw taxi trip dataset by removing data unrelated to profitable areas. For a given taxi trip dataset D, let T_ext(D) be the time to extract the taxi trip information, and let |TI| denote the size of the extracted taxi trip information. Next, DISPAQ generates area summaries and route summaries from TI and builds extended route summaries by augmenting a route summary with area summaries. The summary construction times are T_as(|TI|) and T_rs(|TI|), and the total sizes of the area summaries and route summaries are denoted by |AS| and |RS|. Then, DISPAQ combines the extended route summary and the area summary based on the pair (area, time period), which is the spatio-temporal hash key. The extended route summary construction time is T_ers(|RS| + |AS|), and the execution time of the merge step in Algorithm 1 is denoted by T_merge. Equation (8) gives the runtime complexity of constructing the PQ-index by summing the average runtimes of the sub-processes on a single commodity server. The runtime complexity of constructing a distributed PQ-index can then be computed as follows.
Assume that N commodity servers are used for the distributed construction and each server holds an equally divided subset of the data. We ignore the implementation overhead of synchronization and data communication among the servers. Equation (9) gives the complexity of distributed PQ-index construction in an N-commodity-server environment.
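Equations (8) and (9) are not reproduced in this excerpt; based on the terms defined above, a plausible reconstruction (our assumption, not the paper's exact statement) is:

```latex
% Eq. (8): serial PQ-index construction cost on one server
T_{PQ} = T_{ext}(D) + T_{as}(|TI|) + T_{rs}(|TI|)
       + T_{ers}(|RS| + |AS|) + T_{merge}

% Eq. (9): cost with the data split evenly over N servers
T^{N}_{PQ} = T_{ext}\left(\frac{D}{N}\right)
           + T_{as}\left(\frac{|TI|}{N}\right)
           + T_{rs}\left(\frac{|TI|}{N}\right)
           + T_{ers}\left(\frac{|RS| + |AS|}{N}\right) + T_{merge}
```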

Processing Profitable-Area Query
In this section, we shall explain how to find profitable areas when a user query is given to DISPAQ.
The processing of a profitable-area query is executed in two steps: (1) retrieve candidate profitable areas into a profitability map by utilizing the PQ-index and (2) refine candidate profitable areas in the profitability map by exploiting extended skyline query processing.

Profitable-Area Query
As explained in Section 1, several factors affect taxi drivers' strategies to determine profitable areas that guarantee more passengers. Since our DISPAQ system solves this problem based on skyline query processing, we define three major terms under the concept of skyline query processing.
To formulate a profitable-area query, we begin by defining a profitable area.

Definition 7. (Profitable Area)
A profitable area PA ar,tp is defined by a 4-tuple (p, pd, t cr , d cr ). The input parameters ar and tp mean area and time period, respectively. The profitable area PA ar,tp contains four aggregated values: (1) p as profit, (2) pd as passenger demand, (3) t cr as cruising time, and (4) d cr as cruising distance.
Profitable area PA ar,tp contains the aggregated values of an area that follows Definition 2 and is denoted by a geohash value. Several factors affect taxi drivers' passenger search strategies.
Thus, we chose four factors from the taxi trip data explained in Section 3.2. The aggregated values of these factors are calculated for an area ar and a time period tp, since the values vary with each pair (area, time period): p denotes the approximate income for a taxi driver who picks up passengers in area ar during time period tp; pd denotes the probability of a taxi driver finding passengers in area ar compared with other areas within the same time period tp; t_cr is the average elapsed time for taxi drivers to reach passengers in area ar from the current area of the input query; and d_cr is the distance between area ar and the current area of the input query. How to compute these four values is explained in Section 5.
A profitability map PM_{ar,tp} is a set of profitable areas. After DISPAQ receives the current location and current time from a user, it computes a profitability map that contains candidate profitable areas for the pair (current location, current time). A visualization example of a skyline from taxi trip data is illustrated in Figure 14. To simplify the problem, we consider only two factors (dimensions) from the table in Figure 1b when deciding the skyline of profitable areas. When we read the first row of the table, area B is considered part of the skyline because there are no other areas for comparison. Then, we read area C and find that it is dominated by area B because it has both a longer cruising time and a longer cruising distance; the same holds for areas D, E, F and G. Next, when we read area H, we regard it as an element of the skyline because its cruising distance is smaller than area B's, although its cruising time is longer. We also consider area I as an element of the skyline, because it is better than the other skyline areas in the cruising-distance factor. Finally, we decide that areas B, H and I are the skyline areas. Every time we read a taxi trip, we need to check its dominance against every other taxi trip using all dimensions. By applying such a dominance test, we ensure that a profitable area is not dominated by any other profitable area.
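The dominance test described above can be made concrete with a minimal sketch. We assume each area is scored on dimensions where smaller is better (e.g., cruising time and cruising distance); the numeric values in the usage note are illustrative, not taken from the paper's table:

```python
def dominates(a, b):
    # a dominates b: no worse in every dimension and strictly better
    # in at least one (smaller values are better here).
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def skyline(areas):
    # Naive skyline: keep each area unless some other area dominates it.
    return {name: v for name, v in areas.items()
            if not any(dominates(w, v)
                       for other, w in areas.items() if other != name)}
```

With hypothetical (cruising time, cruising distance) pairs such as B = (3, 5), C = (4, 6), H = (4, 2) and I = (6, 1), area C is dominated by B and the skyline is {B, H, I}, matching the narrative above.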
Finally, a profitable-area query is defined as follows.
Consider Figure 14 again. Assume that a user sends his location (B) and current time (2016/12/18 10:11) to DISPAQ. DISPAQ computes candidate profitable areas from the taxi trip data in Table 2 and creates a profitability map as shown in Figure 15. After executing the profitable-area query, it returns areas B, H and I as the results, based on Definition 9. Figure 15 illustrates the relationships among the three terms: (1) profitable area, (2) profitability map and (3) the answers for profitable-area query processing. How to construct and utilize the profitability map is explained in the following subsection.

Retrieving Candidate Profitable Areas into a Profitability Map
After constructing the PQ-index, DISPAQ is ready to receive a user query that contains an area from the current location and a time period from the current time. A pair (area, time period) helps DISPAQ to efficiently retrieve candidate areas by exploiting the extended route summaries of the PQ-index. When DISPAQ builds a PQ-index, it pre-computes benefits of candidate areas by considering several factors and stores them in the extended route summaries. A set of candidate profitable areas is collected into a profitability map in our DISPAQ system. Note that the formal definitions of a profitable area and a profitability map are defined in Definition 7 and Definition 8, respectively.
Consider again Figure 15, which illustrates an example profitability map including several candidate profitable areas. Each profitable area maintains four factors (profit, passenger demand, cruising time, and cruising distance) as attributes, as explained in Section 4.

Refining Candidate Profitable Areas
A profitability map maintains a set of candidate profitable areas. However, not all areas included in the profitability map can be recommended to the taxi driver who sent the query to the system.
A refinement step removes dominated profitable areas from the profitability map based on the concept of skyline query processing. For this purpose, we propose a Z-Skyline method, which extends skyline processing with a Z-order space-filling curve.

Z-Order Values to Profitable Areas
As explained in Section 5.1, skyline query processing facilitates refining the candidate profitable areas in a profitability map. However, computing skylines over a whole dataset is expensive, since it requires comparing each element with all the other elements in the dataset, a comparison called a dominance test. Thus, to reduce the number of expensive dominance tests, a Z-order space-filling curve is utilized for computing skylines [25]. DISPAQ adopts skyline processing with Z-order, called Z-Skyline, as the basic algorithm for refining candidate profitable areas. Skyline query processing can be improved with two characteristics of a Z-order curve: (1) automatic clustering of the data and (2) monotonic ordering [25].
The first characteristic can be exploited if we consider the shared prefixes of the Z-order values of profitable areas. For example, profitable areas H, B, C, E and I could belong to one cluster because they have the same first two bits, "00". We call such a cluster a region.
Formally, a region can be defined as follows.
Definition 12. (Region) Region R_i is a set of profitable areas whose Z-order values share the same prefix.

Example 12. Figure 19 depicts a clustering example. Assume that we map profitable areas of two dimensions into four regions by considering the first two bits of their Z-order values. Then, profitable areas B, H, C, E and I are clustered into region R_1, whereas profitable area G is clustered into region R_4. Figure 19. A region example.
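Z-order values are obtained by interleaving the bits of the integer-quantized dimensions, and regions then correspond to shared leading prefixes. A minimal sketch (2D in the test, but the function generalizes to any dimensionality):

```python
def z_order(coords, bits=8):
    # Interleave the bits of each dimension, most significant first,
    # to produce the Z-order (Morton) value of the point.
    z = 0
    for b in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> b) & 1)
    return z

def region_id(z, prefix_bits, total_bits):
    # A region groups Z-order values that share the same leading prefix.
    return z >> (total_bits - prefix_bits)
```

Nearby points share long prefixes: in a 2-bit-per-dimension grid, the four points of the lower-left quadrant all fall into the region with prefix "00".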
The second characteristic of the Z-order curve, monotonic ordering, guarantees that a smaller dimensional position comes before a larger one. A profitable area with a small Z-order value is accessed before a profitable area with a large Z-order value, which means a dominating profitable area is always accessed before the areas it dominates. This removes unnecessary dominance tests and candidate re-examinations [25].

Example 13. Consider Figure 19 again. Region R_1 is accessed first, followed by regions R_2, R_3 and finally R_4. In region R_1, five profitable areas exist: B, C, E, H and I. Among these areas, H is accessed first, followed by B, I, C and E during skyline query processing.

Owing to these two characteristics of the Z-order curve, our Z-Skyline approach effectively minimizes the number of dominance tests during skyline processing. Automatic clustering enables DISPAQ to use efficient block-based dominance tests instead of pairwise dominance tests between profitable areas, and monotonic ordering prevents unnecessary candidate re-examinations. Thus, the distributed profitable-area query processing of DISPAQ is mainly based on the following lemma [25].

Lemma 1. Given two regions R_i and R_j, the following three cases can happen during the refining process for final profitable areas.
(1) All profitable areas in region R j are dominated by region R i .
(2) Some profitable areas in R j may be dominated by others in R i .
(3) All profitable areas in region R j are not dominated by region R i .

Proof.
We prove the lemma case by case. Let PA^i_max (PA^j_max) denote the profitable area with the maximum Z-order value in R_i (R_j), and PA^i_min (PA^j_min) the profitable area with the minimum Z-order value in R_i (R_j). For case (3), depicted in Figure 20c, we proceed by contradiction. Assume that profitable area PA_k ∈ R_i dominates profitable area PA_l ∈ R_j. Then the Z-order value of PA_k is smaller than that of PA_l. Since PA_k is chosen from R_i, the Z-order value of PA_k is larger than that of PA^i_min, and the Z-order value of PA_l is smaller than that of PA^j_max. Combining these statements, we conclude that the Z-order value of PA^i_min is smaller than that of PA^j_max, which contradicts the condition of this case.

Profitable-Area Query by Z-Skyline Method
We apply the Z-Skyline method to answer profitable-area queries over the candidate areas included in a profitability map. As explained before, after receiving a user query, DISPAQ constructs a profitability map by exploiting the PQ-index. Then, it calculates the Z-order values of the candidate profitable areas in the profitability map. A small Z-order value means that, with high probability, the profitable area dominates the other areas in all dimensions.
Algorithm 5 describes the proposed Z-skyline algorithm for answering a profitable-area query.
It begins by initializing the final profitable results and a set of regions (Lines 1 and 2). Z-order values of the candidate profitable areas in the profitability map are computed in Line 3, and a set of regions, SR, is calculated in Line 4. Then, the final profitable areas are obtained based on the three cases of Lemma 1 (Lines 5-17). In case 2, we merge the two sets of profitable areas in FP and r.pal and perform the dominance test again. Note that we can skip the dominance test in case 1.
To prove the correctness of Algorithm 5, we use the loop-invariant technique [61]. This approach examines the correctness of the algorithm in three loop stages: (1) initialization; (2) maintenance; and (3) termination. Thus, we can prove the correctness of the Z-skyline algorithm for refining profitable areas by following the loop-invariant verification method.
Theorem 1 (Correctness of the Z-skyline algorithm). The profitable-area query algorithm is correct with this loop invariant: for any step in a loop, the final profitable areas, FP, is a subset of non-dominated areas from a profitability map PM.

Proof. Initialization:
Before the iteration starts, FP is initially empty. A set of regions SR consists of pairs (Z-order value, list of profitable areas) constructed from PM by grouping profitable areas based on their Z-order values.
Maintenance: For each iteration, after checking the emptiness of FP, the algorithm handles three cases to determine whether the profitable areas of a region r become part of FP: • When FP is empty: the profitable areas of region r will be added to FP after invoking the dominance test.
Thus, FP contains non-dominated areas.
• When FP is not empty: Candidate profitable areas of region r should be handled based on the three cases in Lemma 1, which guarantees that only non-dominated areas will be added to FP.
Thus, FP contains only non-dominated areas in this case as well.
Termination: At the end of the iteration, FP contains a subset of non-dominated areas from profitability map PM.
Correctness: The loop invariant thus shows that the algorithm terminates and produces the correct results.

Example 14. We use the candidate profitable areas from Figure 15 to illustrate the algorithm. Figure 21 depicts the steps of the Z-skyline algorithm considering all four factors (attributes) of a profitable area. First, Z-order computation is applied to the candidate profitable areas included in PM_{B,[Friday 10:01, Friday 10:10]}; the Z-order values are presented in the table of Figure 21. Since area H has the smallest Z-order value, it is accessed first and becomes the initial skyline area, and region r_1038 becomes a skyline region. The algorithm then checks the conditions for each region. Area I (region r_1126) is the next accessed region; it is added to the skyline regions, which is the second case of Lemma 1. Later, area B (region r_3146) is compared with the pre-computed skyline areas (areas I and H) and is included in the final answers; this is the third case of Lemma 1. Next, area G (region r_4083) is dominated by area H and is not included in the final answers. After checking the other profitable areas, profitable areas H, I and B are returned as the final profitable areas.
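Because a dominating area always precedes the areas it dominates in Z-order, a single ascending scan suffices: an area joins the final set iff no already-accepted area dominates it, and accepted areas never need re-examination. A sketch using the example's Z-order values with hypothetical attribute vectors (all dimensions transformed so that smaller is better; the vectors are our illustration, not the paper's data):

```python
def dominates(a, b):
    # a dominates b: no worse everywhere, strictly better somewhere.
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def z_skyline(areas):
    # areas: name -> (z_value, attribute vector).
    fp = {}
    for name, (z, vals) in sorted(areas.items(), key=lambda kv: kv[1][0]):
        # Scan in ascending Z-order; any potential dominator of `vals`
        # has a smaller z and was already considered.
        if not any(dominates(kept, vals) for kept in fp.values()):
            fp[name] = vals
    return fp
```

With the z-values 1038 (H), 1126 (I), 3146 (B) and 4083 (G) from the example, and attribute vectors in which H dominates G but H, I and B are mutually non-dominated, the scan returns {H, I, B}, matching Example 14.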

A Distributed Z-Skyline Approach
Dealing with the huge volumes of taxi trip data from major urban cities requires a scalable approach using several commodity servers. For this purpose, we implemented distributed profitable-area query processing on top of the Apache Spark Core [22], a processing framework for distributed computing. Apache Spark supports parallel processing by dividing a whole job into several sub-processes and merging the sub-processes' intermediate results.
The distributed profitable-area query processing that utilizes the Z-skyline algorithm is divided into two steps: (1) a local Z-Skyline and (2) a global Z-Skyline. In the local Z-Skyline, all commodity servers of DISPAQ find local profitable areas via the Z-Skyline of Algorithm 5.
These intermediate local profitable areas are then merged on one commodity server by the global Z-Skyline computation. The results of the global Z-Skyline are the final profitable areas and are returned to the user. Note that the global Z-Skyline is also implemented with Algorithm 5. Figure 22 illustrates the distributed profitable-area query processing based on Spark. A client receives profitable-area query q from a user ( 1 ). The crucial part of the client is a driver that specifies the Spark configuration, such as the transformations and actions on RDDs. The driver sends the configuration and query q to the Spark master ( 2 ). Then, the Spark master sends the Spark configuration to all worker nodes ( 3 ) and the parameters of query q to a MongoDB master (mongos), which is located on one of the worker nodes ( 4 ). The MongoDB master sends query q to all shards ( 5 ).
A MongoDB shard first obtains the part of the PQ-index corresponding to user query q, divides the selected PQ-index into several RDDs (resilient distributed datasets), and sends them to an executor on the same node to reduce data movement among worker nodes ( 6 ). A commodity server of DISPAQ computes a profitability map from the loaded PQ-index of the executor and removes dominated areas via the local Z-Skyline ( 7 ). DISPAQ executes the global Z-Skyline to obtain the final profitable areas from these local Z-Skyline results ( 8 and 9 ). After completion of the global Z-Skyline processing, DISPAQ returns the final profitable areas as the results of query q ( 10 and 11 ).

Optimizing a Distributed Z-Skyline Approach
The efficiency of distributed profitable-area query processing depends on the local Z-Skyline, since the size of the intermediate results influences the performance of the global Z-Skyline. If a local Z-Skyline still retrieves a profitability map containing many candidate profitable areas, the global Z-Skyline may become the bottleneck of the whole process, because it must produce the final profitable areas by merging all the intermediate results.
At a local Z-skyline, each partition needs at least one killer area or region that removes most of the dominated areas. However, the intermediate profitability maps are built from random distribution of the profitable areas, which creates an unbalanced distribution of candidate profitable areas.
This happens because of the default shuffling behavior of Spark. Figure 23a depicts this case. The bottom partition has a profitability map containing a single killer region, whereas the top and middle partitions each require two regions to eliminate all dominated regions. In addition, the positions of these killer regions are not as good as the position of the killer region in the bottom partition. Thus, the global Z-Skyline must consider six candidate profitable areas to decide on the final profitable areas.
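To make the role of a killer region concrete, the following sketch shows how a single highly dominant area eliminates most candidates with one dominance test each. The two-attribute integer areas and method names are hypothetical, not taken from the DISPAQ code.

```java
import java.util.*;

// A "killer" area dominates most of the others, so testing each candidate
// against it first removes most candidates with a single comparison
// instead of a full pairwise dominance test.
public class KillerPruning {
    // Larger attribute values are better; strict improvement in at least
    // one attribute is required for dominance.
    static boolean dominates(int[] a, int[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) return false;
            if (a[i] > b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    // Returns only the candidates that survive pruning by the killer area.
    static List<int[]> pruneByKiller(int[] killer, List<int[]> candidates) {
        List<int[]> survivors = new ArrayList<>();
        for (int[] c : candidates) {
            if (!dominates(killer, c)) survivors.add(c); // one test per candidate
        }
        return survivors;
    }
}
```

A partition whose killer region sits near the "best corner" of the attribute space (like the bottom partition in Figure 23a) leaves very few survivors for the global step; a partition without such a region forwards many candidates.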
To optimize the local Z-Skyline process, we propose an optimized shuffling method that avoids an unbalanced profitable-area distribution. As explained in Section 5.3.1, the Z-Skyline has the property that dominating areas are always placed before the areas they dominate.
Thus, if we distribute the n areas with the smallest Z-values across the n partitions, killer areas will be spread evenly among the local Z-Skyline computations under this optimized shuffling method. Figure 23b presents the effect of optimized shuffling. Each partition has exactly one killer region and removes dominated areas more efficiently than a local Z-Skyline using random shuffling. Finally, the global Z-Skyline determines the three final profitable areas by examining only four candidate profitable areas.
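The optimized shuffling idea can be sketched as follows: compute a Z-order (Morton) value for each area from its grid coordinates, sort by Z-value, and hand the n smallest-Z areas to the n partitions, one each, so every partition starts with a likely killer area. The 16-bit grid coordinates, the round-robin assignment of the remaining areas, and all names are simplifying assumptions, not the exact DISPAQ procedure.

```java
import java.util.*;

public class ZShuffle {
    // Interleave the low 16 bits of x and y into a Z-order (Morton) value:
    // bit i of x goes to position 2i, bit i of y to position 2i+1.
    static long zValue(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= (long) (x >> i & 1) << (2 * i);
            z |= (long) (y >> i & 1) << (2 * i + 1);
        }
        return z;
    }

    // areas.get(i) = {x, y}; returns the per-partition assignment lists.
    static List<List<int[]>> shuffle(List<int[]> areas, int n) {
        List<int[]> sorted = new ArrayList<>(areas);
        sorted.sort(Comparator.comparingLong((int[] a) -> zValue(a[0], a[1])));
        List<List<int[]>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new ArrayList<>());
        // The n smallest Z-values land in n distinct partitions;
        // the remaining areas follow in round-robin order.
        for (int i = 0; i < sorted.size(); i++) partitions.get(i % n).add(sorted.get(i));
        return partitions;
    }
}
```

Because Z-order places dominating areas before the areas they dominate, seeding every partition with a small-Z area gives each local Z-Skyline an early killer candidate, which is what balances the pruning power across partitions in Figure 23b.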

Complexity Analysis of Distributed Profitable-Area Query Processing
As explained in Section 5.4, distributed profitable-area query processing involves three phases: (1) constructing a profitability map from a selected PQ-index, (2) refining candidate areas with the local Z-Skyline algorithm, and (3) merging the local skyline results to obtain the global answers, i.e., the final profitable areas.
Suppose that, for a given profitable-area query, the distributed algorithm is executed using N nodes.
We denote the size of the PQ-index as |PQ_idx| and the size of the local skyline results as |LZSky(PQ_idx)|.
Equation (11) describes the complexity of profitable-area query processing, T_PAQ, where T_PM is the average run time to construct the profitability map, T_LZSky is the average run time to execute the local Z-Skyline, and T_GZSky is the average time to perform the global Z-Skyline.
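Equation (11) itself is not reproduced in this excerpt. Based on the three-phase decomposition above, its overall shape is presumably a sum of the three phase costs (this form is our assumption, not a transcription of the original equation):

```latex
T_{PAQ} = T_{PM} + T_{LZSky} + T_{GZSky}
```

where the profitability-map and local Z-Skyline phases run in parallel over the N nodes (each node handling roughly |PQ_idx|/N of the index), while T_GZSky depends on the merged input of size at most N · |LZSky(PQ_idx)| collected on a single server; the original equation may include additional constant factors.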
Distributed profitable-area query processing performs better when a killer region exists in the profitability map and the profitable areas in each region are not dense. This is mainly because the killer region enables the algorithm to avoid unnecessary dominance tests, and fewer profitable areas in each region reduce the number of pairwise comparisons when invoking a dominance test.

Experimental Evaluation
In this section, we present a comprehensive performance evaluation of DISPAQ on two real datasets from New York and Chicago, with about 376 million records and 79 million records of taxi trip information, respectively.

Experimental Setup
We implemented our DISPAQ system in Java using Java Development Kit (JDK) version 1.7.
Spark 1.5 was used as the distributed processing framework, and MongoDB 3.2 was chosen as the data storage for the PQ-index. All experiments were conducted on commodity machines equipped with an Intel Core i3-6100 3.2 GHz CPU and 8 GB of memory, running the 64-bit Ubuntu 16.04 operating system. A total of 5 machines were used as distributed processing clusters for Spark and as data storage nodes for MongoDB. To obtain sound and reliable experimental results, we repeated every test 10 times and averaged the results over all repetitions.

Dataset
We used two real taxi trip datasets, from New York [62] and Chicago [63], in our experiments.
In subsequent discussions, these datasets will be referred to as the "NewYork dataset" and the "Chicago dataset", respectively. The NewYork dataset consists of more than 100 million taxi rides, dating back to 2013, with an average of 300 MB of data for each month. Compared with the NewYork dataset, the Chicago dataset has several characteristics intended to avoid privacy issues. First, the pickup and drop-off times are rounded to the nearest quarter of an hour.
Second, the coordinates of each trip are represented as the center coordinates of a census tract and community area. Third, relatively infrequent taxi trips were removed, and only frequent taxi trips are maintained in this dataset. Thus, we used different time intervals, as explained in Equation (1).

Experimental Results
In this subsection, we analyze the performance of DISPAQ and compare its profitable-area query processing method with existing approaches. Our goal is to show that DISPAQ organizes taxi trip data into a PQ-index effectively and executes profitable-area queries efficiently in a distributed way.

PQ-Index Construction
To understand the efficiency of DISPAQ, we evaluated the performance of distributed PQ-index construction. To evaluate indexing performance, we measured (a) the elapsed time of each sub-process and the total wall-clock time to build a PQ-index, and (b) the size of the PQ-index while varying the input data size. The length of the geocode for an area was set to 6, and the length of a time interval was fixed at 10 min for the NewYork dataset and at 15 min for the Chicago dataset, according to Equation (1).
First, to demonstrate the scalability of DISPAQ's distributed approach, we measured the execution time for constructing a PQ-index by varying the number of commodity servers between 1 and 5.
We used 12 months of taxi trips for this experiment. Figure 24 describes the results of constructing the PQ-index in a distributed way. As expected, the overall time decreases with the number of commodity servers. We observe that the processing times decrease linearly from 1 node to 5 nodes. For example, the total execution time for the NewYork dataset dropped from 205 min to 47 min, whereas the total execution time for the Chicago dataset decreased from 21.7 min to 4.71 min. In other words, the processing time with n commodity servers is almost 1/n of the processing time with a single node. This result corresponds to the complexity cost in Equation (8) and shows that DISPAQ inherits the scalability of the underlying Spark framework. Another observation is that building area summaries and building route summaries are the most time-consuming processes when many repeated taxi trips exist.

Next, we conducted experiments by varying the data range from 6 months to 30 months in steps of 6 months. The number of machines was fixed at 4. Figure 25a,b summarize the performance of creating a PQ-index on both datasets. As expected, the total execution time increases as we increase the size of the dataset by varying the number of months. This is mainly because the number of taxi trips also increases with the dataset size. Another observation is that constructing the PQ-index from the NewYork dataset takes much longer than constructing the PQ-index from the Chicago dataset.
For example, when DISPAQ constructs a PQ-index from 30 months of taxi trips, it takes about 150 min for the NewYork dataset and about 6 min for the Chicago dataset. The sizes and qualities of the two datasets account for this wide difference. Tables 3 and 4 show the distributed data flow among the sub-processes explained in Algorithm 1.
Since the information extraction process removes unnecessary attributes from taxi trips, the size of a shuffle write is almost half of the input data size. DISPAQ constructs area summaries and route summaries from the same input data; thus, the sizes of the shuffle reads for both sub-processes are the same. Since extended route summaries are combinations of area summaries and route summaries, the sub-process for computing extended route summaries generates an output bigger than its input. The final merging-summaries sub-process writes the whole PQ-index to the disks of the commodity servers.

Figure 26 shows the effective reduction in storage consumption by comparing the size of the PQ-index with that of the raw taxi dataset. The gap in size widens as we increase the data size from 6 months to 30 months. This is because we removed unused information and aggregated repeated taxi trips into route summaries and area summaries. Another observation is that the PQ-index from the NewYork dataset in Figure 26a is much bigger than the PQ-index from the Chicago dataset in Figure 26b. The main reason is that the Chicago dataset avoids privacy issues by rounding pickup and drop-off times to the nearest quarter of an hour and by grouping the GPS coordinates of pickup and drop-off events.

Distributed Query Processing
In this subsection, we analyze the query performance of the DISPAQ system that utilizes a PQ-index.

Query Performance
As explained in Section 5, we implemented distributed profitable-area query processing in two modes: the basic Z-Skyline and the optimized Z-Skyline. We also implemented distributed profitable-area query processing based on the traditional skyline approach in two modes: block-nested loop (BNL) and divide-and-conquer (DC). We randomly chose 10 different profitable-area queries and measured the execution times of these four methods. Figure 27 presents the profitable-area query performance. For these experiments, we used 3 machines and varied the data sizes between 6 months and 30 months. Figure 27a,b show the results for the NewYork dataset. We observe that query execution times during rush hour are longer than during normal times. Another observation is that the execution times remain almost stable even though the size of the dataset increases with the number of months. This means that the total size of the raw taxi trip data does not affect query execution time. In all cases, the optimized Z-Skyline and the basic Z-Skyline methods show better performance than the BNL and DC methods. A similar trend is seen in Figure 27c,d for the Chicago dataset.

Next, we report the evaluation results of the four methods when varying the number of commodity servers. Figure 28 shows the results. As expected, query execution time decreases with the number of commodity servers. Among the four methods, the optimized Z-Skyline shows the best performance, followed by the basic Z-Skyline and DC. This result indicates the effectiveness of distributed profitable-area query processing based on the Z-Skyline method.

Local Z-Skyline Optimization
In the optimized Z-Skyline, we maximize the number of local Z-Skyline results that qualify for the global Z-Skyline and minimize the number of areas that will be dominated in the global Z-Skyline. We use Equation (12) from Chen et al. [55] to measure the optimality of the local Z-Skyline, where N is the number of nodes, sky_i is the set of local Z-Skyline results (candidate profitable areas) on node i, and sky_global is the set of global Z-Skyline results (final profitable areas).
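Equation (12) itself is not reproduced in this excerpt. A form consistent with the description, measuring the fraction of each node's local results that survive the global Z-Skyline and normalizing by the number of nodes N, would be (this exact expression is our assumption, not a transcription of Chen et al. [55]):

```latex
optimality = \frac{1}{N} \sum_{i=1}^{N} \frac{|sky_i \cap sky_{global}|}{|sky_i|}
```

Under such a form, a higher value means the local Z-Skylines forward fewer areas that are later discarded globally, which matches how the metric is interpreted in Figure 29.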
The optimality values for both methods are depicted in Figure 29. Figure 29a shows the results when we fixed the number of machines at 3 and varied the sizes of the datasets. The value for the optimized Z-Skyline is always higher than that for the basic Z-Skyline. The optimality values of the optimized Z-Skyline are lowest with 6 months of data; they then increase and become stable after 12 months of data. When we use only 6 months of data, the killer regions on each node are not good enough to dominate the other profitable areas, which must then be removed during the global Z-Skyline.
Another observation is that the optimality value of the basic Z-Skyline is not stable. This is mainly because the basic Z-Skyline uses a random distribution when distributing regions to the nodes.

Figure 29b illustrates the results when we fixed the size of the dataset and varied the number of nodes from 2 to 5. We skipped the optimality value for a single node because it is not meaningful in that case. Since the optimality value is computed by dividing by the number of nodes, as defined in Equation (12), the optimality values decrease with the number of nodes. However, the optimality values for the optimized Z-Skyline are always higher than those for the basic Z-Skyline.

Conclusions
In this paper, we address the problem of efficiently retrieving profitable areas from huge volumes of taxi trip data in response to user queries. We implement a distributed profitable-area query processing system, called DISPAQ, by employing Spark and MongoDB. To efficiently obtain candidate profitable areas, DISPAQ constructs a hash-based spatio-temporal index, the PQ-index, which maintains information on profitable areas extracted from raw taxi trip data. DISPAQ utilizes a Z-Skyline algorithm that considers multiple attributes to refine the candidate profitable areas. The PQ-index and the Z-Skyline algorithm enable DISPAQ to limit the search space and avoid pairwise dominance tests among profitable areas during query processing. We also propose an optimization scheme for the Z-Skyline algorithm, which efficiently prunes multiple blocks during query processing by distributing killer regions to each node. Performance evaluations on two real datasets confirm that DISPAQ provides a scalable and efficient distributed solution for indexing huge volumes of taxi trip data and processing profitable-area queries over them.