Open Access Article

Peer-Review Record

THBase: A Coprocessor-Based Scheme for Big Trajectory Data Management

Future Internet 2019, 11(1), 10; https://doi.org/10.3390/fi11010010

by Jiwei Qin, Liangli Ma^* and Jinghua Niu

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 5 November 2018 / Revised: 12 December 2018 / Accepted: 27 December 2018 / Published: 3 January 2019

(This article belongs to the Special Issue Big Data Processing and Analytics in the Era of Extreme Connectivity and Automation)

Round 1

Reviewer 1 Report

1. In other methods, the trajectory segments are grouped according to the spatial adjacency or temporal order of the segments, while the T-table groups them based on the object identifier. According to this paper, this method is effective for data transmission and refining results. However, we should access more HBase Regions and HFiles for query processing because each candidate segment is scattered across different HBase Regions. In spatio-temporal query processing, is the data transmission more important than disk I/O in the overall system?

2. G-index divides the entire space into grid cells and recursively divides each cell if there is a hot spot. In this method, one segment may be redundantly stored in several cells, or sometimes the split threshold is ignored. It would be better to describe why it is better to use a G-index instead of the Quad-tree.

3. Experimental results show that co-processor based algorithms are more efficient. It would be better to explain why the data structures and indexes of the other two methods implemented on HBase are not appropriate to exploit the co-processor mechanism.

4. I'm wondering how the experiment adjusted the size of the raw data. Are the period and the number of moving objects the same in small and large databases? Are the time intervals of the indexes the same?

5. Some typos and text editing are necessary.

- The sentences of line 35 and line 36 were duplicated.

- The sentences in the lines 198-200 and 201-203 should be written more concisely and clearly.

Author Response

Response to Reviewer 1 Comments

Point 1: In other methods, the trajectory segments are grouped according to the spatial adjacency or temporal order of the segments, while the T-table groups them based on the object identifier. According to this paper, this method is effective for data transmission and refining results. However, we should access more HBase Regions and HFiles for query processing because each candidate segment is scattered across different HBase Regions. In spatio-temporal query processing, is the data transmission more important than disk I/O in the overall system?

Response 1: We think that disk I/O overhead is also a very important factor. Whether disk I/O or data transmission has the greater impact on query efficiency depends on the query conditions. In most cases, we think the I/O overhead of THBase is acceptable for two reasons:

(1) The spatio-temporal query algorithms of THBase are implemented by parallel random access to many Regions. When accessing a Region, we do not need to read all of its data, thanks to the hierarchical index structure of HFile. Therefore, when processing a spatio-temporal query, for each HFile, except for the hundreds of kilobytes of data blocks in the "Load-on-open Section" that must be read, which other data blocks are read is determined by the query rowkeys. In general, the disk I/O overhead in the overall system is acceptable. However, if T-table has too many Regions across nodes, the cost of reading the data blocks in the "Load-on-open Section" may be very high. If only a few results are returned in the end, the benefits of co-location might not outweigh the disk I/O costs.

(2) All data of the same moving object in T-table is stored in contiguous rows. In the query process, we sort the candidate rowkeys after obtaining them through L-index. This avoids sorting the final trajectory point results and makes better use of the block cache mechanism of HBase: adjacent rows are more likely to lie in the same data block, so the block does not have to be loaded from disk again. This data distribution mechanism is more effective for queries with a long time condition.
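The effect of sorting the candidate rowkeys, described in reason (2), can be illustrated with a toy model. The sketch below is not THBase or HBase code. It assumes a hypothetical layout in which rows are packed into fixed-size data blocks in rowkey order, and a cache that keeps only the most recently read block. The names ROWS_PER_BLOCK, block_of and count_block_loads are made up for this illustration.

```python
from typing import Iterable, Optional

ROWS_PER_BLOCK = 4  # assumption: every data block holds the same number of rows


def block_of(row_position: int) -> int:
    """Return the hypothetical data block that holds the row at this sorted position."""
    return row_position // ROWS_PER_BLOCK


def count_block_loads(row_positions: Iterable[int]) -> int:
    """Count block loads for point reads when only the last block read stays cached."""
    loads = 0
    cached: Optional[int] = None
    for position in row_positions:
        block = block_of(position)
        if block != cached:  # cache miss: the block must be loaded from disk
            loads += 1
            cached = block
    return loads


# Candidate rows as an index might return them, in no particular order.
candidates = [9, 0, 10, 1, 8, 2, 11, 3]
unsorted_loads = count_block_loads(candidates)
sorted_loads = count_block_loads(sorted(candidates))
```

In this model, reading the candidates in index order alternates between two blocks on every read, while reading them in sorted order touches each block once. A real block cache holds many blocks, so the gap would usually be smaller than in this single-block model.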

We think there are two reasons why other methods group the trajectory segments by spatial adjacency or temporal order:

(1) The differences in query definition. For range queries, some schemes (the scheme proposed in reference [22] of our paper, RM-HBase, etc.) directly output results in the form of segments or points without merging them into whole trajectories. Therefore, there is no need to consider the data transmission overhead caused by the merging operation. In this case, the above grouping mechanism is undoubtedly efficient.

(2) The influence of the parallel computing framework. In order to ensure efficiency, many methods process queries through parallel computing frameworks such as MapReduce and Spark. There are many spatio-temporal data or trajectory data management systems that combine MapReduce and HBase, such as HST and RM-HBase. However, MapReduce usually accesses HBase tables only through the Scan interface, that is, all data in the scan range must be read. To make full use of this feature, it is a good idea to store data that is relatively close in spatio-temporal distance in one Region, so as to avoid accessing irrelevant data as much as possible.

Point 2: G-index divides the entire space into grid cells and recursively divides each cell if there is a hot spot. In this method, one segment may be redundantly stored in several cells, or sometimes the split threshold is ignored. It would be better to describe why it is better to use a G-index instead of the Quad-tree.

Response 2: Thanks for your suggestion. We have added a description in section 5.1 that compares G-index with two Quad-tree structures.
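The behaviour the reviewer describes in Point 2 (recursive subdivision of crowded cells, redundant storage of a segment in several cells, and a split threshold that can be exceeded) can be shown with a small sketch. This is not the authors' G-index: the class Cell, the parameters threshold and max_depth, and the rule "split when a cell holds more than threshold segments" are assumptions chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Box, b: Box) -> bool:
    """True if two closed axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Cell:
    box: Box
    depth: int = 0
    items: List[Tuple[str, Box]] = field(default_factory=list)
    children: List["Cell"] = field(default_factory=list)

    def insert(self, seg_id: str, mbr: Box, threshold: int, max_depth: int) -> None:
        """Store a segment (by bounding box) in every leaf cell it intersects."""
        if self.children:
            for child in self.children:
                if intersects(child.box, mbr):
                    child.insert(seg_id, mbr, threshold, max_depth)
            return
        self.items.append((seg_id, mbr))
        # At max_depth the cell keeps growing, so the threshold can be exceeded.
        if len(self.items) > threshold and self.depth < max_depth:
            self._split(threshold, max_depth)

    def _split(self, threshold: int, max_depth: int) -> None:
        """Divide this cell into four quadrants and redistribute its segments."""
        xmin, ymin, xmax, ymax = self.box
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        quadrants = [(xmin, ymin, xmid, ymid), (xmid, ymin, xmax, ymid),
                     (xmin, ymid, xmid, ymax), (xmid, ymid, xmax, ymax)]
        self.children = [Cell(q, self.depth + 1) for q in quadrants]
        pending, self.items = self.items, []
        for seg_id, mbr in pending:
            self.insert(seg_id, mbr, threshold, max_depth)

    def leaves(self) -> List["Cell"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def copies_of(root: Cell, seg_id: str) -> int:
    """Count how many leaf cells store the given segment."""
    return sum(1 for leaf in root.leaves() for sid, _ in leaf.items if sid == seg_id)


root = Cell((0.0, 0.0, 8.0, 8.0))
for seg_id, mbr in [("a", (1, 1, 2, 2)), ("b", (5, 5, 6, 6)), ("c", (3, 3, 5, 5))]:
    root.insert(seg_id, mbr, threshold=2, max_depth=2)
```

Here segment "c" straddles the centre of the root cell, so after the split it is stored in all four child cells, while "a" and "b" are each stored once.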

Point 3: Experimental results show that co-processor based algorithms are more efficient. It would be better to explain why the data structures and indexes of the other two methods implemented on HBase are not appropriate to exploit the co-processor mechanism.

Response 3: As you suggested, we have added a description in section 7 explaining why the data structures and indexes of the other two methods implemented on HBase are not appropriate to exploit the co-processor mechanism.

Point 4: I'm wondering how the experiment adjusted the size of the raw data. Are the period and the number of moving objects the same in small and large databases? Are the time intervals of the indexes the same?

Response 4: We split the raw data into chronological sets (day 1 to day 7, day 7 to day 13, etc.), each of which is about 40 GB. In our experiment, we load each chronological set into HBase in turn, so the period differs between the small and large databases. According to the number of MOIDs counted, the number of moving objects also differs. After loading the first 40 GB, the number of moving objects is 170,492, and after loading the last set, the number is 204,805. Because the data size per day is roughly equal in this dataset, the time period of T-index is fixed at 24 hours. When applying THBase to other datasets, the time period of T-index can be variable.
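A fixed time period, as described in Response 4, amounts to mapping each timestamp to the number of whole periods elapsed since the start of the dataset. The sketch below only illustrates that arithmetic. The function period_index, the origin date and the 12-hour alternative are assumptions and are not taken from the manuscript.

```python
from datetime import datetime, timedelta


def period_index(ts: datetime, origin: datetime,
                 period: timedelta = timedelta(hours=24)) -> int:
    """Return the index of the time period that contains ts, counted from origin."""
    if ts < origin:
        raise ValueError("timestamp precedes the start of the dataset")
    return (ts - origin) // period  # timedelta // timedelta yields an int


origin = datetime(2018, 1, 1)  # assumption: arbitrary start of the dataset
first_day = period_index(datetime(2018, 1, 1, 23, 59), origin)
third_day = period_index(datetime(2018, 1, 3, 0, 0), origin)
half_day = period_index(datetime(2018, 1, 1, 13, 0), origin, timedelta(hours=12))
```

Passing a different period, as in the last call, corresponds to the variable time period mentioned for other datasets.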

Point 5: Some typos and text editing are necessary.

- The sentences of line 35 and line 36 were duplicated.

- The sentences in the lines 198-200 and 201-203 should be written more concisely and clearly.

Response 5: We are very sorry for our incorrect and unclear writing. We have deleted or re-written the related sentences.

Reviewer 2 Report

The paper is missing a related research section. In order to show the contribution of this paper more clearly, the authors should provide a comparison to other existing approaches published in the literature (also taking into account fields different from trajectory studies), and explain better the advantages of the proposal.

The paper could also benefit by taking into account the following minor comments:

a) The authors should better clarify the following sentences:

"However, the time attribute is not considered in the schemes for spatial data management, and time conditions cannot be utilized in query process. While the schemes for spatio-temporal data management consider the time attribute, but they ignore those important non-spatiotemporal attributes such as MOID (Moving Object Identifier), and it is difficult to achieve efficient query for such attributes [9]. "

In addition, a reference should be added to the first sentence to enforce it.

b) In the following sentence:

" it imports a local indexing structure for each Region through Observer coprocessor"

Region and Observer are mentioned but they are not explained, since they are concepts of HBase that are explained later.

c) In the introduction, the data transmission problem is not clearly introduced. Authors should better motivate the reason behind the study.

d) In the following sentence:

"In Euclidean Space with n dimension, Given a coordinate point"

"Given" should be written in lower case

Also in the following sentence:

" In addition, For the purpose" "For" should be in lower case.

e) In the following sentence:

"In order to avoid distributing the data of the same MO to different Regions after split, we use the prefix split policy [15] that splits rowkey range by an inversing MOID value. This ensures that rows with the same prefix locate in the same Region after the split, namely all data of the same MO is still stored in a Region."

It is not clear why it is ensured that rows with the same prefix locate in the same Region after the split. Authors should better motivate this statement.

Author Response

Response to Reviewer 2 Comments

Point 1: The paper is missing a related research section. In order to show the contribution of this paper more clearly, the authors should provide a comparison to other existing approaches published in the literature (also taking into account fields different from trajectory studies), and explain better the advantages of the proposal.

Response 1: Thanks for your suggestion. We have added the related research work in section 2.3, which introduces the existing approaches to distributed spatial data management, distributed spatio-temporal data management and distributed trajectory data management.

Point 2: The authors should better clarify the following sentences:

"However, the time attribute is not considered in the schemes for spatial data management, and time conditions cannot be utilized in query process. While the schemes for spatio-temporal data management consider the time attribute, but they ignore those important non-spatiotemporal attributes such as MOID (Moving Object Identifier), and it is difficult to achieve efficient query for such attributes [9]. "

In addition, a reference should be added to the first sentence to enforce it.

Response 2: As you suggested, we have re-written the related sentences and added a related reference for the first sentence.

Point 3: In the following sentence:

" it imports a local indexing structure for each Region through Observer coprocessor"

Region and Observer are mentioned but they are not explained, since they are concepts of HBase that are explained later.

Response 3: Thanks for your advice. We have added a brief introduction to HBase, Region and coprocessor in section 1.

Point 4: In the introduction, the data transmission problem is not clearly introduced. Authors should better motivate the reason behind the study.

Response 4: In order to introduce the data transmission problem clearly, we have redescribed it in section 1 from the following two aspects: (1) the isolation of index and data; (2) the data of the same moving object is scattered across different nodes.

Point 5: In the following sentence:

"In Euclidean Space with n dimension, Given a coordinate point"

"Given" should be written in lower case

Also in the following sentence:

" In addition, For the purpose" "For" should be in lower case.

Response 5: We are very sorry for our incorrect writing. We have modified the related sentences.

Point 6: In the following sentence:

"In order to avoid distributing the data of the same MO to different Regions after split, we use the prefix split policy [15] that splits rowkey range by an inversing MOID value. This ensures that rows with the same prefix locate in the same Region after the split, namely all data of the same MO is still stored in a Region."

It is not clear why it is ensured that rows with the same prefix locate in the same Region after the split. Authors should better motivate this statement.

Response 6: As you suggested, we have redescribed the prefix split policy by using a clearer statement in section 4.
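The guarantee questioned in Point 6 follows from where a prefix-based policy is allowed to place the split point. The sketch below is a simplified model and not the authors' implementation. It assumes a hypothetical rowkey made of a fixed-width reversed MOID followed by a zero-padded timestamp, and a split policy that truncates the candidate split key to the prefix length. HBase ships a policy of this kind (KeyPrefixRegionSplitPolicy), but whether that is the policy cited as [15] is not stated in this record.

```python
from typing import List, Tuple

PREFIX_LEN = 8  # assumption: MOID is padded to a fixed width before reversal


def make_rowkey(moid: int, ts: int) -> str:
    """Build a hypothetical rowkey: reversed fixed-width MOID, then timestamp."""
    prefix = str(moid).zfill(PREFIX_LEN)[::-1]
    return f"{prefix}{ts:010d}"


def prefix_split_point(candidate: str) -> str:
    """Truncate a candidate split key so the split falls on a prefix boundary."""
    return candidate[:PREFIX_LEN]


def split_region(rowkeys: List[str], split_point: str) -> Tuple[List[str], List[str]]:
    """Split sorted rowkeys into [start, split_point) and [split_point, end)."""
    left = [k for k in rowkeys if k < split_point]
    right = [k for k in rowkeys if k >= split_point]
    return left, right


keys = sorted(make_rowkey(m, t) for m in (12, 13, 21) for t in (1, 2, 3))
candidate = keys[len(keys) // 2]  # a midpoint key, as a size-based split might pick

# Splitting at the raw midpoint separates rows of one moving object.
naive_left, naive_right = split_region(keys, candidate)

# Splitting at the truncated key keeps every prefix on one side.
left, right = split_region(keys, prefix_split_point(candidate))
```

Because the truncated split point is exactly a prefix, every rowkey that starts with that prefix compares greater than or equal to it, and every rowkey with a smaller prefix compares less than it. All rows of one moving object therefore land in the same daughter Region.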

Round 2

Reviewer 1 Report

1. As mentioned in the author’s response 1, the range and kNN query definitions given in this paper are different from those given in other studies cited in this paper. In order to avoid these confusions, it would be better to briefly describe these differences of the range and kNN queries or to provide examples before discussing the data transmission problem in the introduction section.

2. In the definitions 4 and 5, please make sure the reference number 16 is correct.

3. Please check if the following sentences are grammatically correct.

“…each grid cell should be subdivided by quad-tree …” at line 271.

“It is difficult to implement an efficient shuffling algorithm by coprocessor mechanism to merger sub-trajectories into whole trajectories.” at line 472-474.

4. On Line 418, Lemma 1 is written alone without any other theorem or proof. I think the author should prove it or refer to other paper if it is already proven in other research.

Author Response

Point 1: As mentioned in the author’s response 1, the range and kNN query definitions given in this paper are different from those given in other studies cited in this paper. In order to avoid these confusions, it would be better to briefly describe these differences of the range and kNN queries or to provide examples before discussing the data transmission problem in the introduction section.  

Response 1: Thanks for your suggestion, we have added the related description in introduction section for illustrating these differences in query processing.

Point 2: In the definitions 4 and 5, please make sure the reference number 16 is correct.

Response 2: We are very sorry for our incorrect citing. We have modified the wrong references.

Point 3: Please check if the following sentences are grammatically correct.

“…each grid cell should be subdivided by quad-tree …” at line 271.

“It is difficult to implement an efficient shuffling algorithm by coprocessor mechanism to merger sub-trajectories into whole trajectories.” at line 472-474.

Response 3: We are very sorry for our incorrect writing, and we have re-written the related sentences.

Point 4: On Line 418, Lemma 1 is written alone without any other theorem or proof. I think the author should prove it or refer to other paper if it is already proven in other research.

Response 4: As you suggested, we have cited a related reference for it.

Reviewer 2 Report

The authors addressed all the reviewers' comments and, for this reason, it is proposed to accept the article in its present form.

Author Response

Point 1: The authors addressed all the reviewers' comments and, for this reason, it is proposed to accept the article in its present form.

Response 1: We thank the reviewers for their good comments and hard work.
