Peer-Review Record

SAHA: A String Adaptive Hash Table for Analytical Databases

Appl. Sci. 2020, 10(6), 1915; https://doi.org/10.3390/app10061915
by Tianqi Zheng, Zhibin Zhang and Xueqi Cheng
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 3 February 2020 / Revised: 29 February 2020 / Accepted: 9 March 2020 / Published: 11 March 2020
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

A very interesting approach to using hash tables in databases.
The idea is presented in detail and illustrated with diagrams
and figures.

Author Response

Thank you very much for reviewing.

Reviewer 2 Report

Referee Report

Title: SAHA: A String Adaptive Hash Table for Analytical Databases

Manuscript ID: applsci-723336

By Zheng et al.

Submitted to Applied Sciences


Comment

This work proposes a hybrid hash table called the String Adaptive Hash Table (SAHA) to store strings. SAHA takes advantage of string length dispatching and saves the hash values of long string keys to avoid re-computation (a minimal sketch of this dispatching follows the comments below). From the comparison with other methods, the authors found that SAHA is almost 100% faster than the second best. This work is very well organized and prepared. It is a serious work and I only have some minor comments:


  1. Figure 10: It can be seen that for the Weibo data, SAHA did not have the lowest memory consumption; HATTrie did, though it may take longer.
  2. Figure 11: Can the authors explain in detail the difference between SAHA and SAHA prehash? In what situations should we use SAHA, and in what situations should we use SAHA prehash?
  3. SAHA is related to the efficiency of the CPU and memory. Did the authors try different CPUs or memory (i.e., hardware) to test their method?
  4. From the results, all tables finished the tasks within a few seconds. Therefore, improving the efficiency only makes each task faster by one to two seconds. Please discuss whether this improvement is significant in real applications.
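
For concreteness, the length dispatching summarized above could look roughly like the following sketch; the slot layout, names, and 16-byte inline threshold are illustrative assumptions rather than the authors' actual implementation:

    #include <cstdint>
    #include <cstring>
    #include <string_view>

    // Illustrative sketch of length dispatching (hypothetical names, not
    // the paper's code). Short keys live inline in the slot; long keys
    // keep a pointer plus their saved hash so it is never recomputed.
    struct Slot {
        uint32_t len = 0;
        bool is_long = false;
        union Rep {
            char inline_key[16];          // short string, stored directly
            struct Long {
                const char *data;         // long string, stored out of line
                uint64_t saved_hash;      // cached hash value
            } longs;
        } rep;
    };

    // Dispatch on key length: short strings need no hashing metadata in
    // the slot, while long strings pay for exactly one hash computation.
    inline void store_key(Slot &s, std::string_view key, uint64_t hash) {
        s.len = static_cast<uint32_t>(key.size());
        if (key.size() <= sizeof(s.rep.inline_key)) {
            s.is_long = false;
            std::memcpy(s.rep.inline_key, key.data(), key.size());
        } else {
            s.is_long = true;
            s.rep.longs = Slot::Rep::Long{key.data(), hash};
        }
    }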

Author Response

Thank you very much for reviewing. I've revised the manuscript according to all the comments. All revisions are either highlighted or surrounded by tags <X.Y>, which can help navigate to the added content related to a given comment.

Following is the cover letter:

1. English language and style are fine/minor spell check required

I've carefully examined all spelling- and grammar-related errors and highlighted the related modifications.

2. Figure 10: It can be seen that for the Weibo data, SAHA did not have the lowest memory consumption; HATTrie did, though it may take longer.

I've added a description of this fact, marked in <2.2>.

3. Figure 11: Can the authors explain in detail the difference between SAHA and SAHA prehash? In what situations should we use SAHA, and in what situations should we use SAHA prehash?

In general, we should always use SAHA with pre-hashing, since it improves the hash table's performance on long string datasets by a large margin, with only a negligible regression on short string datasets. I've highlighted the advantages of the pre-hashing technique and added a conclusion, marked in <2.3>.
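
As a rough sketch of what pre-hashing buys (hypothetical names, not the code merged into ClickHouse): each long key carries the hash computed once at insertion, so probes can compare cached hashes before touching string bytes, and a table resize never re-reads the strings:

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // Each long key stores the hash computed once, at insertion time.
    struct PrehashedKey {
        std::string key;
        uint64_t hash;
    };

    inline bool keys_equal(const PrehashedKey &a, const PrehashedKey &b) {
        // The cheap 8-byte comparison filters out almost all mismatches
        // before the expensive byte-wise comparison of long strings.
        return a.hash == b.hash && a.key == b.key;
    }

    inline std::size_t rehash_bucket(const PrehashedKey &k,
                                     std::size_t new_capacity) {
        // On growth, the new bucket comes from the cached hash, so the
        // (cold) string data is never touched; capacity is assumed to be
        // a power of two here.
        return k.hash & (new_capacity - 1);
    }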

4. SAHA is related to the efficiency of the CPU and memory. Did the authors try different CPUs or memory (i.e., hardware) to test their method?

We didn't test against different hardware configurations because the optimizations we use are quite general. For example, string inlining should work for all kinds of memory; aligned memory loading targets vectorized CPUs, and almost all modern processors support vectorization; and pre-hashing reduces cache usage, where in-core CPU cache is still precious due to multicore architectures. Moreover, SAHA is already merged into the ClickHouse database and its efficiency has been verified on many different production machines.
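
The word-wise loading of short strings mentioned here can be sketched portably as follows; memcpy is only a safe stand-in for the aligned, page-guarded wide loads a real implementation would use, and the names are illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // A short string is loaded into two 64-bit words, so equality checks
    // and hashing run on registers instead of byte loops.
    struct InlineKey {
        uint64_t w0 = 0, w1 = 0;
    };

    inline InlineKey load_short(const char *p, std::size_t len) {
        // Requires len <= 16. Compilers lower small memcpys to wide loads.
        InlineKey k;
        std::memcpy(&k, p, len);
        return k;
    }

    inline bool keys_equal(InlineKey a, InlineKey b) {
        return a.w0 == b.w0 && a.w1 == b.w1;  // two register comparisons
    }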

5. From the results, all tables finished the tasks within a few seconds. Therefore, improving the efficiency only makes each task faster by one to two seconds. Please discuss whether this improvement is significant in real applications.

Real-world analytical workloads usually contain multiple aggregation tasks per query, and the performance gain provided by SAHA accumulates linearly. SAHA also scales linearly as the dataset grows. To verify this, I've added an additional experiment with the complete Weibo data (1 billion rows), which is used in real-world text sentiment analysis tasks, marked in <2.5>.

Reviewer 3 Report

Analytical database queries are at the core of business intelligence and decision support systems. To analyze the vast amounts of data available today (big data), query execution needs to be orders of magnitude faster. This paper focuses on the design and implementation of highly efficient database systems by optimizing analytical query execution using hash tables.

A hash function is any function that can be used to map data of arbitrary size to values of a fixed size, which index into the hash table. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:

  1. It should be easy to compute.
  2. It should provide a uniform distribution across the hash table and should not result in clustering.
  3. Fewer collisions: collisions occur when pairs of elements are mapped to the same hash value. These should be avoided (a simple way to measure the last two requirements is sketched after this list).
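
A generic way to measure the last two requirements empirically, unrelated to the paper's code: hash a sample of keys into buckets and compare the observed collision count against the baseline expected from a perfectly uniform hash:

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        const std::size_t buckets = 16384, n = 10000;
        std::vector<bool> occupied(buckets, false);
        std::hash<std::string> h;
        std::size_t collisions = 0;
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t b = h("key" + std::to_string(i)) % buckets;
            if (occupied[b]) ++collisions;  // slot already taken
            occupied[b] = true;
        }
        // A uniform hash leaves roughly m * (1 - (1 - 1/m)^n) slots
        // occupied; a much higher collision count suggests clustering.
        std::cout << "collisions: " << collisions << " of " << n << "\n";
        return 0;
    }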

In this paper, the authors address some common use cases of hash tables: aggregating and joining over arbitrary string data. They propose a hybrid hash table implementation, namely SAHA, integrated within analytical databases. The main idea is to inline short strings and save hash values for long strings only. The proposed hash table uses special memory loading techniques and vectorized processing to batch hashing operations.
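
The batching of hashing operations can be sketched as follows; the batch size, the stand-in FNV-1a hash, and all names are illustrative assumptions rather than SAHA's actual code:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <string_view>

    constexpr std::size_t kBatch = 8;

    // Stand-in scalar hash (FNV-1a), used only for illustration.
    inline uint64_t fnv1a(std::string_view s) {
        uint64_t h = 1469598103934665603ull;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ull;
        }
        return h;
    }

    // Hashing a small batch of keys in one pass keeps the inner loop free
    // of table probes, so it can be vectorized, and the probes that follow
    // can be issued back to back.
    inline void hash_batch(const std::string_view *keys,
                           std::array<uint64_t, kBatch> &out) {
        for (std::size_t i = 0; i < kBatch; ++i)  // independent iterations
            out[i] = fnv1a(keys[i]);
    }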

The authors evaluate SAHA hash table on different data sets and compare it with different existing hash tables in the literature. They show that SAHA outperforms the state-of-the-art hash tables by one to five times in analytical workloads.

Comments

  • The paper is well written and well organized
  • Many hash tables are presented, explained and compared
  • A comparative study is conducted to compare the SAHA method with state-of-the-art hash tables
  • Many tests are provided to show the performance of the SAHA hash table

Questions

How could you prove that your proposed SAHA hash table respects the basic requirements for hash tables, mainly uniform distribution and fewer collisions?

The authors evaluate their method by providing tests and results (benchmarks, execution time, analytical operations, …). To confirm these results, it is important to provide one or more cost-model parameters for practical validation. I mean, what are the most important parameters that could improve hash tables in general, and more precisely in the case of your proposition?


Author Response

Thank you very much for reviewing. I've revised the manuscript according to all the comments. All revisions are either highlighted or surrounded by tags <X.Y>, which can help navigate to the added content related to a given comment.

Following is the cover letter:

1. English language and style are fine/minor spell check required

I've carefully examined all spelling- and grammar-related errors and highlighted the related modifications.

2. To confirm these results, it is important to provide one or more cost-model parameters for practical validation

I've added the cost model, marked in <3.2>, which can help validate our optimizations. Some micro-benchmark experiments were also added to identify the important parameters of the given model.

3. How could you prove that your proposed SAHA hash table respects the basic requirements for hash tables, mainly uniform distribution and fewer collisions?

If I understand correctly, the question is about the quality of the hash function we used, as the reviewer wrote:

    > ...it is important to have a good hash function with the following basic requirements:
    > It should be easy to compute.
    > It should provide a uniform distribution across the hash table and should not result in clustering.
    > Fewer collisions: collisions occur when pairs of elements are mapped to the same hash value. These should be avoided.

I agree that a good hash function is very important, and there is extensive research on hash functions that would be appealing to discuss. However, we limit our work to hash tables, excluding hash functions, because:

    1. Our contributions are mainly low-level system optimizations that can be applied to general hash tables. These optimizations are orthogonal to the hash function, and we can adopt new hash functions if they turn out to have better properties, quality, or proven efficiency in production workloads. This is also explained in <3.2>.

    2. It is impossible to theoretically prove that a hash function generates a uniform distribution on all datasets. For any hash function, we can construct an ill-formed dataset that leads to plenty of collisions (see the sketch below). As a result, we need to carry out extensive experiments with real-world datasets to show that a hash function, e.g. CRC32Hash, is production-ready, which is what matters for industrial databases.
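
The pigeonhole argument above can be made concrete with a generic sketch, with std::hash standing in for any fixed hash function:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // For any fixed hash function and table size, enumerating keys and
    // keeping the ones that land in a single bucket yields a dataset on
    // which every insertion collides.
    std::vector<std::string> adversarial_keys(std::size_t buckets,
                                              std::size_t want) {
        std::hash<std::string> h;
        std::vector<std::string> out;
        for (uint64_t i = 0; out.size() < want; ++i) {
            std::string key = "k" + std::to_string(i);
            if (h(key) % buckets == 0)  // keep only keys hitting bucket 0
                out.push_back(key);
        }
        return out;
    }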
