Peer-Review Record

Text Indexing for Faster Gapped Pattern Matching

Algorithms 2024, 17(12), 537; https://doi.org/10.3390/a17120537

by Md Helal Hossen¹, Daniel Gibney¹ and Sharma V. Thankachan²,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Reviewer 4:

Xiao Sun

Reviewer 5:

Takuya Kida

Submission received: 9 October 2024 / Revised: 13 November 2024 / Accepted: 21 November 2024 / Published: 23 November 2024

(This article belongs to the Special Issue Selected Algorithmic Papers from IWOCA 2024)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

I have one style comment and some minor text edits.

The style comment is about Lemma 9. In this Lemma the authors establish that a sub-array of size less than G may occur in at most 18 leaf sub-arrays. My comment is that this result gives a sort of messy feeling to the structure. My suggestion would be to choose the leaf sub-arrays from the arrays given in Figure 1. This meant that the tree (of section 3.1) was not built by choosing exactly the midpoint of the array A, but instead something like G ⌊(1+N)/(2G)⌋. This way Lemma 9 can be rewritten to at most 2 leaf sub-arrays.
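As an illustrative aside, the split point proposed in this comment can be compared against a plain midpoint. The sketch below is only a reading of the formula G ⌊(1+N)/(2G)⌋; the function names are hypothetical, and taking ⌊(1+N)/2⌋ as the "exact" midpoint is an assumption, since the manuscript's own definition is not reproduced in this record.

```python
def exact_midpoint(N):
    """Plain midpoint of the index range 1..N (assumed form, not taken from the paper)."""
    return (1 + N) // 2


def aligned_midpoint(N, G):
    """Split point suggested by the reviewer: G * floor((1 + N) / (2 * G)).

    The result is always a multiple of G and lies in the half-open
    interval ((1 + N) / 2 - G, (1 + N) / 2].
    """
    return G * ((1 + N) // (2 * G))


# Example: N = 100, G = 8 gives an aligned split at 48 instead of 50.
N, G = 100, 8
print(exact_midpoint(N), aligned_midpoint(N, G))
```

Because every split then falls on a multiple of G, leaf boundaries would line up with a fixed grid, which is presumably what allows the tighter bound mentioned in the comment. Note that the formula returns 0 when 1 + N < 2G, so a real construction would have to stop splitting before that point.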

Rewrite the sentence in the preprocess of Problem 3 (line 98). It is intuitively interpreted as N being the size of U. I know that is not what it says, but it is confusing. Swapping U and the sum may help.

Consider trying to merge the text of Problem 3 and 4, as they are quite similar.

Line 129 "Both arrays ARE decomposed into .."

Check the exponent 2^k in line 130. Should it be 2^j?

Check line 240 "... not as non-middle ..."

Line 308 "effect -> affect"

Author Response

• Comment: The style comment is about Lemma 9. In this Lemma the
authors establish that a sub-array of size less than G may occur in at
most 18 leaf sub-arrays. My comment is that this result gives a sort of
messy feeling to the structure. My suggestion would be to choose the leaf
sub-arrays from the arrays given in Figure 1. This meant that the tree (of
section 3.1) was not built by choosing exactly the midpoint of the array
A, but instead something like G⌊(1 + N)/(2G)⌋. This way Lemma 9 can
be rewritten to at most 2 leaf sub-arrays.

• Response: We spent some time considering the modification to the midpoints
suggested by the reviewer and even made a revised draft, updating
all the lemmas as needed. We found, however, that this complicated a
number of other parts of the proof and pseudo-code, making them less
intuitive than what was there previously. More importantly, we discovered
by reviewing the proof of Lemma 9 that with our current choice of
midpoint, we can show that a subarray of size at most G can have nonempty
intersection with at most three middle child leaves. We believe
this alleviates the messiness issue mentioned by the reviewer while also
avoiding any of the additional complications that we were seeing with the
suggested modification.

• Comment: Rewrite the sentence in the preprocess of Problem 3 (line
98). It is intuitively interpreted as N being the size of U. I know that is
not what it says, but it is confusing. Swapping U and the sum may help.

• Response: Agreed. We see how this could be confusing, so we changed
it as advised, putting the definition of N first.

• Comment: Consider trying to merge the text of Problem 3 and 4, as
they are quite similar.

• Response: Per the reviewer’s comment, we have merged the definitions
of Problems 3 and 4.

• Comment: Line 129 "Both arrays ARE decomposed into .."

• Response: Fixed.

• Comment: Check the exponent 2^k in line 130. Should it be 2^j?

• Response: Fixed.

• Comment: Check line 240 "... not as non-middle ..."

• Response: Fixed.

• Comment: Line 308 "effect → affect"

• Response: Fixed.

Reviewer 2 Report

Comments and Suggestions for Authors

In the gapped string indexing problem, one has to pre-process a text of length n to support the following queries. Given two patterns P1 and P2 and an integer range [a,b], count/report all pairs of occurrences of P1 and P2 separated by s text positions (i.e. distance between the end of the occurrence of P1 and the beginning of the occurrence of P2), with s ∈ [a,b]. Conditional lower bounds suggest that the index must take essentially quadratic space if the query time has to be polylogarithmic. Previous research has shown that polynomial time (with arbitrarily small exponent) can be achieved in strongly sub-quadratic space.
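To make the query semantics above concrete, here is a minimal brute-force sketch with no index at all. It is not the method of the paper or of any prior work; the function names are hypothetical, positions are 0-based, and the gap is taken to be the number of text positions strictly between the end of the P1 occurrence and the start of the P2 occurrence.

```python
def find_occurrences(text, pattern):
    """Return all 0-based start positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def gapped_matches(text, p1, p2, a, b):
    """Report all pairs (i, j) such that p1 occurs at i, p2 occurs at j,
    and the gap j - (i + len(p1)) lies in [a, b]."""
    occ1 = find_occurrences(text, p1)
    occ2 = find_occurrences(text, p2)
    result = []
    for i in occ1:
        end1 = i + len(p1)  # first position after the occurrence of p1
        for j in occ2:
            if a <= j - end1 <= b:
                result.append((i, j))
    return result


print(gapped_matches("abcabcab", "ab", "c", 0, 0))  # pairs with gap exactly 0
print(gapped_matches("abcabcab", "ab", "c", 1, 3))  # pairs with gap in [1, 3]
```

This baseline scans the text for every query and compares every pair of occurrences, which is exactly the cost an index is meant to avoid.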

In this paper, the authors optimize the existing sub-quadratic index and make the reporting time (per occurrence) polynomial in the gap between the two occurrences, rather than in the text’s length n. The solution is based on a reduction of the problem to the gapped set intersection problem, for which there exist sub-quadratic solutions running in polynomial time. The authors first use this solution as a black box to make the reporting running time polynomial in the gaps, rather than in the input’s size. The overall idea of the solution is simple (the finer details are easy to figure out): essentially, build a binary tree over the universe, where leaves span an interval of size at most 2G, and overlap by G positions. Then, repeat the construction for exponentially-increasing G. At query time, split the range [a,b] into exponentially-increasing sub-ranges, and query them on the appropriate structure (i.e. appropriate G value). Somewhat less simple is the analysis (I did not spot mistakes).
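The two ingredients named in this summary, overlapping intervals of size at most 2G and a split of [a,b] into exponentially growing sub-ranges, can be sketched as follows. This is a rough illustration under stated assumptions rather than the authors' construction: only the leaf level is modelled (no tree), the band boundaries are powers of two, each piece is paired with a bound G that strictly exceeds its largest gap, and all names are hypothetical.

```python
def overlapping_blocks(n, G):
    """Half-open intervals [start, end) of length at most 2G, starting every G
    positions, so that consecutive intervals overlap by up to G positions."""
    return [(s, min(s + 2 * G, n)) for s in range(0, n, G)]


def split_gap_range(a, b):
    """Split the integer gap range [a, b] (0 <= a <= b) into pieces (lo, hi, G).

    Each piece stays inside one band: band 0 is [0, 1] and band j >= 1 is
    [2^j, 2^(j+1) - 1]. G = 2^(j+1) is an assumed bound with hi < G.
    """
    if not 0 <= a <= b:
        raise ValueError("expected 0 <= a <= b")
    pieces = []
    lo = a
    while lo <= b:
        j = max(lo, 1).bit_length() - 1  # band index; gap 0 shares band 0 with gap 1
        G = 2 ** (j + 1)
        hi = min(b, G - 1)
        pieces.append((lo, hi, G))
        lo = hi + 1
    return pieces


print(overlapping_blocks(10, 3))
print(split_gap_range(3, 20))
```

With this layout, any window of at most G consecutive positions lies entirely inside the block that starts at the largest multiple of G not exceeding the window's first position, and a range [a,b] is split into O(log b) pieces.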

Pros: the solution slightly improves the state of the art, making the reporting time polynomial in the gap length rather than in the text length.

Cons: the solution is easy and not particularly surprising. The writing is sloppy in some places and could be made more precise.

Overall, I found the contribution to be appropriate for the journal, provided that the authors fix my minor comments.

Comments on the Quality of English Language

Some comments:

- line 98: add “integer” universe, or at least specify the meaning of operator + over the universe.

- line 115: unbalanced parenthesis

- line 121: the sentence is incomplete, or the parentheses are not used in the correct way: “answering queries (…), where N …”

- 129: both arrays _are_ decomposed

- 130: 2^k should be 2^j

- 150: what do you mean with “for implementation”?

- lines 154-159 are very vague. Could you be more formal here?

- 258 “is called on in” does not sound right.

- 326: accross -> across

- 343: it’s a bit misleading to say that the query time is parameterized by the gap length, since you still have a n^delta term. Maybe it would be better to say that just the reporting time is parameterized.

Author Response

• Comment: line 98: add “integer” universe, or at least specify the meaning
of operator + over the universe.

• Response: We modified the sentence to include that it is an integer universe

• Comment: line 115: unbalanced parenthesis

• Response: fixed

• Comment: line 121: the sentence is incomplete, or the parentheses are
not used in the correct way: “answering queries (. . . ), where N . . . ”

• Response: The Lemma has been rewritten to make the distinction between
the existential and reporting solutions clearer.

• Comment: 129: both arrays ‘are’ decomposed

• Response: ‘are’ has been added.

• Comment: 130: 2^k should be 2^j

• Response: Fixed

• Comment: 150: what do you mean with “for implementation”?

• Response: We agree that was ambiguous. It has been removed.

• Comment: lines 154-159 are very vague. Could you be more formal here?

• Response: We have rewritten this paragraph to make it less ambiguous.

• Comment: 258 “is called on in” does not sound right.

• Response: Changed to:
Let V be the set of nodes on which Search is executed in Algorithm 2

• Comment: 326: accross → across

• Response: Fixed.

• Comment: 343: it’s a bit misleading to say that the query time is parameterized
by the gap length, since you still have a n^δ term. Maybe it
would be better to say that just the reporting time is parameterized.

• Response: Changed to the reviewer’s suggestion.

Reviewer 3 Report

Comments and Suggestions for Authors

Authors revisit the indexed gapped pattern matching problem and engineer a solution with gap-sensitive query time. The paper is written in a self-contained manner, apart from using two lemmas on gapped set intersection as black boxes. The self-contained material covers all aspects related to text indexing, starting from a reduction that enables the use of the aforementioned lemmas. The solution is developed in phases, starting with an easier to follow but inefficient solution (linear blocking scheme) and continuing with the more involved and efficient solution (hierarchical blocking scheme). The analysis of the final solution is insightful, requiring adaptation of Kraft-McMillan inequality.

Some minor remarks:
- line 129: Both arrays [are] decomposed
- line 142: You use array A[1..N], but the union of S_i's could be shorter. Maybe one could mention that the rest of the array can be filled arbitrarily.
- line 263: extra "are"
- line 279: overall -> over all
- line 308: effect -> affect

Author Response

• Comment: Both arrays [are] decomposed

• Response: Fixed.

• Comment: line 142: You use array A[1..N], but the union of S_i's could
be shorter. Maybe one could mention that the rest of the array can be
filled arbitrarily.

• Response: Good point. We modified the line to make N the cardinality of the union. Since this value of N is always bounded above by the N as defined in Problem 3, the arguments used still hold.

• Comment: line 263: extra ”are”

• Response: Fixed.

• Comment: line 279: overall → over all

• Response: Fixed.

• Comment: line 308: effect → affect

• Response: Fixed.

Reviewer 4 Report

Comments and Suggestions for Authors

The paper addresses the gapped string indexing problem, which involves preprocessing a text to efficiently locate all occurrences of a gapped pattern. A gapped pattern $P = P_1[\alpha \dots \beta]P_2$ matches the text $T$ at positions $(i,j)$ if $P_1$ occurs in $T$ at $i$, $P_2$ occurs in $T$ at $j$, and the distance between $i$ and $j$ minus the length of $P_1$ lies between $\alpha$ and $\beta$. The authors build upon recent work, which introduced a sub-quadratic space index with sub-linear query times. The main contribution of this paper is an enhanced indexing structure that offers gap-sensitive query times, making it faster when numerous occurrences have small gaps. Overall, the paper presents a solid theoretical contribution to the field of gapped string indexing with its innovative gap-sensitive query mechanism and rigorous analysis.

Merits:
- Building upon the latest results [STACS 2024], this progression demonstrates the authors' ability to push the boundaries of current research, setting new benchmarks in the field of gapped string indexing.
- The assumption of small gaps is well-justified, as many real-world string problems exhibit such structures.
- The authors provide comprehensive theoretical analysis, including detailed proofs of correctness, space complexity, and query time. The constructive proof itself is also nontrivial and noteworthy.

Minor Issues:
- The paper does not clearly explain how the preprocessing phase differs from that in the [STACS 2024] paper. A more explicit comparison would help readers understand the novel aspects of the proposed method.
- While the algorithm is theoretically sound and not overly complex to implement, the paper lacks empirical comparisons with baseline methods. Including experimental results would strengthen the paper by demonstrating practical performance gains.

Suggestions / Discussions:
- When gaps are large, the proposed gap-sensitive algorithm may underperform compared to the [STACS 2024] baseline. It would be beneficial to explore the possibility of a meta algorithm that employs a unified preprocessing process while maintaining gap sensitivity for small gaps. Such an approach could potentially offer the best of both worlds, ensuring efficient performance across a wider range of gap sizes.

Author Response

• Comment: The paper does not clearly explain how the preprocessing
phase differs from that in the [STACS 2024] paper. A more explicit comparison
would help readers understand the novel aspects of the proposed
method.

• Response: We added a few lines to this effect to the introduction. Essentially, our solution can be viewed as performing extra preprocessing in the case where some gap bound G is known in advance.
Our preprocessing techniques differ from those of Bille et al. [19] in that we rely on the bounded nature of the gaps to perform the proposed blocking technique. Indeed, Theorem 2 is accomplished through extra preprocessing (blocking and building data structures for blocks) that is possible only when an upper bound G > β is known in advance. As described in Section 3.6, Theorem 1 is then achieved by applying Theorem 2 for different ranges of G.

• Comment: While the algorithm is theoretically sound and not overly
complex to implement, the paper lacks empirical comparisons with baseline
methods. Including experimental results would strengthen the paper
by demonstrating practical performance gains.

• Response: We feel that the main merit of this paper is the theoretical
aspects of the time complexity and space bounds. As we are unaware
of any implementations that could serve for good practical comparisons,
we are uncertain of the potential benefits of providing an implementation
for the current manuscript. We hope that the reviewer will be willing to
accept this work on its current merits and leave the implementation for
future work.

• Comment: When gaps are large, the proposed gap-sensitive algorithm
may underperform compared to the [STACS 2024] baseline. It would be
beneficial to explore the possibility of a meta algorithm that employs a
unified preprocessing process while maintaining gap sensitivity for small
gaps. Such an approach could potentially offer the best of both worlds,
ensuring efficient performance across a wider range of gap sizes.

• Response: We added a paragraph to the conclusion pointing to this as
a potential future research direction.

Reviewer 5 Report

Comments and Suggestions for Authors

This paper discusses a method to improve on the results of Bille et al. [19] regarding text indexing for gapped pattern matching. The results are clearly presented and the paper is well written. In particular, the proofs of the computational complexities of the proposed algorithm are carefully written.

The weaknesses of this paper are that the idea of the method is not new and the improvement in computational complexity is subtle. Since no experimental results have been presented, it is questionable how effective this algorithm would be if implemented in practice.

Other comments:

L120-121, Lemma 3) s(N) and t(N) are not explained in the text.

Equation between L333 and L334) Since this expression exceeds the width of the line, it would be better to break the line at the inequality sign.

Author Response

• Comment: The weaknesses of this paper are that the idea of the method
is not new and the improvement in computational complexity is subtle.
Since no experimental results have been presented, it is questionable how
effective this algorithm would be if implemented in practice.

• Response: As discussed in response to an earlier reviewer’s comment,
we believe that this work is primarily theoretical in nature, and hope the
reviewer can leave the implementation for future research.

• Comment: L120-121, Lemma 3) s(N) and t(N) are not explained in the
text.

• Response: We rewrote the first line of this Lemma to hopefully make
the definitions of s(N) and t(N) more clear.

• Comment: Equation between L333 and L334) Since this expression exceeds
the width of the line, it would be better to break the line at the
inequality sign.
