Article
Peer-Review Record

Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees

Algorithms 2021, 14(2), 44; https://doi.org/10.3390/a14020044
by Dominik Köppl
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 1 January 2021 / Revised: 24 January 2021 / Accepted: 26 January 2021 / Published: 29 January 2021
(This article belongs to the Special Issue Algorithms and Data-Structures for Compressed Computation)

Round 1

Reviewer 1 Report

The contribution of the paper is twofold. The first part continues the work of the author on the computation of Lempel-Ziv parsings using succinct/compact suffix trees that apparently was started with his LZ-CISS/CICS algorithms. Here he addresses the non-overlapping version of LZ77 and, in addition, the LPF array. To tackle the non-overlapping case, the author modifies his previous approach in a non-trivial novel way and thus obtains (near-)linear construction algorithms with small space.

In the second part the author applies similar techniques to the problem of substring compression queries, which is to construct a data structure on a string that allows us to retrieve the compressed LZ78 representation of any substring in time close to linear in the size of the representation. This problem was studied in the literature in the setting of LZ77 rather than LZ78 (LZ78 is inferior) but the present data structure, as it seems, is the first one with compact space.

The results of the paper are sound and original. I recommend accepting the paper after a minor revision.

Minor suggestions and corrections:
lines 15, 17: I suggest replacing "the factor F_x" -> "each factor F_x, for x \in [1..z]," and removing "for every x \in [1..z]" in line 19
line 17: maybe it's better to say that F_x "has its only occurrence in F_1 ... F_x (as a suffix)"? Also, the definition does not cover the case of the last factor, which can have other occurrences
line 25: "we can obtain" -> "we obtain" (first reading it, I assumed because of this "can" that this theorem is an already known result)
line 27: "an alphabet" -> "an integer alphabet"
line 41: "an alphabet" -> "an integer alphabet"
line 71: apparently, the leaf-rank is the rank in the lexicographical order of all suffixes, not the preorder rank in the whole tree, as it follows from the following discussion where it is used in \Psi[leaf_rank(L)]. Correct it if I'm not wrong, or clarify.
line 79: "PLCP[i] = LCP[ISA[i]]" (not SA)
line 116: "F_x = T[k_x .. k_x + max(...) - 1]" (missed "-1")
line 230: "of size n bits" -> "of size 2n bits". If the RMQs take O(1) time, the size must be 2n bits; with n bits one has to access the array O(1) times so the query time then will be O(t_A)
line 230: "an RMQs" -> "RMQs"
line 237: "SA and the PLCP array" -> "LCP array". I'm not sure exactly which approach is implied here, but the standard one is to use an RMQ data structure on the LCP array.
line 238: "i \mapsto PLCP[SA[i]] for i \in [1..n]" -> "it" (meaning the LCP array); by the definition of PLCP, we have PLCP[SA[i]] = LCP[i], so this mapping makes little sense
line 282: it seems to me that the algorithm is incorrect in its present form. The problem is that when we follow a suffix link from w to sl(w) = \tilde{w}, sl(parent(w)) is not necessarily the parent of sl(w). This is an issue if w was a Type 3 witness. Then, the witness is actually an implicit node on the edge (parent(w), w), and the suffix link translation does not allow us to locate a node on the path sl(parent(w)) -> sl(w) that corresponds to this witness. I believe that the correct algorithm is to translate parent(w) into sl(parent(w)) and to continue descending from sl(parent(w)) rather than from sl(w). But the analysis of this algorithm is slightly more complicated. Anyway, something must be corrected here and explained more clearly.
line 376: p = |F_1 \cdots F_{x-1}|+1 (missed "+1")
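
The identity behind the comments on lines 79 and 238, PLCP[i] = LCP[ISA[i]] and hence PLCP[SA[i]] = LCP[i], can be checked on a small example. The following is only an illustrative sketch and is not taken from the paper: it builds the arrays naively in quadratic time, uses 0-based positions (the paper uses 1-based ones), and the helper names `suffix_array`, `lcp_length` and `lcp_plcp` are hypothetical.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda p: text[p:])


def lcp_length(a, b):
    # Length of the longest common prefix of two strings.
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lcp_plcp(text):
    sa = suffix_array(text)
    # ISA is the inverse permutation of SA: isa[sa[rank]] == rank.
    isa = [0] * len(text)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    # LCP[rank] compares the suffix of that rank with its lexicographic
    # predecessor; LCP[0] is defined as 0 here.
    lcp = [0] * len(text)
    for rank in range(1, len(text)):
        lcp[rank] = lcp_length(text[sa[rank - 1]:], text[sa[rank]:])
    # PLCP is LCP permuted into text order: PLCP[pos] = LCP[ISA[pos]].
    plcp = [lcp[isa[pos]] for pos in range(len(text))]
    return sa, isa, lcp, plcp


sa, isa, lcp, plcp = lcp_plcp("banana")
# Reading PLCP through SA gives back LCP, which is why the mapping
# i -> PLCP[SA[i]] is just the LCP array.
assert all(plcp[sa[i]] == lcp[i] for i in range(len(sa)))
```

For "banana" this yields SA = [5, 3, 1, 0, 4, 2], LCP = [0, 1, 3, 0, 0, 2] and PLCP = [0, 3, 2, 1, 0, 0].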

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The author considers two problems: construction of the LZ77 (in the non-overlapping LZSS variant) factorization of the input text
and the construction of the data structure for the input text T, which, given an interval I, returns an LZ78 representation of T[I].
In both cases the main goal is to reduce the space used by the algorithm, while maintaining the optimal or close to optimal running time.
This is achieved using the modern data structures for suffix arrays (and suffix trees) as well as some combinatorial analysis of the underlying problems.

The non-overlapping LZ77 factorization (in the LZSS variant) represents the input string w as a concatenation f_1f_2...f_k,
where each f_i is either a single letter and the first occurrence of this letter in the string or the longest prefix of f_i ...f_k
appearing in f_1...f_{i-1}.
The author refines the approach used previously by the author (and others) to compute a similar factorization in the overlapping case:
we traverse the suffix tree bottom-up for the first node of the new factor, noting which nodes are visited first,
which allows us to determine the new factor. This works for the overlapping variant;
for the non-overlapping one we traverse top-down, and some extra care is needed to limit the space usage.
While in general the approach is similar, many combinatorial observations and appropriate usage of data structures are needed,
as the non-overlapping case is different from (and more difficult than) the overlapping one.
The approach can also be used to compute a related table of longest previous non-overlapping factors.
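
To make the definition above concrete, here is a naive sketch of the non-overlapping LZSS-style factorization as described in this report. It is not the author's suffix-tree algorithm: it simply searches the already factorized prefix with Python's substring test, so it needs far more time than the near-linear bounds of the paper. The function name `lzss_nonoverlapping` is hypothetical.

```python
def lzss_nonoverlapping(text):
    """Factorize text so that each factor is either a fresh letter or the
    longest prefix of the remaining text that occurs entirely inside the
    already factorized prefix (no overlap with the factor itself)."""
    factors = []
    i = 0
    n = len(text)
    while i < n:
        length = 0
        # Extend the candidate while it still occurs within text[:i].
        while i + length < n and text[i:i + length + 1] in text[:i]:
            length += 1
        if length == 0:
            # First occurrence of this letter: the factor is the letter.
            length = 1
        factors.append(text[i:i + length])
        i += length
    return factors


factors = lzss_nonoverlapping("aaaa")
```

On "aaaa" the sketch produces the factors a, a, aa. The overlapping variant would instead allow the second factor to be aaa, because its previous occurrence may run into the factor itself; that difference is what the non-overlapping restriction removes.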

The LZ78 factorization of w represents the input string w as a concatenation f_1f_2...f_k,
where each f_i is a pair (f_i',a), where f_i = f_i'a, a is a single letter and f_i' is the longest prefix among f_1, ..., f_{i-1}
which is a prefix of f_i ... f_k.
To construct an LZ78 factorization of the given substring,
the author considers a centroid decomposition of a suffix tree of w,
in which we additionally mark the nodes corresponding to LZ78 factors of T[I].
This approach, while using known data structures and concepts, requires again additional work and combinatorial observations.
In particular, it is possible to obtain the LZ78 factorization in space related to the LZ78 space of the output.
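
The LZ78 definition above can be illustrated with a small dictionary-based sketch. It is not the centroid-decomposition data structure of the paper and makes no attempt at compact space. The names `lz78` and `lz78_decode` are hypothetical, and the convention of encoding a final factor that equals an earlier one as a pair with an empty letter is an assumption of this sketch.

```python
def lz78(text):
    """Return LZ78 factors as (reference, letter) pairs, where reference is
    the 1-based index of the referred factor, or 0 for the empty string."""
    index = {"": 0}  # maps each factor string to its factor number
    factors = []
    i = 0
    n = len(text)
    while i < n:
        j = i
        # Extend while text[i:j+1] is already a factor.
        while j < n and text[i:j + 1] in index:
            j += 1
        ref = index[text[i:j]]
        if j < n:
            factors.append((ref, text[j]))
            index[text[i:j + 1]] = len(factors)
            i = j + 1
        else:
            # The text ended inside a known factor: no new letter remains.
            factors.append((ref, ""))
            i = n
    return factors


def lz78_decode(factors):
    # Rebuild each factor from its reference and appended letter.
    phrases = [""]
    for ref, letter in factors:
        phrases.append(phrases[ref] + letter)
    return "".join(phrases[1:])


pairs = lz78("aaaa")
```

On "aaaa" the factors are a, aa, a, encoded as (0, 'a'), (1, 'a'), (1, ''). A substring compression query in the sense of this report would return such a list for T[I] without rescanning the substring letter by letter.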

For both problems the obtained results propose some trade-offs between time and space,
but in all cases the stress is on small space computation.

EVALUATION
I believe that the paper deserves to be published in Algorithms.

I find the paper interesting; it uses up-to-date techniques to tackle fundamental problems.
In all cases some additional combinatorial analysis is needed to obtain the results.

On the negative side, all the tools are known and the results are not ground-breaking.


In general, I find the English good or maybe even very good; I spotted no mistakes or even typos.
On the other hand, I would like the author to try to improve the presentation.
At the moment there are a lot of informal references to previous steps, and it is essentially assumed
that the reader can hold the whole construction in their head and understand the author's informal references
to the parts of this construction.
In almost all cases this can indeed be deduced, but some small improvements would greatly improve the readability.

COMMENTS FOR THE AUTHOR
Whenever you use some operations on the data structure, like
leaf_rank(next_leaf(\lambda)), please also provide the intuitive meaning of this operation: I understand that in
data structures the crucial thing is that the desired operation can indeed be performed by the provided ones,
but it is equally important for the reader to actually understand what we are trying to achieve.

160: ``this situation is the same as in the overlapping LZSS factorization)''

It is unclear what is meant. Perhaps write that here the choice of factor is the same as in the LZSS factorization, or something like this.

198: an example of the general comment: please write that the two intervals correspond to the factor and its previous occurrence.

368: please recall that LZ trie nodes are consecutive on the edge, so their number corresponds to the lowest one.

385/386
either u and v is w

I am not sure what this is supposed to mean: that one of u, v is w? Then I would use either u or v is w.

400: In your application (substring compression queries) you do not use fringe marked ancestors explicitly,
as you begin the exploration from some designated node, so you rather apply it to a subtree of the suffix tree.
Please comment on whether the data structures used (to be more specific, the one from [4])
also work in this case (I guess so). In particular, is there any problem when your root is an implicit node in the suffix tree?

404: please explain what an exponential search is

404: Please explain what this conceptual array is, or better, say in plain words what you want to compute and how you do it using
the data structure.

430: after conducting a pass as described above.

It is unclear which pass is meant. Please comment in 4.4 that we have that info after describing the pass.

431: Therefore, we can mark the preorder numbers of the edge witnesses in a bit vector of length at most 2n.

This was confusing to me. Maybe ``positions corresponding to numbers'' or something in this direction.
Or just ``mark i-th bit when ...''

459: explain BP

465: ``to explain the search''

What search?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The paper was nicely complemented with additional clarifications and figures. I verified the new text and believe it can be accepted in its present form.

Reviewer 2 Report

I am happy with the author response and the new version of the submission; I believe it can be accepted as is.
