Efficient Data Structures for Range Shortest Unique Substring Queries

Abstract: Let T[1, n] be a string of length n and T[i, j] be the substring of T starting at position i and ending at position j. A substring T[i, j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α, β], return a shortest substring T[i, j] of T with exactly one occurrence in [α, β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√n log^ε n) query time, where ε > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].


Introduction
Finding regularities in strings is one of the main topics of combinatorial pattern matching and its applications [1]. Among the most well-studied types of string regularities is the notion of repeat. Let T[1, n] be a string of length n. A substring T[i, j] of T is called a repeat if it occurs more than once in T. The notion of unique substring is dual: it is a substring T[i, j] of T that does not occur more than once in T. Computing repeats and unique substrings has applications in computational biology [2,3] and information retrieval [4,5].
In this paper, we are interested in the notion of shortest unique substring. All of the shortest unique substrings of string T can be computed in O(n) time using the suffix tree data structure [6,7]. Many different problems based on this notion have already been studied. Pei et al. [4] considered the following problem on the so-called position (or point) queries. Given a position i of T, return a shortest unique substring of T covering i. The authors gave an O(n²)-time and O(n)-space algorithm, which finds the shortest unique substring covering every position of T. Since then, the problem has been revisited and optimal O(n)-time algorithms have been presented by Ileri et al. [8] and Tsuruta et al. [9]. Several other variants of this problem have been investigated [10–19].
We introduce a natural generalization of the shortest unique substring problem. Specifically, our focus is on the range version of the problem, which we call the Range Shortest Unique Substring (rSUS) problem. The task is to construct a data structure over T to be able to answer the following type of online queries efficiently. Given a range [α, β], return a shortest substring T[k, k + h − 1] of T with exactly one occurrence (starting position) in [α, β]; i.e., k ∈ [α, β], there is no k′ ∈ [α, β] with k′ ≠ k such that T[k, k + h − 1] = T[k′, k′ + h − 1], and h is minimal. Note that this substring, T[k, k + h − 1], may end at a position k + h − 1 > β. Further note that there may be multiple shortest unique substrings.
Range queries are a classic data structure topic [20–22]. A range query q = f(A, i, j) on an array of n elements over some set S, denoted by A[1, n], takes two indices 1 ≤ i ≤ j ≤ n, a function f defined over arrays of elements of S, and outputs f(A[i], …, A[j]). Range query data structures have also been specifically considered for strings [23–26]. For instance, in bioinformatics applications we are often interested in finding regularities in certain regions of a DNA sequence [27–31]. In the Range-LCP problem, defined by Amir et al. [23], the task is to construct a data structure over T to be able to answer the following type of online queries efficiently. Given a range [α, β], return i, j ∈ [α, β] such that the length of the longest common prefix of T[i, n] and T[j, n] is maximal among all pairs of suffixes within this range. The state of the art is an O(n)-word data structure supporting O(log^{O(1)} n)-time (polylogarithmic-time) queries [25] (see also [26,32]).

Main Problem and Main Results
An alphabet Σ is a finite nonempty set of elements called letters. We fix a string T[1, n] = T[1] ⋯ T[n] over Σ. The length of T is denoted by |T| = n. By T[i, j] = T[i] ⋯ T[j], we denote the substring of T starting at position i and ending at position j of T. We say that another string P has an occurrence in T or, more simply, that P occurs in T if P = T[i, i + |P| − 1], for some i. Thus, we characterize an occurrence of P by its starting position i in T. A prefix of T is a substring of T of the form T[1, i] and a suffix of T is a substring of T of the form T[i, n].
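We next formally define the main problem considered in this paper.

Problem rSUS
Input: A string T[1, n].
Query: A range [α, β], where 1 ≤ α ≤ β ≤ n.
Output: A pair (p, ℓ), such that T[p, p + ℓ − 1] is a shortest substring of T with exactly one occurrence in [α, β].

If α = β, the answer (α, 1) is trivial. So, in the rest of the paper, we assume that α < β. For example, given the query range [α, β] = [5, 16], we need to find a shortest substring of T with exactly one occurrence in [5, 16]. The output here is (p, ℓ) = (10, 2), because T[10, 11] = ac is the shortest substring of T with exactly one occurrence in [5, 16].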
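The problem statement can be made concrete with a brute-force reference solver. The following is a minimal sketch rather than one of the data structures of this paper: it simply tries every length and every starting position in the range. The function names and the test string used in the usage note are illustrative assumptions; the string was chosen only so that the query [5, 16] yields the output (10, 2).

```python
def occurrences(T, sub):
    """1-based starting positions of sub in T (overlapping occurrences allowed)."""
    m = len(sub)
    return [i + 1 for i in range(len(T) - m + 1) if T[i:i + m] == sub]


def rsus_bruteforce(T, alpha, beta):
    """Return (p, l) with l minimal such that T[p, p+l-1] has exactly one
    occurrence starting in [alpha, beta]. Positions are 1-based.
    Ties are broken towards the smallest p (an arbitrary choice)."""
    n = len(T)
    for length in range(1, n + 1):
        # Only starts p for which T[p, p+length-1] fits inside T.
        for p in range(alpha, min(beta, n - length + 1) + 1):
            sub = T[p - 1:p - 1 + length]
            starts = [s for s in occurrences(T, sub) if alpha <= s <= beta]
            if starts == [p]:
                return p, length
    return None  # not reached: T[alpha, n] occurs only once in [alpha, beta]


# Hypothetical input string, not taken from the text.
T_EXAMPLE = "caabcaddaacaddaaaabac"
ANSWER = rsus_bruteforce(T_EXAMPLE, 5, 16)
```

Under these assumptions, `ANSWER` is `(10, 2)`: no single letter is unique among the starting positions 5 to 16, and `ac` at position 10 is the only length-2 substring with exactly one occurrence in that range.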
Our main results are summarized below. We consider the standard word-RAM model of computation with w-bit machine words, where w = Ω(log n), for stating our results.

Theorem 1. We can construct an O(n log n)-word data structure that can answer any rSUS query on T[1, n] in O(log_w n) time.

Theorem 2. We can construct an O(n)-word data structure that can answer any rSUS query on T[1, n] in O(√n log^ε n) time, where ε > 0 is an arbitrarily small constant.

Paper Organization
In Section 2, we prove Theorem 1 and, in Section 3, we prove Theorem 2. We conclude this paper in Section 4 with some future proposals. An early version of this paper appeared as [33]. When compared to that early version, Theorem 2 is new.

An O(n log n)-Word Data Structure
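Our construction is based on ingredients such as the suffix tree [7], heavy-light decomposition [34], and a geometric data structure for rectangle stabbing [35]. Let us start with some definitions.

Definition 1. For a position k ∈ [1, n] and h ≥ 1, we define Prev(k, h) and Next(k, h), as follows:

Prev(k, h) = max({j < k | T[j, j + h − 1] = T[k, k + h − 1]} ∪ {−∞}),
Next(k, h) = min({j > k | T[j, j + h − 1] = T[k, k + h − 1]} ∪ {+∞}).

Intuitively, let x and y be the occurrences of T[k, k + h − 1] right before and right after the position k, respectively. Subsequently, Prev(k, h) = x and Next(k, h) = y. If x (resp., y) does not exist, then Prev(k, h) = −∞ (resp., Next(k, h) = +∞).

Definition 2. For a position k ∈ [a, b], we define λ(a, b, k), as follows:

λ(a, b, k) = min{h | Prev(k, h) < a and Next(k, h) > b}.

Intuitively, λ(a, b, k) denotes the length of the shortest substring that starts at position k with exactly one occurrence in [a, b].

Definition 3. For a position k ∈ [1, n], we define C_k, as follows:

C_k = {λ(a, b, k) | 1 ≤ a ≤ k ≤ b ≤ n}.

Intuitively, C_k stores the set of candidate lengths for the shortest unique substrings starting at position k. We make the following observation.

Observation 1. For any 1 ≤ a ≤ k ≤ b ≤ n, λ(a, b, k) = min{h ∈ C_k | Prev(k, h) < a and Next(k, h) > b}.

The following combinatorial lemma is crucial for efficiency.

Lemma 1. ∑_k |C_k| = O(n log n).

Proof. The proof of Lemma 1 is deferred to Section 2.1.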
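The quantities Prev, Next, λ and C_k can be evaluated naively, which is useful for checking small examples by hand. The sketch below is an illustration under stated assumptions, not part of the construction: it scans T directly, so it is far slower than anything the paper relies on. The function names are hypothetical. Since no terminal letter $ is appended here, λ may be undefined for some (a, b, k); this sketch returns +∞ in that case and leaves such values out of C_k.

```python
import math


def prev_occ(T, k, h):
    """Prev(k, h): closest start j < k with T[j, j+h-1] = T[k, k+h-1], else -inf."""
    sub = T[k - 1:k - 1 + h]
    for j in range(k - 1, 0, -1):
        if T[j - 1:j - 1 + h] == sub:
            return j
    return -math.inf


def next_occ(T, k, h):
    """Next(k, h): closest start j > k with T[j, j+h-1] = T[k, k+h-1], else +inf."""
    sub = T[k - 1:k - 1 + h]
    for j in range(k + 1, len(T) - h + 2):
        if T[j - 1:j - 1 + h] == sub:
            return j
    return math.inf


def lam(T, a, b, k):
    """lambda(a, b, k): least h with Prev(k, h) < a and Next(k, h) > b.
    Returns +inf if no such h fits inside T (assumption: no sentinel $)."""
    for h in range(1, len(T) - k + 2):
        if prev_occ(T, k, h) < a and next_occ(T, k, h) > b:
            return h
    return math.inf


def candidates(T, k):
    """C_k: all finite values lambda(a, b, k) over 1 <= a <= k <= b <= n."""
    n = len(T)
    lengths = {lam(T, a, b, k) for a in range(1, k + 1) for b in range(k, n + 1)}
    lengths.discard(math.inf)
    return lengths
```

For instance, with the hypothetical input `T = "abab"`, `candidates(T, 1)` is `{1, 3}` and `candidates(T, 2)` is `{1, 2}`. Summing `len(candidates(T, k))` over all k gives the quantity bounded by Lemma 1.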
We are now ready to present our construction. By Observation 1, for a given query range [α, β], the answer (p, ℓ) that we are looking for is the pair (k, h) with the minimum h, such that h ∈ C_k, α ∈ (Prev(k, h), k], and β ∈ [k, Next(k, h)). We map each h ∈ C_k into a weighted rectangle R_{k,h} with weight h, which is defined as follows:

R_{k,h} = [Prev(k, h) + 1, k] × [k, Next(k, h) − 1].

Let R be the set of all such rectangles; then the lowest weighted rectangle in R stabbed by the point (α, β) is R_{p,ℓ}. In short, an rSUS query on T[1, n] with an input range [α, β] can be reduced to an equivalent top-1 rectangle stabbing query on a set R of rectangles with input point (α, β). In the 2-d Top-1 Rectangle Stabbing problem, we preprocess a set of weighted rectangles in 2-d, so that given a query point q the task is to report the highest (or lowest) weighted rectangle containing q [35]. Similarly, here, the task is to report the lowest weighted rectangle in R containing the point (α, β) (see Figure 1 for an illustration). By Lemma 1, we have that |R| = O(n log n). Therefore, by employing the optimal data structure for top-1 rectangle stabbing presented by Chan et al. [35], which takes O(|R|)-word space supporting O(log_w |R|)-time queries, we arrive at the space-time trade-off of Theorem 1. This completes our construction.
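The reduction can be illustrated with a naive sketch in which the rectangle set R is built by direct scanning and the stabbing query is answered by a linear scan. This replaces the data structure of Chan et al. [35] with the simplest possible stand-in, so it shows only the correctness of the reduction, not its efficiency. The function names and the tuple layout are assumptions made for this illustration. A length h is kept for a start k only when the pair (Prev(k, h), Next(k, h)) differs from that of h − 1, which is one way to enumerate the candidate lengths.

```python
import math


def build_rectangles(T):
    """Return tuples (x_lo, x_hi, y_lo, y_hi, h, k), all bounds inclusive,
    one per start k and candidate length h: [Prev+1, k] x [k, Next-1]."""
    n = len(T)
    rects = []
    for k in range(1, n + 1):
        last = None
        for h in range(1, n - k + 2):
            sub = T[k - 1:k - 1 + h]
            occ = [j for j in range(1, n - h + 2) if T[j - 1:j - 1 + h] == sub]
            prev = max((j for j in occ if j < k), default=-math.inf)
            nxt = min((j for j in occ if j > k), default=math.inf)
            if (prev, nxt) != last:  # a longer h matters only if the rectangle grows
                rects.append((prev + 1, k, k, nxt - 1, h, k))
                last = (prev, nxt)
    return rects


def stab(rects, alpha, beta):
    """Lowest-weight rectangle containing the point (alpha, beta), by linear scan.
    Returns (k, h); ties are broken towards the smallest k (an arbitrary choice)."""
    hits = [(h, k) for (x1, x2, y1, y2, h, k) in rects
            if x1 <= alpha <= x2 and y1 <= beta <= y2]
    h, k = min(hits)
    return k, h
```

With the hypothetical string `"caabcaddaacaddaaaabac"` and the point (5, 16), `stab(build_rectangles(T), 5, 16)` returns `(10, 2)`, in line with the example query of the problem definition.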


Proof of Lemma 1
Let lcp(i, j) denote the length of the longest common prefix of the suffixes of T starting at positions i and j in T. Additionally, let S denote the set of all (x, y) pairs, such that 1 ≤ x < y ≤ n and lcp(x, y) > lcp(x, z), for all z ∈ [x + 1, y − 1]. The proof can be broken down into Lemma 2 and Lemma 3.
The following statements can be deduced by a simple contradiction argument: Figure 2 illustrates the proof for the first statement. The second one can be proved in a similar fashion.

Lemma 3. |S| = O(n log n).
Consider the suffix tree data structure of string T[1, n], which is a compact trie of the n suffixes of T appended with a letter $ ∉ Σ [7]. This suffix tree consists of n + 1 leaves (one for each suffix of T$) and at most n internal nodes. The edges are labeled with substrings of T. Let u be the lowest common ancestor of the leaves corresponding to the strings T[x, n]$ and T[y, n]$. Subsequently, the concatenation of the edge labels on the path from the root to u is exactly the longest common prefix of T[x, n]$ and T[y, n]$. For any node u, we denote, by size(u), the total number of leaf nodes of the subtree rooted at u.
We decompose the nodes in the suffix tree into light and heavy nodes. The root node is light and for any internal node, exactly one child is heavy. Specifically, the heavy child is the one having the largest number of leaves in its subtree (ties are broken arbitrarily). All other children are light. This tree decomposition is known as heavy-light decomposition. We have the following critical observation. Any path from the root to a leaf node contains many nodes; however, the number of light nodes is at most log n [34,36]. Additionally, corresponding to the n + 1 leaves of the suffix tree, there are n + 1 paths from the root to the leaves. Therefore, the sum of subtree sizes over all light nodes is O(n log n).
We are now ready to complete the proof. Let S_u ⊆ S denote the set of pairs (x, y), such that the lowest common ancestor of the leaves corresponding to suffixes T[x, n]$ and T[y, n]$ is u. Clearly, the paths from the root to the leaves that correspond to suffixes T[x, n]$ and T[y, n]$ pass through two distinct children of node u, and at least one of the two must be a light node. There are two cases. In the first case, both leaves are under light children. In the second case, one leaf is under a light child and the other is under the heavy child. In both cases, we have at least one leaf under a light node. If we fix the leaf that is under the light node, we can enumerate the total number of pairs based on the subtree size of the light nodes. Therefore, |S_u| is at most twice the sum of size(·) over all light children of u. Since |S| = ∑_u |S_u|, we can bound |S| by the sum of size(·) over all light nodes in the suffix tree, which is O(n log n). This completes the proof of Lemma 3.
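The heavy-light classification used in the proof can be sketched on an arbitrary rooted tree. The sketch below is an illustration only: the tree is given as a hypothetical dictionary from a node to its list of children (it is not a suffix tree), size(u) counts leaves as in the text, and ties for the heavy child are broken towards the first child in the list.

```python
def leaf_counts(children, root):
    """size(u): number of leaves in the subtree rooted at u."""
    order, stack = [], [root]
    while stack:  # iterative pre-order; every parent precedes its descendants
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    size = {}
    for u in reversed(order):  # descendants are processed before their parent
        kids = children.get(u, [])
        size[u] = sum(size[v] for v in kids) if kids else 1
    return size


def light_nodes(children, root):
    """Return (set of light nodes, size map). The root is light; at every
    internal node the child with the most leaves is heavy, the rest are light."""
    size = leaf_counts(children, root)
    light = {root}
    for u, kids in children.items():
        if kids:
            heavy = max(kids, key=lambda v: size[v])
            light.update(v for v in kids if v != heavy)
    return light, size


# Hypothetical toy tree: 0 -> {1, 2}, 1 -> {3, 4}; leaves are 2, 3 and 4.
TREE = {0: [1, 2], 1: [3, 4]}
LIGHT, SIZE = light_nodes(TREE, 0)
LIGHT_SIZE_SUM = sum(SIZE[v] for v in LIGHT)
```

Here `LIGHT` is `{0, 2, 4}` and `LIGHT_SIZE_SUM` is 5. A light child has at most half as many leaves as its parent, which is what limits the number of light nodes on any root-to-leaf path and, in turn, the sum of size(·) over light nodes.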

An O(n)-Word Data Structure
This section is dedicated to proving Theorem 2. For simplicity, we only focus on the computation of the length ℓ of the output (p, ℓ).
Let SA be the suffix array of string T of length n, which is a permutation of {1, …, n}, such that SA[i] = j if T[j, n] is the ith lexicographically smallest suffix of T [37]. Further, let SA^{−1} be the inverse suffix array of string T of length n, which is a permutation of {1, …, n}, such that SA^{−1}[SA[i]] = i. Moreover, SA of T can be constructed in linear time and space [38,39].
We observe that an O(β − α + 1)-time solution is straightforward with the aid of the suffix tree of T as follows. First, identify those leaves corresponding to the suffixes starting within [α, β] using the inverse suffix array of T and mark them. Then, for each marked leaf, identify its lowest ancestor node (and double mark it), such that another marked leaf is under it. This can be done via at most two O(1)-time Lowest Common Ancestor (LCA) queries over the suffix tree of T while using O(n) additional space [22]. Afterwards, find the minimum over the string-depth of all double-marked nodes, add 1 to it, and report it as the length ℓ. The correctness is readily verified.
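The idea behind this procedure can be sketched without a suffix tree. In the sketch below, which is a simplified stand-in and not the O(β − α + 1)-time method itself, the marked leaves are obtained by sorting the suffixes that start in [α, β], and the string depth of the lowest ancestor shared with another marked leaf is obtained as the larger lcp value with the two neighbours in that sorted order. Sorting and character-by-character lcp computation replace the inverse suffix array and the LCA queries, so the running time is worse. The function names are hypothetical, and since no terminal letter $ is appended, starts whose substring would run past the end of T are skipped.

```python
def lcp(T, i, j):
    """Length of the longest common prefix of T[i, n] and T[j, n] (1-based)."""
    n, length = len(T), 0
    while i + length <= n and j + length <= n and T[i + length - 1] == T[j + length - 1]:
        length += 1
    return length


def rsus_small_range(T, alpha, beta):
    """Return (p, l) for the query [alpha, beta] by examining every start in the range."""
    n = len(T)
    # "Marked leaves": starts in [alpha, beta], ordered as in the suffix array.
    marked = sorted(range(alpha, beta + 1), key=lambda k: T[k - 1:])
    best = None
    for idx, k in enumerate(marked):
        depth = 0  # string depth of the lowest ancestor shared with another marked leaf
        if idx > 0:
            depth = max(depth, lcp(T, k, marked[idx - 1]))
        if idx + 1 < len(marked):
            depth = max(depth, lcp(T, k, marked[idx + 1]))
        length = depth + 1
        # Keep k only if T[k, k+length-1] fits inside T (no sentinel is used here).
        if k + length - 1 <= n and (best is None or (length, k) < best):
            best = (length, k)
    return best[1], best[0]
```

On the hypothetical string `"caabcaddaacaddaaaabac"`, `rsus_small_range(T, 5, 16)` returns `(10, 2)`.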
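We employ the above procedure when β − α + 1 < 3∆, where ∆ is a parameter to be set later. We now consider the case when β − α + 1 ≥ 3∆. Note that ℓ is the smallest element in

S* = {λ(α, β, k) | k ∈ [α, β]}.

Let α′ be the smallest number after α and β′ be the largest number before β, such that α′ and β′ are multiples of ∆. Subsequently, S* can be written as the union of

S″ = {λ(α, β, k) | k ∈ [α, α′ − 1] ∪ [β′ + 1, β]} and S′ = {λ(α, β, k) | k ∈ [α′, β′]}.

Furthermore, S′ can be written as S1 ∪ S2 ∪ S3, where

S1 = {h = λ(α, β, k) | k ∈ [α′, β′], Prev(k, h) ∈ [α − ∆, α − 1]},
S2 = {h = λ(α, β, k) | k ∈ [α′, β′], Next(k, h) ∈ [β + 1, β + ∆]},
S3 = {h = λ(α, β, k) | k ∈ [α′, β′], Prev(k, h) < α′ − ∆ and Next(k, h) > β′ + ∆}.

Our construction is based on a solution to the Orthogonal Range Predecessor/Successor in 2-d problem. A set P of n points in an [1, n] × [1, n] grid can be preprocessed into a linear-space data structure, such that the following queries can be answered in O(log^ε n) time per query [40]: given an axis-parallel query rectangle, report the point of P inside the rectangle with the smallest (successor) or the largest (predecessor) coordinate along a chosen axis. We next show how to maintain additional structures, so that the smallest element in each of the above sets can be efficiently computed and, thus, the smallest among them can be reported as ℓ.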

Figure 1. Illustration of the problem reduction: (k, h) is the output of the rSUS problem with query range [α, β], where h = λ(α, β, k) ∈ C_k. R_{k,h} is the lowest weighted rectangle in R containing the point (α, β).

• Computing the Smallest Element in S″: for each k ∈ [α, α′ − 1] ∪ [β′ + 1, β], we compute λ(α, β, k) and report the smallest among them. We handle each λ(α, β, k) query in time O(log^ε n), as follows: first find the leaf corresponding to the string position k in the suffix tree of T, then the last (resp., first) leaf on its left (resp., right) side, such that the string position x (resp., y) corresponding to it is in [α, β], and report 1 + max{lcp(k, x), lcp(k, y)}. To efficiently enable the computation of x (resp., y), we preprocess the suffix array into an O(n)-word data structure that can answer orthogonal range predecessor (resp., successor) queries in O(log^ε n) time [40].
• Computing the Smallest Element in S1: for each r ∈ [α − ∆, α − 1], we compute the smallest element in {h = λ(α, β, k) | k ∈ [α′, β′], Prev(k, h) = r} and report the smallest among them. The procedure is the following: find the leaf corresponding to the string position r in the suffix tree of T and the last (resp., first) leaf on its left (resp., right) side, such that its corresponding string position x (resp., y) is in [α′, β′] (via orthogonal range successor/predecessor queries as earlier). Subsequently, t = max{lcp(r, x), lcp(r, y)} is the length of the longest prefix of T[r, n] with an occurrence d in [α′, β′]. However, we need to verify whether occurrence d is unique and whether Prev(d, t) = r. For this, find the two leftmost occurrences of T[r, r + t − 1] after r, denoted by x′ and y′ (x′ < y′), via two orthogonal range successor queries. If y′ does not exist, set y′ = +∞. Then report λ(α, β, d) if α′ ≤ x′ ≤ β < y′. Otherwise, report +∞.
• Computing the Smallest Element in S2: for each r ∈ [β + 1, β + ∆], we compute the smallest element in {h = λ(α, β, k) | k ∈ [α′, β′], Next(k, h) = r} and report the smallest among them. The procedure is analogous to that of S1; i.e., find the length t of the longest prefix of T[r, n] with an occurrence d in [α′, β′]. Then, find the two rightmost occurrences of T[r, r + t − 1] before r, denoted by x′ and y′ (x′ < y′), via two orthogonal range predecessor queries. If x′ does not exist, set x′ = −∞. Subsequently, report λ(α, β, d) if x′ < α ≤ y′ ≤ β′. Otherwise, report +∞.
• Computing the Smallest Element in S3: the set S3 can be written as {λ(α′ − ∆, β′ + ∆, k) | k ∈ [α′, β′]}, which is now dependent only on α′, β′ and ∆. Therefore, our idea is to pre-compute and explicitly store the minimum element in {λ(a − ∆, b + ∆, k) | k ∈ [a, b]} for all (a, b) pairs, where both a and b are multiples of ∆, so that the desired answer can be retrieved in constant time. The additional space needed is O((n/∆)²).

We set ∆ = √n. The total space is then O(n) and the total time is O(∆ log^ε n) = O(√n log^ε n). Therefore, we arrive at Theorem 2.
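The block decomposition behind this trade-off can be illustrated with a small arithmetic sketch. It only splits a long query range into its block-aligned middle part and the two fringes, and counts the entries of the precomputed table; it does not implement the orthogonal range structures. The function name is hypothetical, and the sketch assumes that "after α" and "before β" include α and β themselves when they are already multiples of ∆.

```python
import math


def split_range(alpha, beta, delta):
    """Split [alpha, beta] into the aligned middle [a1, b1] and the fringe starts.
    a1 (resp. b1) is the smallest (resp. largest) multiple of delta in the range."""
    assert beta - alpha + 1 >= 3 * delta, "short ranges are handled directly"
    a1 = -(-alpha // delta) * delta  # round alpha up to a multiple of delta
    b1 = (beta // delta) * delta     # round beta down to a multiple of delta
    fringe = list(range(alpha, a1)) + list(range(b1 + 1, beta + 1))
    return a1, b1, fringe


n = 10_000
delta = math.isqrt(n)                        # delta = 100 for this n
a1, b1, fringe = split_range(1234, 5678, delta)
table_entries = (n // delta) ** 2            # one stored minimum per aligned pair (a, b)
```

With these hypothetical numbers, the aligned middle part is [1300, 5600], the fringe contains 144 starting positions (fewer than 2∆), and the table has 10,000 = n entries, which matches the O(n) space and the O(∆) number of individually handled positions.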