More Time-Space Tradeoffs for Finding a Shortest Unique Substring

Abstract: We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n(L/m) log L) time using random access to T, or in O(n(L/m) log^2(L) log log σ) time using O((L/m) log^2 L) sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in O(n^{1+ε} L/m) time using O(n^ε L/m) sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes O(nτ log σ log n) time to find an SUS using O(n/τ) words of workspace, where τ is a parameter.


Introduction
A shortest unique substring (SUS) of a given text T[1..n] is a substring containing a given position T[q] and occurring only once in T, such that every shorter substring containing T[q] occurs at least twice in T. For example, if T[1..11] = ABRACADABRA and q = 3, then T[3..5] = RAC is an SUS: it contains T[3], it occurs only once in T, and each of the shorter substrings containing T[3] (i.e., T[3] = R, T[2..3] = BR, and T[3..4] = RA) occurs at least twice in T. The problem of finding SUSs has attracted significant attention recently and many variants have been proposed (interval-SUSs, k-mismatch SUSs, palindromic-SUSs, and range-SUSs), so what we refer to simply as SUSs here are sometimes also called position-SUSs. In this paper, we are only interested in position-SUSs and k-mismatch SUSs, which we describe shortly. We refer readers to Abedin et al.'s very recent survey [1] for a more detailed discussion.
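The definition can be checked directly on the example with a brute-force sketch. It is only meant to make the definition concrete: it takes time far beyond any bound discussed in this paper, and the function names are illustrative rather than taken from any cited work.

```python
def occurrences(T, s):
    """Count (possibly overlapping) occurrences of s in T."""
    return sum(T.startswith(s, i) for i in range(len(T) - len(s) + 1))


def sus_brute_force(T, q):
    """Return every SUS for the 1-based position q as 1-based (start, end) pairs."""
    n = len(T)
    for length in range(1, n + 1):
        found = []
        # Only substrings of this length that contain position q are candidates.
        for start in range(max(1, q - length + 1), min(q, n - length + 1) + 1):
            if occurrences(T, T[start - 1:start - 1 + length]) == 1:
                found.append((start, start + length - 1))
        if found:  # the first length with a unique candidate is the SUS length
            return found
    return []


print(sus_brute_force("ABRACADABRA", 3))  # [(3, 5)], i.e., T[3..5] = RAC
```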
Finding exact and approximate SUSs has several applications in bioinformatics, including alignment-free genome comparison, PCR primer design, and the identification of DNA signatures distinguishing closely related organisms [2][3][4]. Pei, Wu, and Yeh [3] gave the definition of SUSs above, along with an O(n)-time and O(n)-space algorithm for finding an SUS for T[q] given q, and an O(n^2)-time, O(n)-space construction algorithm for an O(n)-space data structure that in O(1) time returns the endpoints of an SUS for T[q] given q. Hu, Pei, and Tao [5], İleri, Külekci, and Xu [6], and Tsuruta, Inenaga, Bannai, and Takeda [7] independently improved the construction time to O(n). Belazzougui and Cunial [8] reduced both the construction space and the space of the data structure to O(n log σ / log n) at the cost of increasing the construction time to O(n log σ), where σ is the size of the alphabet, while keeping the query time constant. Ganguly et al. [9] gave the following time-space tradeoffs for finding SUSs, assuming we have random access to T and τ is a parameter:
• given q, we can find an SUS for T[q] in O(nτ^2 log(n/τ)) time and O(n/τ) space;
• in O(nτ^2 log n) time and O(n/τ) words plus 4n + o(n) bits of space, we can build a (4n + o(n))-bit data structure answering SUS queries in O(1) time; and
• we can change the running time in both cases to O(nτ log^{c+1} n) at the cost of increasing the space by an additive O(n/log^c n) and allowing for a low probability that the substrings are not shortest.
Recently, Senanayaka [10] gave a simple low-memory randomized algorithm based on Karp-Rabin pattern matching, but did not give theoretical bounds for it.
Hon, Thankachan, and Xu [11] generalized the problem by defining a k-mismatch SUS to be a shortest substring containing T[q] that is not only unique, but also not within Hamming distance k of any other substring of T. For example, if T[1..11] = ABRACADABRA and q = 5, then T[3..5] = RAC is a 1-mismatch SUS, because no other substring is within Hamming distance 1 of it and each of the shorter substrings containing T[5] (i.e., T[5] = C, T[4..5] = AC, and T[5..6] = CA) is within Hamming distance 1 of some other substring. On the other hand, T[4..6] = ACA is not a 1-mismatch SUS, although it also has length 3, because the Hamming distance between ACA and ADA is 1.
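The k-mismatch definition can be checked on the same example with a brute-force sketch. This is an illustration of the definition only, not one of the algorithms discussed in this paper, and the function names are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(c != d for c, d in zip(a, b))


def k_mismatch_sus(T, q, k):
    """Return every k-mismatch SUS for 1-based position q as 1-based (start, end) pairs."""
    n = len(T)
    for length in range(1, n + 1):
        found = []
        for start in range(max(1, q - length + 1), min(q, n - length + 1) + 1):
            s = T[start - 1:start - 1 + length]
            # The candidate fails if any *other* substring of the same length
            # is within Hamming distance k of it.
            close = any(i != start - 1 and hamming(s, T[i:i + length]) <= k
                        for i in range(n - length + 1))
            if not close:
                found.append((start, start + length - 1))
        if found:
            return found
    return []


print(k_mismatch_sus("ABRACADABRA", 5, 1))  # [(3, 5), (5, 7)]
```

On this input the sketch reports RAC, as in the example, and also T[5..7] = CAD, while ACA is rejected because of ADA.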
Hon et al. gave an O(n^2)-time, O(n)-space construction of an O(n)-space data structure, which, given q, in O(1) time returns the endpoints of a k-mismatch SUS for T[q]. Allen, Thankachan, and Xu [12] reduced the construction time to O(n log^k n) at the cost of increasing the construction space to O(kn), and Schultz and Xu [13] gave a GPU algorithm that is fast in practice.
In Section 2, we show how Senanayaka's approach can be extended such that, given q, m words of workspace, and random access to T, with high probability we can find an SUS containing T[q] in O((L/m + 1)n log L) time. In Section 3, we show that, by replacing Karp-Rabin pattern matching with a result on sketching by Golan and Porat [14], we can use O((L/m + 1) log^2 L) sequential passes over T instead of random access, at the cost of increasing the time to O((L/m + 1)n log^2 L) and requiring m = ω(log L). Replacing Golan and Porat's result by one by Gawrychowski and Starikovskaya [15], for constant k, we can find a k-mismatch SUS for T[q] in O(n^{1+ε} L/m) time using O(n^ε L/m) sequential passes over T, now requiring m = ω(n^ε), where ε is a positive constant. Although the sketching results that we rely on are too sophisticated for us to explain here, and too recent for us to be able to refer to any source other than Golan and Porat's and Gawrychowski and Starikovskaya's papers themselves, we use them only as black boxes, without relying on the details of how they work.
In Section 4, we describe a deterministic algorithm that makes use of directed acyclic word graphs (DAWGs) [16], the Crochemore-Perrin pattern matching algorithm [17], and suffix trees [18][19][20], to compute an SUS in O(nτ log σ log n) time using O(n/τ) words of workspace, improving Ganguly et al.'s result when τ log(n/τ) = ω(log σ log n). Finally, in Section 5 we discuss some possible directions for future work. Table 1 summarizes known bounds for finding SUSs and k-mismatch SUSs, including those that we give in this paper.
Table 1. Previous bounds and our results for finding SUSs (the first twelve results) and k-mismatch SUSs (the last three), where n is the length of the text, σ is the alphabet size, L is the length of the SUS or k-mismatch SUS, and m, τ < n, and c are parameters. Theorem 3 is Monte Carlo and we require m = ω(n^ε), so that the probability of failure can be made inversely proportional to any fixed polynomial of n. All of the data structures return an SUS given q in O(1) time and have the same final space as construction space, except that the construction-space bounds for Ganguly et al.'s third and fourth results are O(n/τ) words plus 4n + o(n) bits of space, and O(n/τ + n/log^c n) words plus 4n + o(n) bits.

Tradeoffs with Karp-Rabin Pattern Matching
If we know an SUS for T[q] has length at most L, then we can search T for repetitions of the substrings of T[q − L + 1..q + L − 1], using O(L) words of workspace and O(n) time. To do this, we build a suffix tree [18][19][20] for T[q − L + 1..q + L − 1] and scan T, always descending in the suffix tree as much as possible and then following a suffix link. Suppose that, at some point, we have just read T[i] and we are at string depth d in the suffix tree. If T[i − d + 1..i] is not completely contained in T[q − L + 1..q + L − 1] and d is the largest string depth we have reached along the edge we are currently descending, then we mark with d the node below that edge. After we have scanned T, we can extract from the marked suffix tree the shortest string such that:
• its locus is a leaf (meaning it occurs only once in T[q − L + 1..q + L − 1]),
• that leaf's label is at most q and its label plus the string's length is at least q (so an occurrence of the string contains T[q]), and
• that leaf is not marked with a number greater than or equal to the string's length (meaning we have not seen a copy of the string elsewhere in T).
If we do not know L, then we can find it via exponential search, still using O(L) words of workspace, but O(n log L) time and O(log L) sequential passes over T. Since finding an SUS is relatively easy if we can use workspace proportional to its length, even if that length is unknown, in this paper we assume that we have less workspace. Obviously, we can find an SUS containing T[q] while using O(1) words of workspace if we are willing to spend O(L^3 n) time, where L is the length of that SUS. For example, we can use the simple Algorithm 1. This algorithm can easily be improved to take O(L^2 n log L) time by replacing the linear search for L with an exponential search. It can then be further improved to take O(Ln log L) time with high probability, while still using O(1) words of workspace, by replacing naïve pattern matching with Karp-Rabin pattern matching. The resulting randomized algorithm can be either Monte Carlo if we do not verify matches or Las Vegas if we do, and the probability of failure can be made inversely proportional to any fixed polynomial of n without changing the asymptotic bounds.
If we allow ourselves m words of workspace, then we can make the algorithm run in O((L/m + 1)n log L) time with high probability. To do this, when searching for repetitions of the substrings of length ℓ that contain T[q], we process them in ℓ/m′ batches of size m′, where m′ = Θ(m) depends on the ratio of the word size to log_2 n and the power of n we want in the denominator in the probability of failure. We note that we compute the hashes of the substrings in the same batch by rolling them, in O(L) total time, rather than computing each of them from scratch, which would take O(L^2) total time.
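The batched search can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's Algorithm 1 or a tuned implementation: the modulus, the way the base is drawn, the 0-based indexing, and all function names are illustrative choices, and every fingerprint match is verified character by character (the Las Vegas variant), which needs random access to T.

```python
import random

MOD = (1 << 61) - 1  # illustrative modulus (a Mersenne prime); an assumption


def _fingerprint(T, start, ell, base):
    h = 0
    for c in T[start:start + ell]:
        h = (h * base + ord(c)) % MOD
    return h


def _roll(h, out_c, in_c, top, base):
    # Slide the window one step: drop out_c on the left, append in_c on the right.
    return ((h - ord(out_c) * top) * base + ord(in_c)) % MOD


def unique_of_length(T, q, ell, m, base):
    """Start (0-based) of a unique length-ell substring covering index q, or None."""
    n = len(T)
    lo, hi = max(0, q - ell + 1), min(q, n - ell)
    top = pow(base, ell - 1, MOD)
    for b in range(lo, hi + 1, m):              # one batch of at most m candidates
        batch = range(b, min(b + m, hi + 1))
        table, h = {}, _fingerprint(T, b, ell, base)
        for s in batch:                         # roll the candidates' fingerprints
            table.setdefault(h, []).append(s)
            if s + ell < n:
                h = _roll(h, T[s], T[s + ell], top, base)
        repeated = set()
        h = _fingerprint(T, 0, ell, base)
        for i in range(n - ell + 1):            # roll one window over all of T
            for s in table.get(h, ()):
                if s != i and T[i:i + ell] == T[s:s + ell]:  # verify the match
                    repeated.add(s)
            if i + ell < n:
                h = _roll(h, T[i], T[i + ell], top, base)
        for s in batch:
            if s not in repeated:
                return s
    return None


def find_sus(T, q, m, seed=0):
    """Return 0-based inclusive (start, end) of an SUS covering index q."""
    base = random.Random(seed).randrange(256, MOD - 1)
    n, fail, ell = len(T), 0, 1
    while unique_of_length(T, q, ell, m, base) is None:   # exponential phase
        fail, ell = ell, min(2 * ell, n)
    lo, hi = fail + 1, ell
    while lo < hi:                                        # binary phase
        mid = (lo + hi) // 2
        if unique_of_length(T, q, mid, m, base) is None:
            lo = mid + 1
        else:
            hi = mid
    s = unique_of_length(T, q, lo, m, base)
    return s, s + lo - 1


print(find_sus("ABRACADABRA", 2, 2))  # (2, 4), i.e., RAC in 0-based indexing
```

The search over lengths is valid because a unique substring of length ℓ < n covering a position can always be extended by one character to a unique substring of length ℓ + 1 covering the same position.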

Tradeoffs with Sketching
For Karp-Rabin pattern matching, we must keep track of characters leaving a sliding window, for which we need either enough memory to store the contents of the sliding window (which, by assumption, we do not have) or random access to T. However, Golan and Porat [14] gave a Monte-Carlo randomized sketching algorithm that takes d patterns of maximum length ℓ, scans T one character at a time, and, for each position, reports the longest pattern with an occurrence ending at that position with probability of failure inversely proportional to any fixed polynomial in n. Their algorithm uses O(log log σ) time per character of T and O(d log ℓ) space and does not use a sliding window, so it needs only sequential access to T. Replacing Karp-Rabin pattern matching with Golan and Porat's result and searching for substrings of length ℓ in batches of size Θ(m/log ℓ), so that we stay within our workspace bound m, we obtain the following result:
Theorem 2. With m = ω(log L) words of workspace, with high probability we can find an SUS for T[q] in O(n(L/m) log^2(L) log log σ) time using O((L/m) log^2 L) sequential passes over T.
Because Golan and Porat's algorithm is Monte Carlo, so is our result; unlike Theorem 1, we cannot easily make it Las Vegas, since verifying matches requires random access. Again, our batch size depends on the ratio of the word size to log_2 n and the power of n we want in the denominator in the probability of failure. The requirement that m = ω(log L) means the probability of failure can still be made inversely proportional to any fixed polynomial of n.
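One way to account for the pass and time bounds of Theorem 2 is sketched below. It assumes that the search for L tests O(log L) candidate lengths ℓ ≤ 2L (an exponential phase followed by a binary phase), that each batch costs one sequential pass, and it ignores the additive term for rounding the number of batches up; these accounting choices are assumptions rather than details given in the text.

```latex
\begin{align*}
\text{batches for one length } \ell
  &= O\!\left(\frac{\ell}{m/\log \ell}\right)
   = O\!\left(\frac{\ell \log \ell}{m}\right), \\
\text{passes over all tested lengths}
  &= O(\log L)\cdot O\!\left(\frac{L \log L}{m}\right)
   = O\!\left(\frac{L}{m}\,\log^{2} L\right), \\
\text{total time}
  &= O\!\left(\frac{L}{m}\,\log^{2} L\right)\cdot O(n \log\log\sigma)
   = O\!\left(n\,\frac{L}{m}\,\log^{2}(L)\,\log\log\sigma\right).
\end{align*}
```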
Gawrychowski and Starikovskaya [15] considered a harder version of the problem Golan and Porat studied, in which we are given a distance k and, for each position in T, we should report all of the patterns within Hamming distance k of substrings of T ending at that position. They gave a Monte-Carlo randomized sketching algorithm that searches for d patterns of length at most ℓ using O(k log^k d polylog(ℓ) + occ) time per character of T, where occ is the number of matches reported ending at that character, and O(kd log^k d polylog(ℓ)) space. Replacing Golan and Porat's algorithm with Gawrychowski and Starikovskaya's and searching for substrings of length ℓ in batches of size Θ(m/n^ε) for some positive constant ε, so that we stay within our workspace bound m for constant k, we obtain the following result:
Theorem 3. For constant k and m = ω(n^ε) words of workspace, with high probability we can find a k-mismatch SUS for T[q] in O(n^{1+ε} L/m) time using O(n^ε L/m) sequential passes over T.
Proof of Lemma 1. There are at most ℓ possible substrings of length ℓ covering p that could be unique. We separate these substrings into ⌈ℓ/m⌉ batches of m substrings with adjacent starting positions. Note that there may be one remainder batch with fewer than m substrings.
Consider a batch B containing k substrings of length ℓ, where 1 ≤ k ≤ m. Let B_i denote the substring in B with the ith adjacent starting position in B. Thus, B_1 is the leftmost substring in B and B_k is the rightmost substring in B.
Consider the following three substrings x, y, and z. It is useful to think of x as the prefix of B_1 that does not overlap with B_k in T. Similarly, z is the suffix of B_k that does not overlap with B_1. Finally, y is the suffix of B_1 and the prefix of B_k that overlap with each other in T. Note that any substring in B is equal to x_suff · y · z_pref, where x_suff is some suffix of x, z_pref is some prefix of z, and · denotes concatenation. For each batch, we will use the substrings x, y, and z to determine whether any of the k substrings in the batch occur elsewhere in T.
We can scan T to enumerate all occurrences of suffixes of x, all occurrences of y, and all occurrences of prefixes of z. Using this information, we can determine if any of the k substrings in the batch occur elsewhere in T and are therefore not unique. Suppose that we search text T and we find an occurrence of a suffix q of x, an occurrence r of y, and an occurrence of a prefix s of z. If q = T[i..j], r = T[j + 1..j + 1 + ℓ − k], and s = T[j + 2 + ℓ − k..ℓ + i − 1] for some i, j ∈ [1, n], i ≤ j, then we have found an occurrence of a substring in batch B in T. Specifically, if qrs = B_a for some B_a ∈ B, then B_a is not unique in T.
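The decomposition of a batch can be made concrete with a short sketch. The index conventions here are assumptions chosen for illustration: the batch consists of the k substrings of length ℓ starting at 0-based positions s, s + 1, ..., s + k − 1, with k ≤ ℓ so that B_1 and B_k overlap, which gives |x| = |z| = k − 1 and |y| = ℓ − k + 1.

```python
def decompose_batch(T, s, k, ell):
    """Split a batch into (x, y, z); assumes k <= ell and s + ell + k - 1 <= len(T)."""
    x = T[s:s + k - 1]                  # prefix of B_1 not overlapping B_k
    y = T[s + k - 1:s + ell]            # part shared by every substring in the batch
    z = T[s + ell:s + ell + k - 1]      # suffix of B_k not overlapping B_1
    return x, y, z


def batch_member(x, y, z, i):
    """Rebuild B_i (1-based) as x_suff . y . z_pref."""
    return x[i - 1:] + y + z[:i - 1]


x, y, z = decompose_batch("ABRACADABRA", 0, 3, 4)
print(x, y, z)                                          # AB RA CA
print([batch_member(x, y, z, i) for i in (1, 2, 3)])    # ['ABRA', 'BRAC', 'RACA']
```

Here B_i uses the suffix of x of length k − i and the prefix of z of length i − 1, so every member of the batch has length ℓ.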

Finding Occurrences of Suffixes of X
To enumerate all occurrences of suffixes of x in T, we construct the Directed Acyclic Word Graph (DAWG) [16] D(x$) with suffix links. D(x$) is the smallest deterministic automaton that accepts all suffixes of x$, and it is known to have the following properties:
1. Each edge is labeled by a single symbol, and the labels of all outgoing edges from a given node are distinct. The total number of nodes and edges is linear in the length of x$.
2. For any node u, let P_u denote the set of strings that can be created by concatenating the labels of any path from the root to u. Then P_u ⊆ Substr(x$), and for any strings p_1, p_2 ∈ P_u, the sets of ending positions of occurrences of p_1 and p_2 in x$ are equal, i.e., P_u = {p_u[1..|p_u|], ..., p_u[k..|p_u|]} for some p_u (the longest element of P_u) and 1 ≤ k ≤ |p_u|.
3. The suffix link l(u) of a node u points to a node v such that the longest element of P_v is the longest proper suffix of p′_u, where p′_u is the shortest element of P_u.
For technical convenience, we can also consider an auxiliary node that the suffix link of the root points to, which has outgoing edges for all symbols to the root. D(x$) can be built in O(m log σ) time and O(m) space [16]. Because the suffix links form a tree, we can also process the suffix link tree in linear time so that each node u holds a pointer l′(u) to the deepest ancestor v of u (possibly v = u) that has $ as an outgoing edge, i.e., the longest element p_v of P_v is the longest suffix of any p ∈ P_u that is a suffix of x.
Using D(x$), we can incrementally compute, for each j = 1, ..., n, the position in the DAWG that corresponds to the longest suffix T[i..j] of T[1..j] that is a substring of x, in O(n log σ) overall time. Initially, i = 1, and we start at the root of D(x$). For each character T[j], for j = 1, ..., n, and current node u, we traverse D(x$) by following the edge labelled T[j] from u or, if that edge does not exist, by trying again after following the suffix link to l(u). When the suffix link is traversed, i is increased so that the length of T[i..j − 1] matches the length of the longest element in P_{l(u)} (−1 for the auxiliary node, and 0 for the root node).
If, upon reading character T[j*], we arrive at a node u, then v = l′(u) points to the (possibly empty) longest suffix T[i*..j*] of T[1..j*] that is a suffix of x, so we have detected suffixes of x with lengths 0..j* − i* + 1 starting at positions i ∈ [i*, j*] and ending at position j* of T. In this manner, we can find occurrences of all suffixes of x in T in O(n log σ) time.
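The scan can be sketched with a standard suffix automaton of x, which plays the role of the DAWG here. Two simplifications are assumptions of this sketch rather than the construction in the text: terminal states stand in for nodes with an outgoing $ edge, and states are sorted by length (not processed in linear time) to fill in the analogue of the pointer l′(u). All names are illustrative.

```python
def build_suffix_automaton(x):
    """Standard online construction; returns (transitions, links, lengths, terminal)."""
    nxt, link, length, last = [{}], [-1], [0], 0
    for ch in x:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:  # split q by cloning it
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = link[cur] = clone
        last = cur
    terminal = [False] * len(nxt)
    p = last
    while p != -1:  # states whose strings are suffixes of x
        terminal[p] = True
        p = link[p]
    return nxt, link, length, terminal


def longest_x_suffix_ending_at(T, x):
    """For each 0-based j, the length of the longest suffix of T[0..j] that is a suffix of x."""
    nxt, link, length, terminal = build_suffix_automaton(x)
    # term_len[u]: longest suffix of x among the strings of u and its suffix-link ancestors.
    term_len = [0] * len(nxt)
    for u in sorted(range(1, len(nxt)), key=lambda v: length[v]):
        term_len[u] = length[u] if terminal[u] else term_len[link[u]]
    out, u, cur = [], 0, 0
    for ch in T:
        while u != 0 and ch not in nxt[u]:
            u = link[u]
            cur = length[u]
        if ch in nxt[u]:
            u, cur = nxt[u][ch], cur + 1
        else:
            cur = 0
        # If u is terminal, the whole current match is a suffix of x.
        out.append(cur if terminal[u] else term_len[link[u]])
    return out


print(longest_x_suffix_ending_at("ABRACADABRA", "AB"))
# [0, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0]
```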

Finding Occurrences of Y
To enumerate all the occurrences of y in T, we preprocess y in O(ℓ − m) time and constant space using the Crochemore-Perrin preprocessing algorithm [17]. We can then find all occurrences of y in T in O(n) time and constant space.

Finding Occurrences of Prefixes of Z
To enumerate all occurrences of prefixes of z in T, we construct the suffix tree [18][19][20] S(z$) with suffix links. Note that S(z$) contains exactly one node z′ such that path(z′) = z$. For each explicit node in S(z$), we add a special pointer to the closest ancestor that is on the path from the root to z′. This takes O(m log σ) time and O(m) space. We start at the root of S(z$) and, for i = 1, 2, ..., n, we follow the edge labelled T[i] until we reach a node u that has no outgoing edge labelled T[i].
We then use the special pointer at node u to find the closest ancestor v of u that is on the path to z′. Then path(v) yields the longest prefix of z at position i = 1 of T. Furthermore, every prefix of path(v) is also a prefix of z, so we have detected prefixes of z with lengths 0..|path(v)| all starting at position i. To find the longest prefix of z at position i + 1 of T, we follow the suffix link of the node u we ended on for input T[i] and repeat this process. In this manner, we can enumerate all the occurrences of all prefixes of z in T in O(n log σ) time.
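The quantity computed by this scan, namely the length of the longest prefix of z starting at each position of T, can be illustrated with a Z-function. This is a swapped-in technique for illustration, not the suffix-tree method described above: it concatenates z, a sentinel, and T, so it uses O(n) extra space and does not respect the O(m) workspace bound. The function names are hypothetical.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:], with z[0] = 0."""
    n = len(s)
    z = [0] * n
    l = r = 0
    for i in range(1, n):
        if i < r:
            z[i] = min(r - i, z[i - l])   # reuse information from the current Z-box
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return z


def longest_z_prefix_starting_at(T, z):
    """For each 0-based i, the length of the longest prefix of z that starts at T[i]."""
    # None is a sentinel that cannot equal any character, so values never exceed len(z).
    seq = list(z) + [None] + list(T)
    return z_function(seq)[len(z) + 1:]


print(longest_z_prefix_starting_at("ABRACADABRA", "CA"))
# [0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0]
```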

Putting the Occurrences Together
We now determine which substrings in the batch B are not unique in T. Notice that, for any 1 ≤ i_1 < i_2 < i_3 ≤ k, occurrences of B_{i_1} and B_{i_3} that share the same occurrence of y imply an occurrence of B_{i_2} also sharing that occurrence of y. Therefore, we maintain an integer array R of size k + 1, where all elements are initially 0, in order to record the start and end of ranges in B that have been found to occur in T.
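The range bookkeeping can be sketched as a difference array. The sketch assumes that a batch of k substrings of length ℓ is split into x, y, and z with |x| = |z| = k − 1, so that B_i consists of the suffix of x of length k − i, then y, then the prefix of z of length i − 1. It also assumes that, for each occurrence of y, we know the length a of the longest suffix of x ending just before it and the length b of the longest prefix of z starting just after it; the helper naive_hits computes these by brute force as a stand-in for the three synchronized scans. All names are illustrative.

```python
def batch_occurrence_counts(k, hits):
    """Count occurrences in T of each B_1..B_k from (a, b) pairs, one per occurrence of y."""
    R = [0] * (k + 1)                      # difference array over batch indices
    for a, b in hits:
        lo, hi = k - a, b + 1              # B_i occurs here iff lo <= i <= hi (1-based)
        if lo <= hi:
            R[lo - 1] += 1                 # range starts at B_lo
            R[hi] -= 1                     # range ends after B_hi
    counts, running = [], 0
    for i in range(k):                     # prefix sums recover per-member counts
        running += R[i]
        counts.append(running)
    return counts


def naive_hits(T, x, y, z):
    """Brute-force stand-in for the three scans: (a, b) for every occurrence of y."""
    hits = []
    for p in range(len(T) - len(y) + 1):
        if T.startswith(y, p):
            e = p + len(y)
            a = max(L for L in range(min(p, len(x)) + 1)
                    if T[p - L:p] == x[len(x) - L:])
            b = max(L for L in range(min(len(T) - e, len(z)) + 1)
                    if T[e:e + L] == z[:L])
            hits.append((a, b))
    return hits


def counts_for_batch(T, s, k, ell):
    """Occurrence counts for the batch starting at 0-based s; assumes k <= ell."""
    x, y, z = T[s:s + k - 1], T[s + k - 1:s + ell], T[s + ell:s + ell + k - 1]
    return batch_occurrence_counts(k, naive_hits(T, x, y, z))


print(counts_for_batch("ABRACADABRA", 0, 3, 4))  # [2, 1, 1] for ABRA, BRAC, RACA
```

Because the batch's own position in T is also found by the scan, a member of the batch is unique exactly when its count is 1.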
We use the DAWG D(x$), the Crochemore-Perrin algorithm for y, and the suffix tree S(z$), and maintain three parallel scans on T, shifted so that the three parts are detected in sync. More precisely, at position i, we maintain the following:
Theorem 4. There is a deterministic algorithm that computes the shortest unique substring (SUS) of a text T[1..n] that covers some query position p chosen at runtime in O(n/τ) words of workspace and O(nτ log σ log n) time.

Proof of Theorem 4.
If there is a unique substring in T with length ℓ and ℓ < n, then there is a unique substring in T with length greater than ℓ. This property lets us use exponential search over the length of the shortest unique substring (SUS), with Lemma 1 as a sub-algorithm, in order to find the SUS that covers a query position p in O(m) words and O((L/m + 1)n log σ log L) time, where L is the length of the SUS. Setting m = n/τ, this yields a time complexity of O((Lτ + n) log σ log L) ⊆ O(nτ log σ log n) ⊂ o(nτ^2 log σ log(n/τ)).
Algorithm 2 shows the pseudo-code.
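The substitution m = n/τ in the proof of Theorem 4 can be spelled out step by step. The only facts used are L ≤ n and, as an assumption of this sketch, τ ≥ 1.

```latex
\begin{align*}
O\!\left(\left(\tfrac{L}{m}+1\right) n \log\sigma \log L\right)
 &= O\!\left(\left(\tfrac{L\tau}{n}+1\right) n \log\sigma \log L\right)
    && \text{since } m = n/\tau \\
 &= O\!\left((L\tau + n)\,\log\sigma \log L\right) \\
 &\subseteq O\!\left((n\tau + n)\,\log\sigma \log n\right)
    && \text{since } L \le n \\
 &= O\!\left(n\tau \log\sigma \log n\right)
    && \text{since } \tau \ge 1 .
\end{align*}
```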