A Survey on Shortest Unique Substring Queries

Abstract: The shortest unique substring (SUS) problem is an active line of research in the field of string algorithms and has several applications in bioinformatics and information retrieval. The initial version of the problem was proposed by Pei et al. [ICDE’13]. Over the years, many variants and extensions have been pursued, including positional-SUS, interval-SUS, approximate-SUS, palindromic-SUS, and range-SUS. In this article, we highlight some of the key results and summarize the recent developments in this area.


Introduction
Let S be a string of length n and S[i, j] be the substring which starts at position i and ends at position j of S. The substring S[i, j] is a repeat if it occurs more than once in S; otherwise, it is a unique substring of S. Since finding shortest unique substrings is a non-trivial problem with several applications, many variants of this problem have been studied. Table 1 shows the variants of the Shortest Unique Substring (SUS) problem that we focus on in this survey.
In 2005, Haubold et al. [19] explained how the shortest unique substring is a useful construct for alignment-free genome comparison. Unique substrings can help to determine the distinctness and difference between a group of closely related organisms [1,19]. In addition, an algorithm for finding a unique substring can be helpful to build a unique genetic fingerprint from a DNA sample or can help in designing polymerase chain reaction (PCR) primers in molecular biology [1,20]. In 2015, Adas et al. [21] investigated the usage of shortest unique substrings for alignment and compression of DNA sequences. In addition to the applications in bioinformatics, shortest unique substrings can be used in information retrieval for document search.
The position-based shortest unique substring (position-SUS) query was first proposed by Pei et al. [1]. Given a string S of length n and a query point p in the string, the problem is to find a shortest unique substring covering p. They presented an algorithm which costs O(n^2) time and O(n) space. In 2014, several publications revisited this problem. İleri et al. [6] and Tsuruta et al. [13] proposed optimal O(n) time and space algorithms for solving this problem. In addition, Hon et al. [3] and Ganguly et al. [18] presented compact data structures that can answer a position-SUS query using less than 2n words of space, where the word size is Ω(log n) bits.
Several variations of the original problem have been proposed in the following years.
In 2014, Hu et al. [2] generalized the position-SUS problem by considering the positions in an interval rather than a single position in the string. For the interval-SUS problem, they preprocess the input string in O(n) time and space to answer any interval SUS query in constant time and return all SUSs in O(occ) time, where occ is the number of outputs. In 2019, Mieno et al. [14] improved the space complexity of this problem by providing a compact data structure.
In 2016, Mieno et al. [8] considered SUS queries on run-length encoded (RLE) strings. Their motivation was to reduce the space and time complexity of processing the input string. They showed how to construct a data structure of size O(m + π_s(N, m)) in O(m log m + π_c(N, m)) time that can answer interval-SUS queries in O(π_q(N, m) + occ) time, where occ is the output size, and π_s(N, m), π_c(N, m) and π_q(N, m) are the size, construction time and query time of a predecessor/successor data structure on m elements from a universe of [1, N].
In 2017, Hon et al. [3] proposed the approximate version of SUS queries, where mismatches are allowed. This version can be applied in computational biology, where factors such as genetic mutation and experimental error make approximate string matching necessary [9]. They presented an in-place algorithm for both the exact and approximate versions of the problem. Afterwards, different trade-offs were presented for the k-mismatch SUS problem [9,10,15,16].
In 2018, Inoue et al. [4] proposed a palindromic variant of the interval-SUS problem. A palindromic string or substring is an important structure in DNA, RNA or protein sequence analysis [22]. In biological data, palindromic structures show the ability of molecules to fold and form double-stranded stems [23]. Given a string S, a shortest unique palindromic substring (SUPS) for an interval [s, t] of S is the shortest substring that is palindromic, unique in S, and contains [s, t]. For solving the interval SUPS problem, they preprocess S in O(n) time and space to output SUPSs in O(occ + 1) time. There are other works on SUPS queries by Watanabe et al. [11,12], based on RLE, which are space-economical solutions.
In 2019, Abedin et al. [5] focused on the range version of SUS queries (range-SUS) and generalized the problem. Range queries are a classic data structure topic with strong motivation in string processing problems [5,24,25,26]. Given a range [α, β], the problem is to return a shortest substring with exactly one occurrence in [α, β]. They presented an O(n log n)-word data structure which answers range-SUS queries in O(log_w n) time per query in the word RAM model, where w = Ω(log n) is the word size [5].
In this survey, we are going to discuss all approaches mentioned above on SUS queries. The focus of this work is on techniques applicable to SUS queries and is to compare all the main results in terms of complexities, restrictions, problem definitions, motivations, and applications. The main papers that we are going to focus on have been categorized in Table 1. In the last section, we discuss the related open questions on the variant topics related to SUS queries.

Definitions
Let S[1, n] be a string of length n (i.e., |S| = n) over an alphabet Σ, and let S[i] be the ith character of S. The substring of S which starts at position i and ends at position j is denoted by S[i, j].
The k-mismatch left-bounded shortest unique substring (LSUS) starting at position p, denoted by LSUS_p, is a k-mismatch unique substring S[p, j] such that either p = j or no proper prefix of S[p, j] is k-mismatch unique. The k-mismatch RSUS and the k-mismatch MUS can be defined similarly.
The longest common prefix of two suffixes S[p, n] and S[q, n], denoted by LCP(S[p, n], S[q, n]), is the longest string that is a prefix of both S[p, n] and S[q, n]. The k-mismatch longest common prefix of S[p, n] and S[q, n], denoted by LCP_k(S[p, n], S[q, n]), is the longest pair of equal-length prefixes of the two suffixes whose Hamming distance is at most k.
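The two notions of longest common prefix can be made concrete with a minimal Python sketch. The function names `hamming` and `lcp_k` are hypothetical, positions are 0-based (unlike the 1-based convention of the text), and the code is a direct character-by-character scan rather than any of the efficient methods surveyed later.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two strings of equal length."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))


def lcp_k(s: str, p: int, q: int, k: int) -> int:
    """Length of the k-mismatch longest common prefix of s[p:] and s[q:].

    p and q are 0-based starting positions; k = 0 gives the ordinary LCP.
    """
    mismatches = 0
    length = 0
    while p + length < len(s) and q + length < len(s):
        if s[p + length] != s[q + length]:
            mismatches += 1
            if mismatches > k:
                break  # one mismatch too many: stop before this character
        length += 1
    return length
```

For instance, with `s = "abcabd"` the suffixes starting at 0-based positions 0 and 3 share the exact prefix `ab`, and they share a prefix of length 3 once one mismatch is allowed.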

Data Structures
The suffix tree of a string S[1, n] is a compact trie of the n suffixes of S appended with a letter $ ∉ Σ [27]. This suffix tree consists of n leaves (one for each suffix of S) and at most n − 1 internal nodes. The edges are labeled with substrings of S. We denote the suffix tree of string S by ST(S).
The suffix array of a string S of length n, denoted by SA, is a permutation of {1, ..., n} such that SA[i] = j if S[j, n] is the ith lexicographically smallest suffix of S. The inverse suffix array of S, denoted by SA^{-1}, is a permutation of {1, ..., n} such that SA^{-1}[SA[i]] = i. The SA of S can be constructed in linear time and space [28,29]. The longest common prefix (lcp) array of S is an integer array of length n that stores the lengths of the longest common prefixes of lexicographically adjacent suffixes of S.
Consider s as a subset of {1, 2, ..., n}. Then, s can be preprocessed into an O(|s|) space data structure such that, for any query p, we can return pred(p, s) and succ(p, s), the predecessor and successor of p in s, in O(log log n) time [31].
The eertree of a string S is a pair of two rooted trees T_odd and T_even that represent all distinct palindromic substrings of S [32]. T_odd and T_even store the palindromic substrings of odd and even length, respectively. There is a directed edge (r, a, v) from the root r of T_odd if v represents a single character a ∈ Σ. For any non-root node u in either T_odd or T_even, there is a labeled directed edge (u, a, v) from u to v with character label a if aua = v. No two out-going edges from a node have the same label.
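As an illustration of the array-based structures, the following Python sketch builds the suffix array, its inverse, and the lcp array by brute force. It is not one of the linear-time constructions cited above; indices are 0-based, and the convention that the first lcp entry is 0 is an assumption made here for the example.

```python
def suffix_array(s: str) -> list:
    """Starting positions (0-based) of the suffixes in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def inverse_suffix_array(sa: list) -> list:
    """isa[pos] = rank of the suffix starting at pos."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def lcp_array(s: str, sa: list) -> list:
    """lcp[r] = LCP length of the suffixes of rank r-1 and r (lcp[0] = 0)."""
    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        a, b = sa[r - 1], sa[r]
        h = 0
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1
        lcp[r] = h
    return lcp
```

For `s = "banana"`, the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, which gives the suffix array `[5, 3, 1, 0, 4, 2]`.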

Motivation
Consider the procedure that a search engine performs. Once a user submits a search query, all the related pages should be identified and ranked properly. An indexing process is needed to organize the information before any search query is served. Algorithms such as inverted indexing keep track of the documents with pointers to text elements. Modern search engines may show a snippet, a short summary of the content of a website, in the search results. Finding a proper length for a snippet is critical: a snippet that is too short does not make the text elements distinguishable, while one that is too long overwhelms users [1]. If the snippet for each search result is the shortest possible text that includes the query term and differs from all other snippets, the search would be optimized. Thus, providing a fast algorithm for finding a shortest snippet is crucial.
In addition to information retrieval purposes, there are motivations in bioinformatics. In 2005, Haubold et al. [19] explained how the shortest unique substring is a useful construct for alignment-free genome comparison. Unique substrings can help to determine the distinctness and difference between a group of closely related organisms [1,19]. Another application is in event analysis, when one wants to understand how an event differs from other events of the same type in a long sequence of historical events while extracting the context of the event. The shortest unique substring of the selected event may be helpful in such an analysis [1].
Regarding these motivations, Pei et al. [1] introduced the following problem.
Problem 1 (position-SUS). Given a string S of length n and a query position p, find a shortest unique substring of S covering p.
For example, for a query position p = 5, we need to find a shortest substring S[i, j] with exactly one occurrence in S such that i ≤ 5 ≤ j. One possible output is S[4, 6] = baa. Note that we may have multiple answers for Problem 1: S[5, 7] = aad is another output.
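To make the query semantics concrete, here is a brute-force Python reference for position-SUS queries. It is only a specification-level sketch (far slower than any algorithm in this survey); the function names and the example string `abaab` are illustrative choices and do not come from the surveyed papers. Positions are 1-based and inclusive, as in the text.

```python
def count_occurrences(s: str, t: str) -> int:
    """Number of (possibly overlapping) occurrences of t in s."""
    return sum(1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))


def position_sus_bruteforce(s: str, p: int) -> list:
    """All shortest unique substrings covering position p, as (i, j) pairs."""
    n = len(s)
    for length in range(1, n + 1):
        hits = []
        # start positions i such that i <= p <= i + length - 1 <= n
        for i in range(max(1, p - length + 1), min(p, n - length + 1) + 1):
            j = i + length - 1
            if count_occurrences(s, s[i - 1:j]) == 1:
                hits.append((i, j))
        if hits:
            return hits  # first length with a unique substring is the shortest
    return []
```

On `abaab` with p = 3, both S[2, 3] = ba and S[3, 4] = aa are returned, which illustrates that a position may have several shortest unique substrings.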

Suffix Trees Based Approach
Pei et al. [1] stated Problem 1 and presented an algorithm to answer this problem in O(n) time and space. They construct ST(S) in O(n) space and time [33]. Then, they use ST(S) to get LSUS(p) in constant time by the following steps:

• Find the leaf node corresponding to the suffix S[p, n].
• If the label of the leaf edge is $, then LSUS(p) does not exist and we return null; otherwise, we continue.
• Let l be the length of the label of the leaf edge (excluding $).
All of the above steps can be completed in O(1) time using the suffix tree properties. They make clever use of Lemma 1 and, consequently, they can find a SUS covering p by the following Lemma.
They start by assuming that LSUS(p) = S[i, j] is a candidate answer; then, they look for an LSUS(k), where k < p and k ≥ j − i, with the shortest length. For each such k, LSUS(k), extended to the right up to position p if necessary, should be considered as a candidate answer, since a candidate must cover p. This is called an extension of an LSUS. Thus, they always make sure that a new candidate covers p. At the end, since there may exist more than one answer, they output the leftmost SUS containing p.
In addition, they show how to preprocess S in O(n^2) time and O(n) space to compute the SUS corresponding to every position in the string. By doing so, SUS queries for any position can be answered in constant time. Their technique is based on the fact that each SUS should fall into one of MUS, LSUS, or RSUS. By this observation, the SUS corresponding to each position can be precomputed using their propagation procedure [1].
From the space complexity point of view, for each position p, their algorithm keeps track of a currently shortest MUS that covers p. The total space needed to store this information for all positions is O(n). At the end, they apply their algorithm to real data sets to show its effectiveness. Theorem 1 summarizes their result.
Theorem 1. The position-SUS problem can be answered in O(n^2) time and O(n) space for every location of a string of length n.

Linear Time Approaches
Ileri et al. [6,7] and Tsuruta et al. [13] independently improved the time complexity of Theorem 1 from O(n^2) to O(n). They showed that, by preprocessing the string S of length n in O(n) time, Problem 1 can be answered in constant time, which yields a linear total time complexity for all positions in the string [6,7,13]. First, we discuss Ileri et al.'s approach.

Ileri et al.'s Framework
Their framework includes two cases: finding a SUS for one position, and finding SUSs for every position. In the first case, they present an algorithm for finding all the SUSs covering a specific location p in O(n) time. Similar to the approach explained in Section 3.2, they observe that each SUS is an extension of an LSUS. Instead of using a suffix tree to find LSUSs, they use the inverse suffix array and the lcp array to compute LSUS(i) for i = 1, 2, ..., n, where null means that LSUS(i) does not exist. For each string position p, they simply compute LSUS(1), ..., LSUS(p) in O(p) time and maintain the shortest one for position p. Note that if, for some k < p, LSUS(k) does not cover p, we can extend it up to position p. In case of multiple shortest answers, they keep the leftmost candidate. Then, by Lemma 1, they prove their first theorem: for any location p in the string S, they can find a SUS covering p using O(n) time and space [6]; if multiple answers exist, the leftmost one is returned.
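The single-position procedure can be sketched in Python as follows. The defining equation for LSUS(i) is not reproduced in the text, so the rule used here is an assumption based on the standard observation that the shortest unique prefix of a suffix is one character longer than its longest common prefix with its two neighbours in suffix-array order. The arrays are built by brute force (not with the linear-time constructions the authors rely on), and the function names are hypothetical. Returned positions are 1-based and inclusive.

```python
def lsus_all(s: str) -> list:
    """LSUS(i) for every 1-based i, as (i, j) pairs, or None if absent."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])       # naive suffix array
    isa = [0] * n                                    # inverse suffix array
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    lcp = [0] * n                                    # lcp[r]: ranks r-1 and r
    for r in range(1, n):
        a, b, h = sa[r - 1], sa[r], 0
        while a + h < n and b + h < n and s[a + h] == s[b + h]:
            h += 1
        lcp[r] = h
    result = []
    for i in range(n):
        r = isa[i]
        # longest prefix of s[i:] that also occurs elsewhere (assumed rule)
        longest_repeat = max(lcp[r], lcp[r + 1] if r + 1 < n else 0)
        end = i + longest_repeat                     # 0-based last character
        result.append((i + 1, end + 1) if end < n else None)
    return result


def leftmost_sus(s: str, p: int):
    """Leftmost shortest unique substring covering 1-based position p."""
    best = None
    for cand in lsus_all(s)[:p]:                     # LSUS(1), ..., LSUS(p)
        if cand is None:
            continue
        i, j = cand[0], max(cand[1], p)              # extend up to p if needed
        if best is None or j - i < best[1] - best[0]:
            best = (i, j)                            # strict '<' keeps leftmost
    return best
```

On `abaab`, positions 4 and 5 have no LSUS because every substring starting there repeats, and the SUS for p = 5 is the extension S[3, 5] of LSUS(3) = S[3, 4].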
In the second case, they extend their algorithm to find the SUSs for every location p ∈ {1, 2, ..., n}. Instead of running the single-position algorithm n times, they use the following lemma to reduce the amortized cost of finding each SUS to O(1) time.

Lemma 2 ([6]). For any k ∈ {2, 3, ..., n}, if the SUS for position k is an extension of an LSUS, then (1) the SUS for position k − 1 must be a substring whose right boundary is the character S[k − 1], and (2) the SUS for position k is the SUS for position k − 1 appended by the character S[k].
For finding the SUS of every position, they begin with the SUS of the first position, which is the LSUS of that position; then, by Lemma 2, they compute the SUS of position k using the already calculated SUS of position k − 1. Their algorithm costs O(occ) time for reporting the SUSs covering a particular location. By providing an efficient constant-time algorithm for computing the shortest LSUS covering each string position, they prove their result for all positions. For the implementation, they use the libdivsufsort library for the suffix array and the lcp array. They compare their results with Tsuruta et al.'s work and show that both algorithms have almost the same processing time; however, their space usage is at least four times less for finding a single SUS and two times less for finding all SUSs.
Although the algorithm of Tsuruta et al. [13] follows a technique similar to that of Ileri et al. [6], their work was independent. Their main idea is finding SUSs from MUSs. They use the lcp array to compute all MUSs of S in linear time.

In Place and Compact Data Structures' Approaches
The suffix tree of a string S of length n occupies almost 20 times more space than S itself, which needs n log |Σ| bits. All the data structures discussed in Section 3.3 require Θ(n) words, where n is the length of the string S. When n is large, this memory usage becomes problematic. To avoid this issue, Hon et al. [3] proposed a data structure consisting of the input string and two integer arrays that store the starting and ending positions of SUSs. Their algorithm runs in linear time and uses O(log n) bits of additional working space. If we account for the space needed to store the SUSs of all positions, their algorithm takes the least amount of space: 2n words to maintain the starting and ending positions of the SUSs and another n bytes to store the input string. Previous works need O(n) space with a large hidden constant; that overhead does not exist in Hon et al.'s work. They use suffix arrays instead of suffix trees in their construction. Moreover, they handle all computations in place within the two integer arrays. As a result, their algorithm can find the position-SUS for every string position in O(n) time.
There was still a question of whether we can solve the SUS problem in sub-linear space. To answer this question, Ganguly et al. [18] presented the first time-space trade-off algorithm, which uses O(n/τ) words of additional space. Given a position p ∈ {1, ..., n}, their algorithm answers Problem 1 in O(nτ^2 log(n/τ)) time, where the parameter τ ≥ 1. Another query is reporting the SUSs for all positions of S. For this type of query, they present an O(nτ^2 log n) time algorithm using O(n/τ) words and 4n + o(n) bits of additional working space. In addition to these deterministic algorithms, they present a randomized algorithm whose time complexity is O(nτ log^{c+1} n) using n / log^c n additional words, where c ≥ 0 is an arbitrary constant. There is a chance of at most n^{-O(1)} that the reported substring is unique and covers the query position but is not the shortest one.
In their deterministic algorithms, the intuition is that each SUS for a position k is either the shortest unique prefix of S[i, n] for some i ≤ k, or the smallest right extension of such a prefix up to position k. For each suffix S[i, n], they define LS_i as the shortest unique prefix of S[i, n]. In order to reduce the space complexity, instead of storing LS_i for all S[i, n], i ∈ {1, ..., n}, they choose a set of O(n/τ) suffixes and compute the corresponding LSs. After computing the LSs of all the suffixes in the chosen set, they can compute LS_j for any suffix S[j, n] using a brute-force approach. Their results are summarized in Theorem 5.
Mieno et al. [14] presented a data structure of size (log_2 3 + 1)n + o(n) bits, which can answer Problem 1 in O(occ) time, where occ is the number of SUSs for the given query point. The main intuition of their algorithm is based on Lemma 3. Their data structure includes two bit arrays MB_S and ME_S, each of length n, to keep track of the starting and ending positions of the MUSs. As a result, they prove that there exists a data structure of size (log_2 3 + 1)n + o(n) bits that can report the SUSs in O(occ) time. Given the bit arrays MB_S and ME_S, the data structure can be constructed in O(n) time using 3n + n log_2 3 + o(n) bits of total working space.
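The two bit arrays can be illustrated with a small Python sketch. It assumes the usual definition of a minimal unique substring (a unique substring all of whose proper substrings are repeats), computes the MUSs by brute force, and then marks their starting and ending positions. The function names are hypothetical, and this is not the succinct construction of Mieno et al.; it only shows what the arrays contain.

```python
def occurrences(s: str, t: str) -> int:
    """Number of (possibly overlapping) occurrences of t in s."""
    return sum(1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))


def minimal_unique_substrings(s: str) -> list:
    """All MUSs of s as 1-based inclusive (i, j) pairs, sorted by start."""
    n = len(s)
    muss = []
    for i in range(n):
        for j in range(i, n):
            if occurrences(s, s[i:j + 1]) == 1:
                # s[i..j] is the shortest unique substring starting at i, so
                # dropping its last character gives a repeat; it is minimal
                # iff dropping its first character also gives a repeat.
                if j == i or occurrences(s, s[i + 1:j + 1]) > 1:
                    muss.append((i + 1, j + 1))
                break
    return muss


def mus_bit_arrays(n: int, muss: list) -> tuple:
    """Bit arrays marking the starting (MB) and ending (ME) positions."""
    mb, me = [0] * n, [0] * n
    for i, j in muss:
        mb[i - 1] = 1
        me[j - 1] = 1
    return mb, me
```

For `abaab`, the MUSs are S[2, 3] = ba and S[3, 4] = aa, so MB marks positions 2 and 3 while ME marks positions 3 and 4.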

Interval-SUS Queries
In this section, we discuss a generalization of the position-SUS problem, which is defined as follows:
Problem 2 (interval-SUS). Given a string S and a query interval [s, t], find a shortest unique substring S[i, j] of S such that i ≤ s ≤ t ≤ j.
For example, for an interval query [14, 16], we need to find a shortest substring S[i, j] with exactly one occurrence in S such that i ≤ 14 < 16 ≤ j. The output here is S[14, 17] = daaa, which includes S[14, 16] = daa and does not have any other occurrence in S.
Let M be the set of all MUSs in S. To present their data structure, Hu et al. [2] define a set of intervals denoted by I, such that [i, j] ∈ I iff S[i, j] is in M. Then, they reduce the interval-SUS problem to Containment Min: given an interval [x, y], a containment min query returns the shortest interval in I which contains [x, y]. They proposed a data structure of size O(n) in which RMQ is performed to answer a containment min query in O(1) time.
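The role of MUSs in interval queries can be sketched as follows. Every unique substring contains some MUS, so a shortest unique substring containing [s, t] is obtained by stretching some MUS just enough to cover the query interval. The Python sketch below scans all MUSs linearly; it is an illustration of this observation, not the constant-time RMQ-based structure of Hu et al. The function name is hypothetical, and the list of MUSs is assumed to be precomputed and given as 1-based inclusive pairs.

```python
def interval_sus(muss: list, s: int, t: int) -> list:
    """All shortest unique substrings containing [s, t], as (i, j) pairs.

    muss: precomputed minimal unique substrings, 1-based inclusive (i, j).
    """
    best_len = None
    answers = []
    for i, j in muss:
        cand = (min(i, s), max(j, t))    # smallest stretch covering [s, t]
        length = cand[1] - cand[0] + 1
        if best_len is None or length < best_len:
            best_len, answers = length, [cand]
        elif length == best_len and cand not in answers:
            answers.append(cand)
    return answers
```

For the string `abaab`, whose MUSs are (2, 3) and (3, 4), the query [4, 5] stretches (3, 4) to (3, 5), i.e., the substring aab.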
In addition, they show how to report all SUSs in the case that the interval-SUS query has more than one answer. For this purpose, they use an auxiliary problem.
The mentioned algorithms are in the RAM model. They also consider solving Problem 2 in the standard external memory model [35], following a similar technique. Let SORT(n) be the number of I/Os needed to sort n elements and B be the number of words in a disk block. They pre-compute an index structure from S in O(SORT(n)) I/Os in external memory that occupies O(n/B) blocks and can answer any shortest unique substring query in O(1) I/Os [2].

RLE-Based Approaches
The run-length encoding (RLE) of a string is a compressed representation in which each maximal run of a character c of length ℓ is encoded as c^ℓ. For instance, the RLE of the string aaaaaaabbbbaaac is a^7 b^4 a^3 c^1. Mieno et al. [8] considered solving Problem 2 in the case where the input string is given in RLE representation. Their motivation was to reduce the space and time complexity of processing the input string. They presented a data structure of size O(m + π_s(N, m)) that can be constructed in O(m log m + π_c(N, m)) time to answer interval-SUS queries in O(π_q(N, m) + occ) time, where occ is the output size, and π_s(N, m), π_c(N, m) and π_q(N, m) are the size, construction time and query time of a predecessor/successor data structure on m elements from a universe of [1, N]. In their approach, they use combinatorial properties of MUSs and RLE strings. Let m be the length of the string S in RLE representation, i.e., the number of runs. We denote the RLE representation of S by RLE(S). They show that the number of MUSs of S is at most 2m − 1. Thus, instead of dealing with Θ(n) MUSs as in the previous related results, a considerable amount of space and time is saved if the string contains long runs. By doing so, they were able to build an O(m)-size data structure for the RLE version of the interval-SUS problem (Problem 3).
For solving Problem 3, they show how to precompute all the MUSs using a specific type of suffix array for RLE strings [36]. Before discussing their techniques, we introduce some definitions. Let bpos(i), epos(i) and exp(i) be the beginning position, ending position, and exponent of the ith run of RLE(S). Let P be any subset of positions of S. The sparse suffix array of S w.r.t. P, denoted by SSA_P, is an array of size |P| such that SSA_P[i] ∈ P for all 1 ≤ i ≤ |P| and, as in a normal suffix array, the suffix S[SSA_P[i], n] is lexicographically smaller than S[SSA_P[i + 1], n].
They use the truncated RLE suffix array of RLE(S), denoted by tRLESA, in which P is the set of epos(i) for 1 ≤ i ≤ m. In addition, let EXP be an array of length m such that EXP[i] = exp(k), where tRLESA[i] = epos(k). Given RLE(S), all the defined arrays can be computed in O(m log m) time with O(m) working space [14].
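The run-level quantities defined above can be computed with a few lines of Python. This sketch only covers the encoding itself and the triples (bpos, epos, exp); it does not build tRLESA. The function names are hypothetical, and positions are 1-based as in the text.

```python
from itertools import groupby


def rle(s: str) -> list:
    """Run-length encoding of s as a list of (character, exponent) pairs."""
    return [(c, len(list(group))) for c, group in groupby(s)]


def run_positions(runs: list) -> list:
    """(bpos, epos, exp) of every run, with 1-based inclusive positions."""
    out = []
    start = 1
    for _, exp in runs:
        out.append((start, start + exp - 1, exp))
        start += exp
    return out
```

For the example string aaaaaaabbbbaaac, the four runs begin at positions 1, 8, 12, and 15, so m = 4 while n = 15.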
In order to compute the MUSs from RLE(S), they partition the set M of MUSs into three disjoint parts M_1, M_2, and M_3. The first part consists of the MUSs which are contained in runs. The second part consists of the MUSs that start at the last characters of runs, and all the other MUSs belong to the third part. Using the fact that a MUS cannot be a proper substring of another MUS, and bounding the size of each part separately, they prove that |M| ≤ 2m − 1. Then, they show how to find the MUSs of each part from RLE(S).

Compact Data Structures' Approaches
In Section 3.4, we have discussed Mieno et al.'s work [8] for solving Problem 1. They also presented a compact data structure for Problem 2 [14]. Their data structure has size 2n + 2m + o(n) bits and answers an interval-SUS query in O(occ) time, where m is the number of minimal unique substrings (MUSs) of the input string, which is at most n. Their technique is based on Mieno et al.'s work [8], which is discussed in Section 4.2. Similar to the technique for solving position-SUS, their structure is based on the two bit arrays MB and ME. In addition, they use the array MUS_len, which was used in Mieno et al.'s work [8] for storing the lengths of the MUSs. On top of MB and ME, successor and predecessor data structures are maintained. MUS_len is also endowed with an RMQ data structure. Once an interval query [s, t] arrives, we first find g = pred_ME(t) and r = succ_MB(s), from which we can find the range of SUSs covering [s, t]. Consequently, it is sufficient to answer RMQ queries on the corresponding range of MUS_len. The space needed for MB and ME and the predecessor and successor data structures is 2n + o(n) bits. The RMQ data structure on MUS_len takes 2m + o(m) bits of space. Thus, using the constant query time of the RMQ data structure (see Section 2.2), their results can be summarized as follows: there exists a data structure of size 2n + 2m + o(n) bits that can answer the interval-SUS problem in O(occ) time, where occ is the number of answers corresponding to the given interval.

Motivation
In molecular biology, shortest unique substrings found in DNA sequences can be used to compare similar organisms and determine unique patterns. They also help to design polymerase chain reaction (PCR) primers [1,15]. If we consider only exact shortest unique substrings while comparing distinct organisms, possible patterns might be disregarded due to errors or mutations. In this section, we discuss the approximate version of the SUS problem proposed by Hon et al. [3], in which the uniqueness constraint is stricter: a substring must remain unique even when up to k mismatches are allowed. This version can be applied in computational biology, where factors such as genetic mutation and experimental error make approximate string matching necessary [9]. Another useful application of this approximate version is in computing the average common substring, which has been considered as an approach to phylogenomic reconstruction [37,38]. In order to estimate the evolutionary distance between pairs of primate genomes, Thankachan et al. [39] showed that adding a similar k-mismatch parameter to the average common substring computation leads to better results [9].
A k-mismatch shortest unique substring covering a position p, denoted by SUS^k_p, is a k-mismatch unique substring S[i, j] such that i ≤ p ≤ j and there is no shorter k-mismatch unique substring satisfying this condition. Note that, as in the definitions in Section 2.1, we consider the Hamming distance for the k-mismatch SUS problem. The problem (Problem 4) is to report SUS^k_p for a given query position p.
In Hon et al.'s algorithm [3], every stage except one takes linear time for any k ≥ 0; only one stage requires quadratic time. They implemented their algorithm in C using the libdivsufsort library for the suffix array construction. Their contribution can be summarized as follows:
Theorem 8. Using an additional 2n words, each of size log_2 n bits, and n bytes of space for storing the input string, Hon et al. [3] provided a theoretically in-place framework to solve both the exact and the approximate position-SUS problems using a total of O(n) and O(n^2) time, respectively.

Allen et al.'s Framework
In order to reduce the quadratic time complexity of solving k-mismatch SUS queries, Allen et al. [9] presented an algorithm with an average-case time complexity of O(n log^k n) using O(kn) space. They follow the technique provided by Thankachan et al. [39] for solving the k-mismatch average common substring problem.
Let the k-mismatch left-bounded longest repeat starting at position i, denoted by LLR^k_i, be a k-mismatch repeat S[i, j] such that either j = n or S[i, j + 1] is k-mismatch unique. Consider L as an array of length n where L[i] = |LLR^k_i|. Now, the algorithm discussed in Section 5.2 can be applied to compute SUS^k_i for each position i in S in linear time and space. Computing the array L can be a substitute for stage one of Hon et al.'s work [3]. To avoid O(n^2) time, instead of comparing every pair of suffixes of S, they compute each entry of L in expected O(m log^k m) time using the technique in [39], where m is the total length of the two suffixes used in computing L[i]. The total time to compute every entry of L is expected O(n log^k n).
In 2020, Allen et al. [10] published a new version of their work which provides algorithmic bounds for the k-mismatch SUS problem. They presented an algorithm which can solve Problem 4 using O(n log^k n) time and O(n) space. This worst-case bound is asymptotically much better than that of the practical algorithm. In comparison to Hon et al.'s experimental results [3], their new implementation shows that the practical algorithm is easy to implement and takes less time when k is small relative to n.
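For reference, the array L can be computed by the quadratic baseline that the expected-time method is designed to avoid: compare every pair of suffixes and keep, for each i, the longest prefix of S[i, n] that matches a prefix of another suffix with at most k mismatches. The Python sketch below does exactly that; the function name is hypothetical, indices are 0-based, and it is not the algorithm of Allen et al.

```python
def k_mismatch_llr_lengths(s: str, k: int) -> list:
    """L[i] = length of the k-mismatch left-bounded longest repeat at i.

    Brute force over all pairs of suffixes; i is a 0-based position.
    """
    n = len(s)
    L = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            mismatches = 0
            length = 0
            # extend while both substrings fit and at most k mismatches occur
            while i + length < n and j + length < n:
                if s[i + length] != s[j + length]:
                    mismatches += 1
                    if mismatches > k:
                        break
                length += 1
            L[i] = max(L[i], length)
    return L
```

With k = 0 and `s = "abcabd"`, the entry for the first position is 2 (the repeat ab); allowing one mismatch raises it to 3, since abc and abd differ in a single character.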

GPU Based Approach
All the works discussed so far on k-mismatch SUS queries focus on improving the time and space complexity of SUS computation in the sequential CPU model. In 2018, Schultz and Xu [15] presented the first parallel approach for the k-mismatch SUS problem in the shared-memory model, leveraging massively multi-threaded GPU technology. Their experimental results on a mid-range GPU show that the GPU approach is at least six times faster than the CPU approach in the case of k = 0 (exact match), and at least 23 times faster in the general case (k > 0). This speedup matters when the input string is very long, as is the case for genomic sequences. In terms of memory usage, this approach is almost the same as the sequential CPU approaches. Their algorithms run entirely on the GPU, except for the transfer of the input string to the GPU, which is done by the CPU host. When there are no mismatches (k = 0), they implicitly use efficient data structures such as the suffix array and the lcp array. However, when k > 0, these data structures do not carry enough information, such as k-mismatch LCPs. To solve this issue, Schultz and Xu [15] designed a method that can be parallelized on the GPU architecture.
In 2019, they published an extension of their paper [16], providing more experimental results on exact and approximate SUS computation. They report the speedup gained by the GPU-based approach over the sequential solution, both including and excluding suffix array construction. The speedup differs depending on whether the suffix array is given. According to their results, almost 50% of the time of the sequential approach is spent on suffix array construction. Thus, if the suffix array is given, it can be copied to the GPU memory directly to achieve a better speedup.

Motivation
A substring S[i, j] is called a palindrome if it is identical to its reverse. There have been many studies on palindromic strings and their combinatorial properties [32,40,41]. A palindromic string or substring is an important structure in DNA, RNA, or protein sequence analysis [22]. In biological data, palindromic structures show the ability of molecules to fold and form double-stranded stems [23]. In addition, similar palindromic structures in protein strands suggest similar secondary structures. Another application of palindromic substructures is in gene editing and gene regulation in species [22,42]. In this section, we discuss a version of SUS queries named the shortest unique palindromic substring (SUPS) problem. A shortest unique palindromic substring (SUPS) for an interval [s, t] is a palindromic substring S[i, j] that is unique in S and contains [s, t], such that every other palindromic substring of S that contains [s, t] and is shorter than S[i, j] occurs more than once in S. The SUPS problem (Problem 5) asks for the SUPSs of a given query interval. For example, given the query interval [8, 9], the SUPS problem outputs S[6, 9] = adda, which is the shortest unique palindrome containing S[8, 9] = da.
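The definition can be checked on small inputs with a brute-force Python reference. It enumerates every substring containing the query interval and keeps the shortest ones that are both palindromic and unique. The function names and the example string `abaab` are illustrative choices, not taken from the surveyed papers; positions are 1-based and inclusive. Note that a SUPS need not exist for every interval.

```python
def occurrences(s: str, t: str) -> int:
    """Number of (possibly overlapping) occurrences of t in s."""
    return sum(1 for i in range(len(s) - len(t) + 1) if s.startswith(t, i))


def sups_bruteforce(s: str, lo: int, hi: int) -> list:
    """All SUPSs for the interval [lo, hi], as 1-based inclusive (i, j) pairs."""
    n = len(s)
    best = []
    for i in range(1, lo + 1):
        for j in range(hi, n + 1):
            t = s[i - 1:j]
            if t != t[::-1] or occurrences(s, t) != 1:
                continue                      # not a unique palindrome
            if not best or j - i < best[0][1] - best[0][0]:
                best = [(i, j)]               # strictly shorter candidate
            elif j - i == best[0][1] - best[0][0]:
                best.append((i, j))           # tie: keep all shortest ones
    return best
```

In `abaab`, the unique palindromes are aba, aa, and baab; the point query [3, 3] is answered by aa = S[3, 4], while the interval [1, 5] has no SUPS because the whole string is not a palindrome.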

Optimal Approaches
In 2018, Inoue et al. [4] introduced Problem 5, which is a version of SUS queries focusing on palindromes. They propose an optimal algorithm which takes O(n) time to preprocess the string S and can return all SUPSs for any interval query in O(occ + 1) time, where occ is the number of outputs. Their preprocessing computes the minimal unique palindromic substrings (MUPSs) of S, which lets them preprocess S in linear time to report all SUPSs.
In addition, they provided some bounds on the number of SUPSs. They showed that, for a string S over an alphabet of arbitrary size, the number of point and interval SUPSs in S is at most n. For a binary alphabet, they gave a lower bound on the number of point SUPSs. Let PS_S and IS_S be the sets of substrings of S such that each substring in PS_S is a SUPS for some point and each substring in IS_S is a SUPS for some interval.

Theorem 9 ([4]).
There exists a binary string $S_k$ of length $6k + 8$ such that $|PS_{S_k}| = |IS_{S_k}| = 4k + 6$ for any $k \geq 0$. Thus, $S_k$ contains at least $\frac{2}{3}n$ point and interval SUPSs, where $n = |S_k|$.
In the end, they discuss the number of point and interval SUPSs of palindromic rich strings. A string of length n is palindromic rich if it has n + 1 distinct palindromic substrings, including the empty string. They consider the palindromic rich string $R_k = a^1 b^1 \cdots a^k b^k$ and prove the following two theorems.

Theorem 10 ([4]).
There exists a binary string of length n such that the number of interval SUPSs is $n - \sqrt{1 + 4n} + 3$.
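As a consistency check, assume that the string in Theorem 10 is the palindromic rich string R_k, read as runs of a and b of lengths 1 through k (an assumption of this note, since the theorem statement does not name the string). The bound then simplifies to a polynomial in k:

```latex
\begin{align*}
n &= |R_k| = 2\sum_{i=1}^{k} i = k(k+1), \\
1 + 4n &= 4k^2 + 4k + 1 = (2k+1)^2, \\
n - \sqrt{1 + 4n} + 3 &= k(k+1) - (2k+1) + 3 = k^2 - k + 2 .
\end{align*}
```

Under this reading, the square root is always an integer, and the number of interval SUPSs is n minus a term of order √n.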

RLE-Based Approaches
Watanabe et al. [11] considered the shortest unique palindromic substring problem when the input string is given in its RLE representation. They showed how to preprocess a given RLE(S) of size m in O(m) space and $O(m \log \sigma_{RLE_S} + m \log m / \log\log m)$ time, where $\sigma_{RLE_S}$ is the number of distinct runs in RLE(S). Their work is the first space-economical solution for the SUPS problem and can answer queries in O(log m / log log m + occ) time.
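The quantities m and the number of distinct runs can be illustrated with a short sketch. The function name `rle` and the example string are hypothetical, and treating two runs as equal when they share both character and length is an assumption about what "distinct runs" means here.

```python
from itertools import groupby


def rle(s):
    """Run-length encode s as a list of (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


runs = rle("aaabbaaabb")
m = len(runs)                   # size of the RLE representation
distinct_runs = len(set(runs))  # runs compared as (character, length) pairs
```

For this string, the encoding has m = 4 runs but only two distinct ones, (a, 3) and (b, 2), while the uncompressed length is n = 10.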
Similar to the technique for the standard SUPS problem discussed in the previous subsection, they compute the MUPSs of S in the preprocessing step. Their idea is based on the fact that the number of MUPSs of any string S is at most m [11]. They first compute the palindromic substrings whose center coincides with the center of some run in the string. These palindromic substrings are called run-centered palindromes. To compute them, they utilize Manacher's algorithm [43], which scans the string and constructs an array MaxPal of length 2n + 1 whose ith entry is the length of the maximal palindrome with center i. Using this technique, they compute all the RLE-maximal palindromes of S in O(m) time and space.
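A minimal sketch of Manacher's algorithm on the plain (uncompressed) string is given below, to show what the MaxPal array contains. It runs in O(n) time and is not the O(m) RLE variant of [11]. The function name and the 0-based indexing of centers (even indices are gaps between characters, odd indices are characters) are assumptions of this sketch.

```python
def max_pal(s):
    """Return MaxPal: for each of the 2n + 1 centers, the length of the
    maximal palindrome of s centered there (Manacher's algorithm)."""
    # Interleave characters with None so that gaps are centers as well.
    t = [None]
    for ch in s:
        t += [ch, None]
    m = len(t)                      # m == 2 * len(s) + 1
    rad = [0] * m                   # radius in t == palindrome length in s
    center = right = 0              # rightmost palindrome found so far
    for i in range(m):
        if i < right:
            # Reuse the value at the mirror position inside [.., right].
            rad[i] = min(right - i, rad[2 * center - i])
        lo, hi = i - rad[i] - 1, i + rad[i] + 1
        while lo >= 0 and hi < m and t[lo] == t[hi]:
            rad[i] += 1
            lo -= 1
            hi += 1
        if i + rad[i] > right:
            center, right = i, i + rad[i]
    return rad
```

For the hypothetical input abaab, the entry at index 3 (the character b at position 2) is 3 for aba, and the entry at index 6 (the gap between positions 3 and 4) is 4 for baab.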
In order to improve the space complexity, they build a modified eertree that works on the run-length encoded string instead of the original one. They prove that this new data structure, the RLE-eertree ($e^2rtre^2$), has 2m + 1 nodes, which is linear in the size of RLE(S). Thus, it uses less space than the original eertree. The construction time of $e^2rtre^2$ is $O(m \log \sigma_{RLE_S})$. They use this tree to compute all the MUPSs of S. Using the list of MUPSs, they present their algorithm for SUPS queries. Their technique is similar to that of Inoue et al. [4], with the difference that the space must be O(m).
In 2020, Watanabe et al. [12] extended their work to a variant of the SUPS problem in which the query interval is also given in a run-length encoded representation. Their technique is similar to their previous approach [11]. They used combinatorial properties of maximal palindromes and presented an O(m)-space data structure that answers queries in O(log log m + occ) time [12].

Range-SUS
Range queries are a classic data structure topic with strong motivation in string processing problems [5,24-26]. In 2019, Abedin et al. [5] studied the range shortest unique substring (rSUS) problem, a localized version of SUS in which the substring starts within a specific range of the string. Given a range [α, β], the problem (Problem 6) is to return a shortest substring of S with exactly one occurrence in [α, β], reported as a pair (p, ℓ) of its starting position and length. If α = β, the answer (α, 1) is trivial. Thus, in the rest, we assume that α < β. In the accompanying example with query range [5, 16], we need to find a shortest substring of S with exactly one occurrence in [5, 16]. The output there is (p, ℓ) = (10, 2), since S[10, 11] = ac is the shortest substring of S with exactly one occurrence in [5, 16].
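A brute-force sketch of the query may help fix the definition. It assumes that an occurrence is "in [α, β]" when its starting position lies in the range, and it breaks ties by returning the leftmost answer; both choices, like the function name and the test string, are assumptions of this sketch rather than details taken from [5].

```python
def rsus_bruteforce(text, alpha, beta):
    """Return (p, l) for a shortest substring with exactly one occurrence
    starting in the 1-based range [alpha, beta]; leftmost on ties."""
    n = len(text)
    for h in range(1, n - alpha + 2):
        starts = {}  # substring of length h -> start positions in range
        for k in range(alpha, min(beta, n - h + 1) + 1):
            starts.setdefault(text[k - 1:k - 1 + h], []).append(k)
        # Dicts keep insertion order, so the first unique entry is leftmost.
        for positions in starts.values():
            if len(positions) == 1:
                return positions[0], h
    return None  # unreachable for 1 <= alpha <= beta <= n
```

For the hypothetical string aabab, the range [1, 4] gives (3, 1) because b starts only once in that range, while the range [1, 2] gives (1, 2) because a starts twice there and aa only once.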
For each position k ∈ [1, n], Abedin et al. [5] keep track of the previous and next occurrences of the substring S[k, k + h − 1], denoted by Prev(k, h) and Next(k, h), respectively. They define λ(a, b, k) and $C_k$, where λ(a, b, k) = min{h | Prev(k, h) < a and Next(k, h) > b}. Here, λ(a, b, k) is the length of the shortest substring that starts at position k and has exactly one occurrence in [a, b], and $C_k$ is the set of candidate lengths for an rSUS answer that starts at position k.
Their main result is an upper bound on the total size of all sets $C_k$, which is given in Lemma 6.
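The quantities Prev, Next, and λ can be written down directly from their definitions. The sketch below is a naive illustration, not the data structure of [5]; the function names are hypothetical, and using −∞ and +∞ when no previous or next occurrence exists, as well as returning None when no valid length exists, are assumptions.

```python
import math


def prev_occ(text, k, h):
    """Start of the nearest occurrence of text[k..k+h-1] before k, or -inf."""
    pat = text[k - 1:k - 1 + h]
    for i in range(k - 1, 0, -1):
        if text[i - 1:i - 1 + h] == pat:
            return i
    return -math.inf


def next_occ(text, k, h):
    """Start of the nearest occurrence of text[k..k+h-1] after k, or +inf."""
    pat = text[k - 1:k - 1 + h]
    for i in range(k + 1, len(text) - h + 2):
        if text[i - 1:i - 1 + h] == pat:
            return i
    return math.inf


def lam(text, a, b, k):
    """lambda(a, b, k): the minimum h with Prev(k, h) < a and Next(k, h) > b."""
    for h in range(1, len(text) - k + 2):
        if prev_occ(text, k, h) < a and next_occ(text, k, h) > b:
            return h
    return None  # no substring starting at k is unique within [a, b]


def rsus_via_lambda(text, alpha, beta):
    """Answer an rSUS query as the minimum of lambda over k in [alpha, beta]."""
    best = None
    for k in range(alpha, beta + 1):
        h = lam(text, alpha, beta, k)
        if h is not None and (best is None or h < best[1]):
            best = (k, h)
    return best
```

For the hypothetical string aabab and range [1, 4], the values of λ at k = 1, 2, 3 are 2, 3, and 1, and no valid length exists at k = 4, so the query answer is (3, 1).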
Abedin et al. solved rSUS queries by reducing the problem to a top-1 rectangle stabbing query on a set of weighted rectangles with the input point (α, β). Assume that the answer for Problem 6 is (p, ℓ). Given a query range [α, β], the answer (p, ℓ) is the pair (k, h) with the minimum h under the following conditions: k ∈ [α, β], h ∈ $C_k$, Prev(k, h) < α, and Next(k, h) > β. Equivalently, (p, ℓ) is the pair (k, h) with the minimum h such that h ∈ $C_k$, α ∈ (Prev(k, h), k], and β ∈ [k, Next(k, h)).
In the preprocessing step, they map each h ∈ $C_k$ to a rectangle $R_{k,h}$ with weight h, which by the conditions above consists of the points (x, y) with x ∈ (Prev(k, h), k] and y ∈ [k, Next(k, h)). Given a query [α, β], the minimum-weight rectangle stabbed by the point (α, β) is $R_{p,\ell}$. By combining the optimal data structure for top-1 rectangle stabbing of Chan et al. [44] with the bound on the number of candidate lengths in Lemma 6, they prove the following result.
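The reduction can be sketched with a naive linear scan in place of the stabbing structure of Chan et al. [44]. As a simplification, this sketch creates one rectangle for every pair (k, h) instead of only for h in the candidate set, which leaves the minimum-weight stabbed rectangle unchanged but uses far more space; the function names, the tuple layout, and the leftmost tie-breaking are assumptions.

```python
import math


def build_rectangles(text):
    """Return (prev, k, nxt, h) tuples: the rectangle (prev, k] x [k, nxt)
    with weight h, for every start k and length h (a superset of the
    candidate lengths, used here for simplicity)."""
    n = len(text)
    starts = {}  # substring -> increasing list of 1-based start positions
    for h in range(1, n + 1):
        for k in range(1, n - h + 2):
            starts.setdefault(text[k - 1:k - 1 + h], []).append(k)
    rects = []
    for sub, pos in starts.items():
        for idx, k in enumerate(pos):
            prev = pos[idx - 1] if idx > 0 else -math.inf
            nxt = pos[idx + 1] if idx + 1 < len(pos) else math.inf
            rects.append((prev, k, nxt, len(sub)))
    return rects


def stab(rects, alpha, beta):
    """Top-1 stabbing by linear scan: the minimum-weight rectangle that
    contains the point (alpha, beta), returned as (k, h)."""
    best = None
    for prev, k, nxt, h in rects:
        if prev < alpha <= k <= beta < nxt:
            if best is None or (h, k) < (best[1], best[0]):
                best = (k, h)
    return best
```

On the hypothetical string aabab, stabbing with the point (1, 4) returns (3, 1), the same answer as a direct search over all substrings starting in [1, 4].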

Theorem 12.
There exists an O(n log n)-word data structure that answers rSUS queries in $O(\log_w n)$ time per query in the word RAM model, where w = Ω(log n) is the word size.

Discussion and Future Work
In this paper, we reviewed several types of shortest unique substring queries and their corresponding solutions. All the problems discussed here are well-motivated topics in string processing and computational biology. Although we discussed more than 10 approaches for variants of SUS queries, there remain related topics that have not been studied or for which no efficient algorithm is known. In this section, we discuss some of these topics and open questions for future work:

•
We discussed the solutions for approximate SUS queries in Section 5. However, there is no efficient in-place algorithm which can find LSUSs to get SUSs afterward. Another technique that can be applied to approximate SUS queries is to work on the RLE representation of the input string; Section 4.2 shows this technique for interval-SUS queries. To our knowledge, an RLE-based approach for approximate SUS queries has not been studied. In addition, no work considers the standard external memory model for the approximate SUS problem. As I/O-efficient constructions of the suffix array and LCP array exist [45-48], it seems possible to replace the RAM-model construction of these arrays with their external-memory counterparts.

•
In Section 4.2, the term $\pi_q(N, m)$ in the query time of Theorem 6 is log m / log log m, which is the time for performing dynamic predecessor/successor queries using O(|RLE(S)|) space [8]. To make queries faster within the same space, the question is whether there exists a data structure of size O(|RLE(S)|) that can efficiently answer Problem 3 without using predecessor/successor queries.

•
As we discussed in Section 6, palindromic substrings are well motivated in computational biology. All the reviewed works find exact SUPSs. As with the approximate SUS problem, approximate SUPS queries are worth studying in order to account for errors and mutations. Besides the definition in Problem 5, the following definition is strongly motivated in bioinformatics: a nucleotide sequence is considered a palindrome if the reverse of its complementary strand is equal to the original sequence [49]. The question is whether the methods discussed in Section 6 can be applied to solve the SUPS problem efficiently under this definition.

•
The last topic that we discussed was the rSUS problem. According to Theorem 12, rSUS queries can be answered in $O(\log_w n)$ time using a data structure of O(n log n) words. The question is whether we can design an efficient O(n)-word data structure for the rSUS problem. In addition, the approximate version of rSUS queries has not been studied. It is possible to combine the technique discussed in Section 7 with the framework of Thankachan et al. [50] to provide an efficient algorithm for the approximate rSUS problem.

•
Besides shortest unique substrings, maximal unique matches are an important concept in computational biology for aligning two long genome sequences [51]. Ganguly et al. [18] applied a technique similar to the one discussed in Section 3.4 to find maximal unique matches of two strings. As far as we are aware, the dynamic version (when mismatches are allowed) of this problem has not been studied yet. We believe that, by modifying the techniques for the dynamic longest common substring problem (LCS after k mismatches) [52-54], the approximate maximal unique matches problem can be solved in subquadratic time.
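The biological notion of a palindrome mentioned in the list above (a sequence equal to the reverse of its complementary strand) differs from the textual one used in Problem 5. A minimal sketch of the check, assuming the DNA alphabet A, C, G, T with the usual base pairing and hypothetical function names, is:

```python
# Watson-Crick base pairing: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_rc_palindrome(seq):
    """True if seq equals the reverse of its complementary strand."""
    return seq == reverse_complement(seq)
```

For example, GAATTC is a palindrome in this sense although it is not equal to its own reverse, so a SUPS algorithm would need its palindrome test replaced by this check before it could address the question raised above.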

Conflicts of Interest:
The authors declare no conflict of interest.