Abstract
The shortest unique substring (SUS) problem is an active line of research in the field of string algorithms and has several applications in bioinformatics and information retrieval. The initial version of the problem was proposed by Pei et al. [ICDE’13]. Over the years, many variants and extensions have been pursued, which include positional-SUS, interval-SUS, approximate-SUS, palindromic-SUS, range-SUS, etc. In this article, we highlight some of the key results and summarize the recent developments in this area.
1. Introduction
Let T be a string of length n and T[i..j] be the substring which starts at position i and ends at position j of T. The substring T[i..j] is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Finding shortest unique substrings is a non-trivial problem with several applications, and many variants of it have been studied. Table 1 shows the variants of the Shortest Unique Substring (SUS) problem that we focus on in this survey.
Table 1.
This table categorizes the main papers which are reviewed and discussed in this survey.
In 2005, Haubold et al. [19] explained how the shortest unique substring is a useful construct for alignment-free genome comparison. Unique substrings can help to determine the distinctness and difference between a group of closely related organisms [1,19]. In addition, an algorithm for finding a unique substring can help to build a unique genetic fingerprint from a DNA sample or to design polymerase chain reaction (PCR) primers in molecular biology [1,20]. In 2015, Adas et al. [21] investigated the usage of shortest unique substrings for alignment and compression of DNA sequences. In addition to the applications in bioinformatics, shortest unique substrings can be used in information retrieval for document search. The position-based shortest unique substring (position-SUS) queries were first proposed by Pei et al. [1]. Given a string of length n and a query point p in the string, the problem is to find a shortest unique substring covering p. They presented an algorithm which costs O(n^2) time and O(n) space. In 2014, several publications revisited this problem. İleri et al. [6] and Tsuruta et al. [13] proposed an optimal O(n) time and space algorithm for solving this problem. In addition, Hon et al. [3] and Ganguly et al. [18] presented compact data structures that can answer a position-SUS query using less than words of space, where a word size is . Several variations of the original problem have been proposed in the following years.
In 2014, Hu et al. [2] generalized the position- problem by considering the positions in an interval rather than a single position in the string. For the interval- problem, they preprocess the input string in time and space to answer any interval query in constant time and return all s in time, where is the number of outputs. In 2019, Mieno et al. [14] improved the space complexity of this problem by providing a compact data structure.
In 2016, Mieno et al. [8] considered queries on run-length encoded(RLE) strings. Their motivation was to reduce space and time complexity of processing the input string. They showed how to construct a data structure of size in time that can answer interval- queries in time, where is the output size, and , and are the size, construction time and the query time for a predecessor/successor query on m elements from a universe of .
In 2017, Hon et al. [3] proposed the approximate version of SUS queries, where mismatches are allowed. This version can be applied in computational biology, where factors such as genetic mutation and experimental error make approximate string matching necessary [9]. They presented an in-place algorithm for both the exact and approximate versions of the problem. Afterwards, different trade-offs were presented for the k-mismatch SUS problem [9,10,15,16].
In 2018, Inoue et al. [4] proposed a palindromic variant of the interval-SUS problem. A palindromic string or substring is an important structure in DNA, RNA or protein sequence analysis [22]. In biological data, palindromic structures show the ability of molecules to fold and form double-stranded stems [23]. Given a string, a shortest unique palindromic substring (SUPS) for an interval of the string is the shortest substring that is palindromic, unique in the string, and contains the interval. For solving the interval-SUPS problem, they preprocess the string in time and space to output all SUPSs in time. There are other works on SUPS queries by Watanabe et al. [11,12], based on the RLE representation, which are space-economical solutions.
In 2019, Abedin et al. [5] focused on the range version of queries (Range-) and generalized the problem. Range queries are a classic data structure topic, which has a great motivation in string processing problems [5,24,25,26]. Given a range , the problem is to return a shortest substring with exactly one occurrence in . They presented an -word data structure which answers queries in time per query in the word RAM model, where is the word size [5].
In this survey, we discuss all of the approaches to SUS queries mentioned above. The focus of this work is on techniques applicable to SUS queries, and we compare the main results in terms of complexities, restrictions, problem definitions, motivations, and applications. The main papers that we focus on are categorized in Table 1. In the last section, we discuss open questions related to the variants of SUS queries.
2. Preliminaries
2.1. Definitions
Let T be a string of length n (i.e., |T| = n) over an alphabet set Σ, and let T[i] be the ith character of T. The substring of T which starts at position i and ends at position j is denoted by T[i..j]. We have |T[i..j]| = j − i + 1. If i > j, then T[i..j] is the empty string. A prefix of T is a substring T[1..i] for some 1 ≤ i ≤ n; it is a proper prefix if i < n. A suffix of T is a substring T[i..n] for some 1 ≤ i ≤ n; it is a proper suffix if i > 1. We say T[i..j] covers position p if i ≤ p ≤ j. T[x..y] is a proper substring of T[i..j] if i ≤ x ≤ y ≤ j and y − x < j − i. A substring T[i..j] is a unique substring if there is no other substring T[i'..j'] with i' ≠ i such that T[i..j] = T[i'..j'], where = indicates that the two strings are identical. The shortest unique substring (SUS) covering position p is a unique substring of T that contains T[p], such that no shorter unique substring contains T[p]. Note that there might exist more than one SUS covering position p. T[i..j] is a minimal unique substring (MUS) if it is a unique substring and no proper substring of T[i..j] is also unique. T[p..j] is called the left-bounded SUS for position p, denoted LSUS_p, if T[p..j] is unique and there is no other substring T[p..j'] which is also unique for j' < j. Symmetrically, T[i..p] is the right-bounded SUS for position p, denoted RSUS_p, if it is a unique substring and there is no other unique substring T[i'..p] with i' > i.
denotes the Hamming distance between two strings and of equal length, defined as the number of string positions where characters differ. is a k-mismatch unique substring if there is no substring such that and . The k-mismatch left-bounded shortest unique substring starting at position p, denoted as , is a k-mismatch unique substring , such that either or any proper prefix of is not k-mismatch unique. k-mismatch and k-mismatch can be defined similarly.
The longest common prefix of two suffixes and denoted by , is the longest common prefix between and . The k-mismatch longest common prefix of and denoted as is the longest prefix which has Hamming distance ≤k between two suffixes.
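To make the two notions above concrete, the following is a minimal sketch that computes them by direct character comparison. The function names `lcp` and `lcp_k` and the 0-based indexing are assumptions made for this illustration; they are not taken from the surveyed papers, which use faster techniques.

```python
def lcp_k(t: str, i: int, j: int, k: int) -> int:
    """Length of the longest prefix shared by the suffixes t[i:] and t[j:]
    when up to k mismatching positions (Hamming distance <= k) are allowed.
    Positions are 0-based (an assumption of this sketch)."""
    length = 0
    mismatches = 0
    while i + length < len(t) and j + length < len(t):
        if t[i + length] != t[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the prefix
        length += 1
    return length


def lcp(t: str, i: int, j: int) -> int:
    """Exact longest common prefix length: the k = 0 special case."""
    return lcp_k(t, i, j, 0)
```

For example, in "banana" the suffixes starting at 0-based positions 1 and 3 ("anana" and "ana") share an exact prefix of length 3, while allowing one mismatch lets "banana" and "nana" agree on a prefix of length 4.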
2.2. Data Structures
The suffix tree data structure of string is a compact trie of the n suffixes of appended with a letter [27]. This suffix tree consists of n leaves (one for each suffix of ) and at most internal nodes. The edges are labeled with substrings of . We denote the suffix tree of string with . The Suffix Array of string of length n is denoted by , which is a permutation of , such that if is the ith lexicographically smallest suffix of . The Inverse Suffix Array of string of length n, is a permutation of , such that . of can be constructed in linear time and space [28,29]. The longest common prefix () array of a string of length n is an integer array of length n such that and stores the length of the longest common prefix between and . Given the suffix array of , array can be constructed in time. Let be an array of length n. A range minimum query (RMQ) with an input range asks to report . By maintaining a data structure of size bits, any RMQ on A can be answered in time [30] (even without accessing A).
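As an illustration of the arrays described above, here is a small sketch that builds a suffix array, its inverse, and the LCP array. It assumes 0-based positions and no sentinel character, and it builds the suffix array by plain sorting, which is far from the linear-time constructions of [28,29]; only the LCP step follows the well-known linear-time scan of Kasai et al. All function names are hypothetical.

```python
def suffix_array(t: str) -> list[int]:
    """Starting positions of the suffixes of t in lexicographic order.
    Naive construction by sorting; for illustration only."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[start] = rank of the suffix beginning at `start`."""
    isa = [0] * len(sa)
    for rank, start in enumerate(sa):
        isa[start] = rank
    return isa


def lcp_array(t: str, sa: list[int]) -> list[int]:
    """lcp[r] = length of the longest common prefix of the suffixes
    ranked r-1 and r, with lcp[0] = 0 (Kasai-style linear scan)."""
    n = len(t)
    isa = inverse_suffix_array(sa)
    lcp = [0] * n
    h = 0
    for i in range(n):
        r = isa[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]  # suffix immediately preceding t[i:] in sorted order
        while i + h < n and j + h < n and t[i + h] == t[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h-1 characters
    return lcp
```

For "banana" this yields the suffix array [5, 3, 1, 0, 4, 2] and the LCP array [0, 1, 3, 0, 0, 2].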
Consider s as a subset of . Then, s can be preprocessed into an space data structure, such that for any query p, we can return and in time [31], where
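A predecessor/successor structure can be imitated with binary search over a sorted list, as in the sketch below. This is only an illustration: it takes logarithmic time per query rather than the bound of [31], and it assumes the inclusive convention (the predecessor of p is the largest element not exceeding p, the successor is the smallest element not less than p), which may differ from the exact definition used in the cited work.

```python
from bisect import bisect_left, bisect_right


def pred(sorted_vals: list[int], p: int):
    """Largest element <= p, or None if no such element exists."""
    i = bisect_right(sorted_vals, p)
    return sorted_vals[i - 1] if i > 0 else None


def succ(sorted_vals: list[int], p: int):
    """Smallest element >= p, or None if no such element exists."""
    i = bisect_left(sorted_vals, p)
    return sorted_vals[i] if i < len(sorted_vals) else None
```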
The eertree of a string , is a pair of two rooted trees and that represent all distinct palindromic substrings of [32]. and store the palindromic substrings of odd and even length, respectively. There is a directed edge from root r of if v represents a single character . For any non-root node u in either or , there is a labeled directed edge from u to v with character label a if . There are no two out-going edges from a node with the same label.
3. Position-SUS Queries
3.1. Motivation
Consider the procedure a search engine performs. Once a user submits a search query, all the related pages should be identified and ranked properly. An indexing process is needed to organize information before any search query is served. There are algorithms such as inverted indexing to keep track of the documents with pointers to text elements. Modern search engines may use a snippet, which is a short summary of a whole website shown in the search results. Finding a proper length for a snippet is critical: a snippet that is too short may not make the text elements distinguishable, while one that is too long may overwhelm users [1]. If the snippet for each search result is the shortest possible text that includes the query term and differs from all other snippets, the search would be optimized. Thus, providing a fast algorithm for finding a shortest snippet is crucial. In addition to information retrieval purposes, there are some motivations in bioinformatics. In 2005, Haubold et al. [19] explained how the shortest unique substring is a useful construct for alignment-free genome comparison. Unique substrings can help to determine the distinctness and difference between a group of closely related organisms [1,19]. Another application is in event analysis, when one wants to understand how an event differs from other events of the same type in a long sequence of historical events while extracting the context of the event. The shortest unique substring of the selected event may be helpful to proceed with the event analysis [1].
Regarding these motivations, Pei et al. [1] introduced the following problem.
Problem 1.
Position- Queries
Input: String and a position .
Output: covering p: Substring containing position p, i.e., , such that is unique and as short as possible
Example 1.
Given and a query position , we need to find a shortest substring , with exactly one occurrence in such that . One possible output is . Note that we may have multiple answers for Problem 1. is another output.
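Problem 1 can be illustrated with a brute-force sketch that tries every length in increasing order and reports all unique substrings of the first successful length that cover the query position. The string "banana", the function names, and the 0-based inclusive (start, end) output format are assumptions chosen for this illustration; they are not the instance used in Example 1, and the approach is cubic-time or worse, unlike the algorithms surveyed below.

```python
def count_occurrences(t: str, sub: str) -> int:
    """Number of (possibly overlapping) occurrences of sub in t."""
    return sum(1 for i in range(len(t) - len(sub) + 1) if t.startswith(sub, i))


def position_sus_bruteforce(t: str, p: int) -> list[tuple[int, int]]:
    """All shortest unique substrings of t covering 0-based position p,
    returned as inclusive (start, end) pairs from left to right."""
    n = len(t)
    for length in range(1, n + 1):
        hits = []
        # every start i such that t[i : i + length] covers p and fits in t
        for i in range(max(0, p - length + 1), min(p, n - length) + 1):
            if count_occurrences(t, t[i:i + length]) == 1:
                hits.append((i, i + length - 1))
        if hits:
            return hits
    return []
```

For "banana" and p = 2 (the first "n"), the two answers are (0, 2) for "ban" and (2, 4) for "nan", which shows that a query may have more than one shortest unique substring.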
3.2. Suffix Trees Based Approach
Pei et al. [1] stated Problem 1 and presented an algorithm to answer this problem in time and space. They construct in space and time [33]. Then, they use to get in constant time by the following steps:
- Find the leaf node corresponding to the suffix
- If the label of the leaf edge is $, it means that does not exist and we return null; Otherwise, we continue.
- Let l be the length of the label of the leaf edge (excluding $).
- is
All of the above steps can be completed in time using the suffix tree properties. They make clever use of Lemma 1 and, consequently, they can find a covering p by the following Lemma.
Lemma 1
([1,6]). Every SUS is either an LSUS or an extension of an LSUS.
They start by taking LSUS_p as the candidate answer, and then look for an LSUS_i with i < p that yields a shorter candidate. For each LSUS_i = T[i..j], if j < p, then T[i..p] is considered as a candidate answer instead of T[i..j], since the answer must cover p. This is called an extension of an LSUS. Thus, they always make sure that a new candidate covers p. At the end, since there may exist more than one answer, they output the leftmost SUS containing p.
In addition, they show how to preprocess in time and space to compute the corresponding to every position in the string. By doing so, queries for any position can be answered in constant time. Their technique is based on the fact that each should fall into one of , , or . By this observation, corresponding to each position can be precomputed using their propagation procedure [1].
From the space complexity point of view, corresponding to each position p, their algorithm keeps track of a currently shortest that covers p. The total space needed to store this information for all positions is . At the end, they apply their algorithm on real data sets to show the effectiveness of their algorithm. Theorem 1 summarizes their result.
Theorem 1.
The position- problem can be answered in time and space for every location of a string of length n.
3.3. Linear Time Approaches
Ileri et al. [6,7] and Tsuruta et al. [13] independently improved the time complexity of Theorem 1 from O(n^2) to O(n). They showed that, by preprocessing the string of length n in O(n) time, Problem 1 can be answered in constant time, which yields a linear total time complexity for all positions in the string [6,7,13]. First, we discuss Ileri et al.’s approach.
3.3.1. Ileri et al.’s Framework
Their framework includes two cases; finding for one position, and finding for every position. In the first case, they present an algorithm for finding all the s covering a specific location p in time. Similar to the approach explained in Section 3.2, they make this observation that each is an extension of an . Instead of using suffix tree structure to find s, they use inverse suffix array and array to compute for as follows:
where and means does not exist. Using the above equation, for each string position p, they simply compute in time and maintain the shortest one for position p. Note that, if for some , does not cover p, we can extend it up to position p. In case of multiple shortest answers, they keep the leftmost candidate. Then, by Lemma 1, they prove their first theorem as follows:
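The computation described above can be sketched as follows. The sketch assumes the commonly used formulation in which the left-bounded SUS starting at position i is T[i..i + L_i], where L_i is the larger of the two LCP values adjacent to the rank of suffix i in the suffix array, and it does not exist when i + L_i runs past the end of the string. Positions are 0-based, the suffix and LCP arrays are built naively, and all names are hypothetical; this is not the authors' implementation.

```python
def lsus_ends(t: str) -> list:
    """For each 0-based start i, the inclusive end of the left-bounded SUS
    starting at i, or None when no unique substring starts at i."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # naive suffix array
    isa = [0] * n
    for rank, start in enumerate(sa):
        isa[start] = rank

    def common_prefix(i: int, j: int) -> int:
        h = 0
        while i + h < n and j + h < n and t[i + h] == t[j + h]:
            h += 1
        return h

    # lcp[r] = LCP of the suffixes ranked r-1 and r, with lcp[0] = 0
    lcp = [0] + [common_prefix(sa[r - 1], sa[r]) for r in range(1, n)]
    ends = []
    for i in range(n):
        r = isa[i]
        # longest prefix of t[i:] that also occurs elsewhere in t
        longest_repeat = max(lcp[r], lcp[r + 1] if r + 1 < n else 0)
        end = i + longest_repeat  # one character longer than the repeat
        ends.append(end if end < n else None)
    return ends


def position_sus(t: str, p: int):
    """Leftmost SUS covering p: the shortest extension to p of a
    left-bounded SUS that starts at some i <= p."""
    best = None
    for i, end in enumerate(lsus_ends(t)[: p + 1]):
        if end is None:
            continue
        cand = (i, max(end, p))  # extend to the right if it stops before p
        if best is None or cand[1] - cand[0] < best[1] - best[0]:
            best = cand  # strict '<' keeps the leftmost among ties
    return best
```

For "banana" the left-bounded ends are [0, 4, 4, None, None, None], and the query p = 2 returns (0, 2), the leftmost of the two shortest answers.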
Theorem 2
([6]). For any location p in the string , they can find covering p using time and space. If multiple answers exist, the leftmost one is returned.
In the second case, they extend their algorithm to find all the s for every location . Instead of iteratively running the algorithm for finding a of a specific position n times, they use the following lemma to reduce the time complexity to the amortized cost for finding each in time.
Lemma 2
([6]). For any , if for position k is an extension of an , then (1) for position must be a substring whose right boundary is the character , and (2) for position k is the substring for position appended by the character .
For finding of every position, they begin with of the first position which is of that position; then, by Lemma 2, they compute of position k using the already calculated of position . Their algorithm costs for reporting the s covering a particular location. By providing an efficient constant time algorithm for computing the shortest covering each string position, they prove the following theorem.
Theorem 3.
All s corresponding to all the positions of string can be computed in time and space.
For the implementation, they use the libdivsufsort library for constructing the suffix array and the LCP array. They compare their results with Tsuruta et al.’s work and show that both algorithms have almost the same processing time; however, their space usage is at least four times less for finding a single SUS and two times less for finding all SUSs.
3.3.2. Tsuruta et al.’s Framework
Tsuruta et al. [13] improved Pei et al.’s (ICDE 2013) upper bounds for the shortest unique substring problem as well. They consider two types of queries. The first one computes a for any position in the string in constant time after preprocessing in time. Thus, reporting a for all positions in takes time in total. Their next algorithm outputs all the s covering a query position. Note that Pei et al.’s algorithm reports only one for a query position. By preprocessing in time, Tsuruta et al. provided an algorithm which reports all s containing a query position in time, where is the number of output.
Although their algorithm follows a similar technique to Ileri et al. [6], their work was independent. Their main idea is finding s from s using the following lemma.
Lemma 3
([13]). Every position- contains exactly one .
They use array data structure to compute all s of in linear time. We summarize their results in the following lemma.
Lemma 4.
All the , , and can be computed from in time.
In order to compute SUSs from MUSs, they define the concept of meaningful and meaningless MUSs, where meaningful MUSs give rise to the SUSs corresponding to some positions. Then, they present an algorithm to collect all the SUSs from the meaningful MUSs in linear time. Theorem 4 summarizes their results.
Theorem 4.
A string S of length n can be preprocessed in time and space so that the shortest unique substring queries can be answered in time, where occ is the number of shortest substrings returned. Notably, outputting a single can be done in time.
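The relationship between minimal unique substrings and position queries can be illustrated with the brute-force sketch below. It enumerates the MUSs directly from the definition and then answers a query by extending each MUS just enough to cover the query position, keeping the shortest results. The names, the 0-based inclusive intervals, and the quadratic-or-worse running time are assumptions of this illustration and do not reflect the linear-time construction of Tsuruta et al.

```python
def is_unique(t: str, i: int, j: int) -> bool:
    """True if t[i..j] (inclusive, 0-based) occurs exactly once in t."""
    sub = t[i:j + 1]
    return sum(t.startswith(sub, s) for s in range(len(t) - len(sub) + 1)) == 1


def minimal_unique_substrings(t: str) -> list[tuple[int, int]]:
    """All MUSs of t: unique substrings with no unique proper substring."""
    n = len(t)
    result = []
    for i in range(n):
        # smallest end j such that t[i..j] is unique, if any
        j = next((e for e in range(i, n) if is_unique(t, i, e)), None)
        if j is None:
            continue
        # t[i..j-1] is a repeat by the choice of j; also require that
        # dropping the first character destroys uniqueness
        if j == i or not is_unique(t, i + 1, j):
            result.append((i, j))
    return result


def sus_from_mus(t: str, p: int) -> list[tuple[int, int]]:
    """All SUSs covering p, obtained by extending each MUS to cover p."""
    candidates = {(min(i, p), max(j, p)) for i, j in minimal_unique_substrings(t)}
    shortest = min(e - s for s, e in candidates)
    return sorted(c for c in candidates if c[1] - c[0] == shortest)
```

For "banana" the MUSs are (0, 0) for "b" and (2, 4) for "nan"; a query at p = 2 returns both (0, 2) and (2, 4), each containing exactly one MUS.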
3.4. In Place and Compact Data Structures’ Approaches
The suffix tree of a string of length n occupies space almost 20 times larger than the space needed to store , which is bits. All aforementioned data structures in Section 3.3 require words, where n is the length of the string . When n is large, the memory usage would be problematic. In order to avoid this issue, Hon et al. [3] proposed a data structure including the input string and two integer arrays for storing the starting positions and ending positions of s. The preprocessing time for their algorithm is linear and takes bits of additional working space. If we want to consider the space needed to store all s for all positions, their algorithm takes the least amount of space to do that, which is words to maintain the starting and ending positions of s and another n bytes to store the input string. Previous works need space while the hidden constant is large; however, that overhead does not exist in Hon et al.’s work. They use suffix arrays instead of suffix trees in their construction. Moreover, they handle all computations in the place of two integer arrays. As a result, their algorithm can find the position- for every string position in time.
There was still a question of whether we can solve problem in sub-linear space. To answer this question, Ganguly et al. [18] presented the first time-space trade off algorithm which uses words of additional space. Given a position , their algorithm answers Problem 1 in time, where parameter . Another query is reporting s for all positions of . For this type of query, they present an algorithm using words and bits of additional working space. In addition to these deterministic algorithms, they present a randomized algorithm in which the time complexity is by using additional words, where is an arbitrary constant. There is a chance of at most that the reported substring is unique and covers the query position but not the shortest one.
In their deterministic algorithms, their intuition is that each for a position k is definitely the shortest unique prefix of for some , or the smallest right extension till position k of such prefix. For each suffix , they define as the shortest unique prefix of . In order to reduce the space complexity, instead of storing s for all , , they choose a set of suffixes and compute the corresponding s. After computing all the s of the suffixes in the chosen set, using a brute force approach, for any suffix , they can compute . Their results are summarized in Theorem 5.
Theorem 5
([18]). For any given string position , an covering p can be computed in time and additional working space. In addition, computing any for every position in the string can be reported in time and additional words and bits of additional working space.
Ganguly et al.’s algorithm reports only one for each given query. In 2019, Mieno et al. [14] presented a data structure of size bits, which can answer Problem 1 in time, where is the number of s for the given query point. The main intuition of their algorithm is based on Lemma 3. Their data structure includes two bit arrays and , each of length n, to keep track of the starting and ending positions of the s. As a result, they prove that there exists a data structure of size bits that can answer s in time. Given the bit arrays and , the data structure can be constructed in time using bits of total working space.
4. Interval-SUS Queries
In this section, we discuss a generalization of position- problem which is defined as follows:
Problem 2.
Interval Queries
Input: String and a query interval .
Output: All s of containing
Example 2.
Given and an interval query position , we need to find a shortest substring , with exactly one occurrence in such that . The output here is which includes and it does not have any other occurrence in .
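Problem 2 can likewise be illustrated by brute force: starting from the length of the query interval, try each length in increasing order and report every unique substring of the first successful length that contains the whole interval. The string "banana", the function name, and the 0-based inclusive interval convention are assumptions of this sketch, not the instance of Example 2.

```python
def interval_sus_bruteforce(t: str, s: int, e: int) -> list[tuple[int, int]]:
    """All shortest unique substrings of t that contain the 0-based
    inclusive interval [s, e], as inclusive (start, end) pairs."""
    n = len(t)
    for length in range(e - s + 1, n + 1):
        hits = []
        # starts i with i <= s and i + length - 1 >= e, staying inside t
        for i in range(max(0, e - length + 1), min(s, n - length) + 1):
            sub = t[i:i + length]
            occurrences = sum(t.startswith(sub, k) for k in range(n - length + 1))
            if occurrences == 1:
                hits.append((i, i + length - 1))
        if hits:
            return hits
    return []
```

For "banana" and the interval [1, 2] ("an"), the only answer is (0, 2), that is, "ban"; for the interval [3, 4] the answer is (2, 4), that is, "nan".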
4.1. Linear Time Approaches
Hu et al. [2] generalized Problem 1 to Problem 2 and presented a linear time and space data structure which can solve Problem 2 in time, where is the number of all s containing the query interval. The interval problem is more difficult to be solved in time, since there exist intervals and we cannot store the candidate answer for every possible interval. In order to deal with this issue, Hu et al. use the concept of , , , and Corollary 1 to prove Lemma 5.
Corollary 1
([34]). There exists a data structure of size which can be constructed in time that can check whether a given substring of is unique in time.
Lemma 5
([34]). The answer of the interval- problem with the query interval must be the shortest of the following candidates:
- 1.
- if it is unique. This can be checked in constant time and linear space.
- 2.
- : This can be computed in constant time using linear space by Lemma 4.
- 3.
- : This can be computed in constant time using linear space by Lemma 4.
- 4.
- The shortest containing : It remains to show their structure of computing this candidate.
Let M be the set of all s in . For presenting their data structure, they define a set of intervals denoted by I, such that iff is in M. Then, they reduce the interval- problem to Containment Min. Given an interval , a containment min query returns the shortest interval in I which contains . Hu et al. [2] proposed a data structure of size in which RMQ is performed to answer a containment min query in time.
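The semantics of a containment min query can be shown with a naive linear scan over the interval set. This sketch only illustrates what the query returns; it is not the RMQ-based structure of Hu et al. [2], and the function name and the 0-based inclusive intervals are assumptions.

```python
def containment_min(intervals: list[tuple[int, int]], a: int, b: int):
    """Shortest interval (i, j) in `intervals` with i <= a and b <= j,
    or None if no interval contains [a, b]. Linear scan for illustration."""
    best = None
    for i, j in intervals:
        if i <= a and b <= j:
            if best is None or j - i < best[1] - best[0]:
                best = (i, j)
    return best
```

With the interval set [(0, 0), (2, 4), (1, 6)], the query [3, 4] returns (2, 4), the query [2, 5] returns (1, 6), and the query [0, 3] returns None.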
In addition, they show how to report all s in the case that interval has more than one answer. For this purpose, they use an auxiliary problem which is defined below.
Position Constraint Query. Given a substring and two ranges and both in , the problem is to return a unique substring with the minimum length that contains such that and .
This problem is a constraint version of the interval problem where the starting and ending positions of the answer substring should be contained in the two intervals specified in the problem. They start with the normal interval- query. Given the input interval , let be the answer of the interval-. They consider two intervals and . Now, using two position constraint queries, the next answers can be found. First, they run a position constrained query with , and to return the other possible answers. The other query corresponding to is symmetrical. In total, their algorithm takes time to report all the s.
The mentioned algorithms are in the RAM model. They also consider solving Problem 2 in the standard external memory model [35]. They follow a similar technique. Assume that is the number of I/Os needed to sort n elements and B is the number of words in a block of a disk. They pre-compute an index structure from in I/Os in external memory that occupies blocks and can answer any shortest unique substring query in I/Os [2].
4.2. RLE-Based Approaches
The Run Length Encoding (RLE) of a string is a compressed representation in which each maximal run of a character c of length ℓ is encoded as . For instance, the of string is . Mieno et al. [8] considered solving Problem 2 in the case where the input string is given in representation. Their motivation was to reduce space and time complexity of processing the input string. They presented a data structure of size that can be constructed in time to answer interval- queries in time, where is the output size, and , and are the size, construction time and the query time for a predecessor/successor query on m elements from a universe of . In their approach, they use combinatorial properties on s and strings. Let m be the length of the string in representation. We denote representation of by . They show that the number of s in is . Thus, instead of dealing with s like the previous related results, a considerable amount of space and time would be reduced if there exist runs in the string. By doing so, they were able to build an size data structure for the version of problem, which is formally defined as follows:
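As a quick illustration of the encoding itself, the sketch below converts a string to a list of (character, run length) pairs and back. The function names and the pair-list representation are assumptions of this example; the surveyed data structures operate on the encoded form without decoding it.

```python
from itertools import groupby


def rle_encode(t: str) -> list[tuple[str, int]]:
    """Run-length encoding: one (character, run length) pair per maximal run."""
    return [(ch, len(list(run))) for ch, run in groupby(t)]


def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Inverse of rle_encode."""
    return "".join(ch * length for ch, length in runs)
```

For instance, "aaabbc" has three runs and is encoded as [('a', 3), ('b', 2), ('c', 1)], so here m = 3 while n = 6.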
Problem 3.
RLE- Queries
Preprocess:
Query: An interval
Return: All of containing the query interval
For solving Problem 3, they show how to precompute all the MUSs using a specific type of suffix array for RLE strings [36]. Before discussing their techniques, we introduce some definitions. Let , and be the beginning position, ending position, and exponent of the ith run of . Let P be any subset of positions of . The sparse suffix array of w.r.t. P, denoted by , is an array of size such that for all and, as in a normal suffix array, is lexicographically smaller than . They use the truncated suffix array for denoted by , in which P is the set of for . In addition, let be an array of length m such that , where . Given , all the defined arrays can be computed in time with working space [14].
In order to compute s from , they consider three disjoint partitions of , . The first partition consists of s which are contained in runs. The second one is s that start at the last characters of runs, and all the other s considered to be in the third partitions. By the fact that each cannot be a proper substring of another and discussing the size of each partition separately, they prove that . Then, they show how to find in time and space, by computing the partitions.
In order to compute , for each character used in , they check whether there exists a run of that character with a unique maximum exponent, since the number of distinct character used in is , this procedure can be done in time and space using a sorting algorithm. For computing and , they use , (inverse of ), and arrays. Similar to , and can be computed in time and space. After the preprocessing step, they could answer Problem 3 using RMQ, predecessor and successor queries on the array that stores the lengths of the s. Their results were summarized in the following theorem.
Theorem 6
([8]). Given of size m, there is a data structure of size that can be computed in time which can answer interval queries in time, where is the output size, and , , and are the size, construction time and the query time for a predecessor/successor query on m elements from a universe of .
4.3. Compact Data Structures’ Approaches
In Section 3.4, we discussed Mieno et al.’s work [14] for solving Problem 1. They also presented a compact data structure for Problem 2 [14]. Their data structure has the size of bits and answers an interval-SUS query in time, where m is the number of minimal unique substrings (MUSs) of the input string, which is at most n. Their technique is based on Mieno et al.’s work [8], which is discussed in Section 4.2. Similar to the technique for solving position-SUS, their structure is based on the two bit arrays and . In addition, they use array which has been used in Mieno et al.’s work for storing the lengths of the MUSs [8]. On top of and , successor and predecessor data structures are maintained. is also endowed with an RMQ data structure. Now, once an interval query comes, first we can find and , then we can find the range of covering . Consequently, it is sufficient to answer RMQ queries on the corresponding range on . The space needed for and and the predecessor and successor data structures is bits. The RMQ data structure on takes bits of space. Thus, using the constant query time of the RMQ data structure (see Section 2.2), their results can be summarized as follows:
Theorem 7.
There exists a data structure of size bits that can answer interval problem in time, where is the number of the answers corresponding to the given interval.
5. Approximate-SUS Queries
5.1. Motivation
In molecular biology, shortest unique substrings found in DNA sequences can be used to compare similar organisms and determine unique patterns. They also help in designing polymerase chain reaction (PCR) primers [1,15]. If we just consider exact shortest unique substrings while comparing distinct organisms, possible patterns might be disregarded due to errors or mutations. In this section, we discuss the approximate version of the SUS problem proposed by Hon et al. [3], in which the uniqueness constraint is stricter: a substring must remain unique even when up to k mismatches are allowed. This version can be applied in computational biology, where factors such as genetic mutation and experimental error make approximate string matching necessary [9]. Another useful application of this approximate version is in computing the average common substring, which has been considered as an approach to phylogenomic reconstruction [37,38]. In order to estimate the evolutionary distance between pairs of primate genomes, Thankachan et al. [39] showed that adding a similar k-mismatch parameter to the average common substring equation leads to better results [9]. A k-mismatch shortest unique substring covering a position p is a k-mismatch unique substring that covers p such that no shorter k-mismatch unique substring covers p. Note that, similar to the definition of k-mismatch in Section 2.1, we also consider the Hamming distance for the k-mismatch SUS problem. The problem is formally defined as follows:
Problem 4.
k-mismatch Queries
Input: String and integer k
Output: Two integer arrays A and B s.t is the rightmost for every index i
Example 3.
If, . Note that as we saw in Example 1. Since there exists a substring such that , cannot be a .
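The k-mismatch definition can be made concrete with the brute-force sketch below, which checks k-mismatch uniqueness by comparing a substring against every other substring of the same length and reports the rightmost shortest answer for one position. The string "banana", the function names, and the 0-based inclusive output are assumptions of this illustration; it is not the instance of Example 3 and is much slower than the frameworks discussed next.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_unique(t: str, i: int, j: int, k: int) -> bool:
    """True if no other substring of t with the same length as t[i..j]
    is within Hamming distance k of it (0-based, inclusive ends)."""
    length = j - i + 1
    sub = t[i:j + 1]
    return all(
        hamming(sub, t[s:s + length]) > k
        for s in range(len(t) - length + 1)
        if s != i
    )


def k_mismatch_sus(t: str, p: int, k: int) -> tuple[int, int]:
    """Rightmost shortest k-mismatch unique substring covering position p."""
    n = len(t)
    for length in range(1, n + 1):
        # scan candidate starts from right to left so the rightmost wins
        for i in range(min(p, n - length), max(0, p - length + 1) - 1, -1):
            if is_k_unique(t, i, i + length - 1, k):
                return (i, i + length - 1)
    raise ValueError("p must be a valid position of a non-empty string")
```

With k = 0 and p = 2 on "banana" the result is (2, 4), the exact answer "nan"; with k = 1 the substring "nan" is no longer acceptable because "ban" differs from it in only one position, and the answer grows to (1, 4), that is, "anan".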
5.2. Hon et al.’s Framework
Hon, Thankachan, and Xu considered solving Problem 4 in the place of S, A, and B, where S is for storing string , A and B are two empty arrays for storing the starting and ending positions of the rightmost k-mismatch of each position. They prove the k-mismatch version of Lemma 2. Here, we explain the high level of their in-place framework for solving both Problems 1 and 4 which has three stages.
Let be the k-mismatch left bounded shortest unique substring starting at position p. In the first stage, they compute for all is, in place of A and B. At the end of this step, each stores the ending position of . This procedure takes linear time when and time for because of their dynamic programming approach. Let be the shortest k-mismatch covering position p. In the second stage, they use array B, computed from the previous stage, to find the rightmost for all i in place of A and B. Each stores the largest j such that is an . Thus, if exists, it is equal to . This takes time for all . At the last stage, they use both A and B to compute for all i, in the place of A and B. At the end of the stage, and store the starting and ending positions of the rightmost . This step also takes linear time for any . Note that only one stage requires the quadratic time. They implemented their algorithm in C using library for the suffix array construction. Their contribution can be summarized as follows:
Theorem 8.
Using an additional $2n$ words, each of size $\log_2 n$ bits, and n bytes of space for storing the input string, Hon et al. [3] provided a theoretically in-place framework to solve both the exact and the approximate position-SUS problems using a total of $O(n)$ and $O(n^2)$ time, respectively.
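The step from left-bounded shortest unique substrings to the SUS covering each position can be illustrated with a naive sketch for the exact case (k = 0). It relies on the observation that, for a fixed start i, every extension of a unique substring is unique, so the shortest unique substring that starts at i and covers p ends at the later of p and the end of the left-bounded one. The function names are hypothetical, positions are 0-based, and the code is neither in-place nor linear-time, unlike the framework of Hon et al.

```python
def left_bounded_sus_end(t: str) -> list:
    """end[i] = smallest j such that t[i..j] occurs exactly once in t, else None."""
    n = len(t)
    end = [None] * n
    for i in range(n):
        for j in range(i, n):
            sub = t[i:j + 1]
            # Unique means: first occurrence is at i and there is none after i.
            if t.find(sub) == i and t.find(sub, i + 1) == -1:
                end[i] = j
                break
    return end


def rightmost_sus(t: str):
    """Arrays A, B with t[A[p]..B[p]] the rightmost SUS covering position p."""
    n = len(t)
    end = left_bounded_sus_end(t)
    A, B = [0] * n, [0] * n
    for p in range(n):
        best = None
        for i in range(p + 1):          # candidate start positions
            if end[i] is None:
                continue
            j = max(end[i], p)          # shortest unique extension that covers p
            # "<=" keeps the later (rightmost) start when lengths tie.
            if best is None or j - i <= best[1] - best[0]:
                best = (i, j)
        A[p], B[p] = best               # start 0 always qualifies: the whole string is unique
    return A, B
```

On "aab" this yields A = [0, 1, 2] and B = [1, 2, 2], that is, the substrings "aa", "ab", and "b".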
5.3. Allen et al.’s Framework
To reduce the quadratic time complexity of solving k-mismatch SUS queries, Allen et al. [9] presented an algorithm with a better average-case time complexity. They follow the technique provided by Thankachan et al. [39] for solving the k-mismatch average common substring problem.
Let the k-mismatch left-bounded longest repeat starting at position i be a k-mismatch repeat substring starting at i such that either it ends at position n or extending it by one character to the right yields a k-mismatch unique substring. Consider L as an array of length n where L[i] is the length of the k-mismatch left-bounded longest repeat starting at i. Now, the algorithm discussed in Section 5.2 can be applied to compute the k-mismatch SUS for each position i in linear time and space. Computing array L can thus substitute for stage one of Hon et al.'s work [3]. To avoid $O(n^2)$ time, instead of comparing every pair of suffixes of the string, they bound the expected time for calculating each entry of L using the technique in [39], in terms of m, the total length of the two suffixes used in the computation.
The total time complexity to compute every entry of L would be expected .
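To make the role of the array L concrete, the following sketch computes it by comparing every pair of suffixes, which is exactly the quadratic work that the expected-time technique avoids. It is an illustration under stated assumptions, not the method of Allen et al.: L[i] is taken to be the length of the longest k-mismatch repeat starting at position i (0-based), and the function names are hypothetical. Each offset d between two occurrences is processed with a sliding window that keeps at most k mismatches, so the total time is quadratic in the string length.

```python
def k_mismatch_llr_lengths(t: str, k: int) -> list:
    """L[i] = length of the longest substring starting at i that has another
    occurrence (at a different start) within Hamming distance k."""
    n = len(t)
    L = [0] * n
    for d in range(1, n):               # offset between the two occurrences
        m = n - d                       # number of aligned character pairs
        mism = 0                        # mismatches inside the window [i, e)
        e = 0
        for i in range(m):
            # Grow the window while it holds at most k mismatches.
            while e < m and mism + (t[e] != t[e + d]) <= k:
                mism += t[e] != t[e + d]
                e += 1
            length = e - i
            L[i] = max(L[i], length)
            L[i + d] = max(L[i + d], length)
            if e > i:
                mism -= t[i] != t[i + d]    # slide the window start past i
            else:
                e = i + 1                   # empty window: restart after i
    return L


def left_bounded_sus_ends(L: list) -> list:
    """End of the k-mismatch left-bounded SUS starting at i, or None if the
    longest repeat already reaches the end of the string."""
    n = len(L)
    return [i + L[i] if i + L[i] <= n - 1 else None for i in range(n)]
```

For "aab", the call with k = 0 returns [1, 1, 0], and with k = 1 it returns [2, 2, 1], so only position 0 still has a left-bounded unique substring once one mismatch is allowed.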
In 2020, Allen et al. [10] published a new version of their work which provides algorithmic bounds for the k-mismatch SUS problem. They presented an algorithm for Problem 4 together with worst-case time and space bounds. This worst-case bound is asymptotically much better than that of the practical algorithm. In comparison to Hon et al.'s experimental results [3], their new implementation shows that the practical algorithm is easy to implement and takes less time when k is small relative to n.
5.4. GPU Based Approach
All the works discussed so far on k-mismatch SUS queries focused on improving the time and space complexity of the computation in the sequential CPU model. In 2018, Schultz and Xu [15] presented the first parallel approach for the k-mismatch SUS problem in the shared-memory model, leveraging massively multi-threaded GPU technology. Their experimental results on a mid-end GPU show that the GPU approach is at least six times faster than the CPU approach in the case of $k = 0$ (exact match), and at least 23 times faster in the general case ($k > 0$). This matters when the input string is very long, as is the case for genomic sequences. In terms of memory usage, this approach is almost the same as the sequential CPU approaches. Their algorithms run entirely on the GPU, except for the transfer of the input string to the GPU, which is done by the CPU host. When there are no mismatches ($k = 0$), they implicitly use efficient data structures such as the suffix array and the LCP array. However, when $k > 0$, using these data structures is problematic since they do not carry enough k-mismatch information. To solve this issue, Schultz and Xu [15] designed a method that can be parallelized on the GPU architecture.
In 2019, they published an extension of their paper [16], providing more experimental results on exact and approximate SUS computation. They show the speedup gained by the GPU-based approach over the sequential solution, both including and excluding suffix array construction. The speedup differs depending on whether the suffix array is given or not. According to their results, a large part of the running time of the sequential approach is spent on suffix array construction. Thus, if the suffix array is given, it can be copied to the GPU memory directly to achieve a better speedup.
6. SUPS Queries
6.1. Motivation
A substring is called a palindrome if it is identical to its reverse. There have been many studies on palindromic strings and their combinatorial properties [32,40,41]. A palindromic string or substring is an important structure in DNA, RNA, or protein sequence analysis [22]. In biological data, palindromic structures show the ability of molecules to fold and form double-stranded stems [23]. In addition, similar palindromic structures in protein strands suggest similar secondary structures. Another application of palindromic substructures is in gene editing and gene regulation in species [22,42]. In this section, we discuss a version of SUS queries named the shortest unique palindromic substring (SUPS) problem. A shortest unique palindromic substring for an interval is a shortest palindromic substring that is unique in the string and contains the interval; that is, every shorter palindromic substring containing the interval occurs more than once in the string. The problem is formally defined as follows:
Problem 5.
SUPS Queries
Preprocess: A string of length n
Query: An interval of positions in the string
Return: All SUPSs of the string containing the query interval
Example 4.
If, given the query interval , problem outputs , which is the shortest palindrome containing .
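A brute-force sketch can make Problem 5 concrete. The function names are hypothetical, positions are 0-based and inclusive, and the code simply tries lengths in increasing order; it is not the optimal algorithm discussed in this section.

```python
def is_palindrome(s: str) -> bool:
    return s == s[::-1]


def count_occurrences(t: str, sub: str) -> int:
    """Number of (possibly overlapping) occurrences of sub in t."""
    return sum(t.startswith(sub, i) for i in range(len(t) - len(sub) + 1))


def sups(t: str, s: int, e: int) -> list:
    """All shortest unique palindromic substrings t[i..j] with i <= s and e <= j."""
    n = len(t)
    for length in range(e - s + 1, n + 1):
        found = []
        # Starts i for which t[i..i+length-1] contains the interval [s, e].
        for i in range(max(0, e - length + 1), min(s, n - length) + 1):
            sub = t[i:i + length]
            if is_palindrome(sub) and count_occurrences(t, sub) == 1:
                found.append((i, i + length - 1))
        if found:
            return found
    return []
```

For "abab" and the interval [1, 2], both "aba" and "bab" are reported, which shows why a query may have several answers; for the interval [0, 3] there is no answer because the whole string is not a palindrome.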
6.2. Optimal Approaches
In 2018, Inoue et al. [4] introduced Problem 5, which is a version of SUS queries focusing on palindromes. They propose an optimal algorithm which takes $O(n)$ time to preprocess the string and can return all SUPSs for any interval query in $O(\alpha + 1)$ time, where $\alpha$ is the number of outputs. For any interval, checking whether the corresponding substring is a palindrome takes $O(n)$ preprocessing time and constant query time using $O(n)$ space. Similar to the approach for solving position/interval SUS, they define the minimal unique palindromic substring (MUPS): a MUPS is a unique palindromic substring such that either the substring obtained by removing its first and last characters is a repeat, or its length is at most 2. Similar to minimal unique substrings, MUPSs cannot contain each other. In addition, for each SUPS for some interval, there exists exactly one MUPS that is contained in it and has the same center. Using this fact, they preprocess the string in linear time to report all MUPSs.
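The notion of a MUPS can be explored with a small brute-force sketch. It assumes the following reading of the definition: a MUPS is a unique palindromic substring whose inner part (the substring without its first and last characters) is a repeat, or whose length is at most 2. The function names are hypothetical and positions are 0-based and inclusive.

```python
def occurrences(t: str, sub: str) -> int:
    """Number of (possibly overlapping) occurrences of sub in t."""
    return sum(t.startswith(sub, i) for i in range(len(t) - len(sub) + 1))


def mups(t: str) -> list:
    """All minimal unique palindromic substrings of t as (start, end) pairs."""
    n = len(t)
    result = []
    for i in range(n):
        for j in range(i, n):
            sub = t[i:j + 1]
            if sub != sub[::-1] or occurrences(t, sub) != 1:
                continue                    # not a unique palindrome
            inner = t[i + 1:j]              # same center, two characters shorter
            if len(sub) <= 2 or occurrences(t, inner) > 1:
                result.append((i, j))
    return result
```

In "acbca" the only MUPS is the single character "b" at position 2. The unique palindromes "cbc" and "acbca" share its center and contain it, which is consistent with the property that each SUPS contains exactly one MUPS with the same center.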
In addition, they provided some bounds on the number of SUPSs. They showed that, for a string over an alphabet of arbitrary size, the maximum number of point and interval SUPSs is at most n. For a binary alphabet, they gave a lower bound on the number of point SUPSs. Consider two sets of substrings of the string: those that are a SUPS for some point, and those that are a SUPS for some interval.
Theorem 9
([4]). There exists a binary string of length such that for any . Thus, contains at least point and interval s where
In the end, they discuss the number of point and interval SUPSs on palindromic rich strings. A string of length n is palindromic rich if it has $n + 1$ distinct palindromic substrings, including the empty string. They consider palindromic rich strings and prove the following two theorems.
Theorem 10
([4]). There exists a binary string of length n such that the number of interval s is .
Theorem 11
([4]). There exists a binary string of size s.t for any . Thus, contains interval .
6.3. RLE-Based Approaches
Watanabe et al. [11] considered the shortest unique palindromic substring problem when the input string is given in its run-length encoded (RLE) representation. They showed how to preprocess a given RLE string of size m, with space and time bounds that depend on m and on the number of distinct runs of the string. Their work is the first space-economical solution for the SUPS problem, and it also bounds the query time.
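As a reminder of what the run-length encoded representation looks like, the following minimal sketch encodes a string as a list of (character, run length) pairs. The function names are hypothetical; the size m is the number of runs, and two runs are counted as the same when both the character and the length agree.

```python
from itertools import groupby


def rle(t: str) -> list:
    """Run-length encoding of t as a list of (character, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(t)]


def rle_size_and_distinct_runs(t: str):
    """Return (m, number of distinct runs) for the RLE of t."""
    runs = rle(t)
    return len(runs), len(set(runs))
```

For example, "aaabbaaa" is encoded as three runs, of which only two are distinct, because the run of three a's appears twice.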
Similar to the technique for solving the normal SUPS problem discussed in the previous subsection, they compute the MUPSs of the string in the preprocessing step. Their idea is based on the fact that the number of MUPSs of any string is at most m [11]. They first compute the palindromic substrings whose center coincides with the center of some run in the string. These palindromic substrings are called run-centered palindromes. To compute these substrings, they utilize Manacher's algorithm [43]. This algorithm scans the string and constructs an array in which the ith entry is the length of the maximal palindrome with center i. Using this technique, they compute all the run-centered maximal palindromes of the string.
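A standard formulation of Manacher's algorithm on a plain (not run-length encoded) string is sketched below. It is one common way to realize the array described above and is not the RLE-specific variant of Watanabe et al. As an implementation choice, the string is interleaved with a separator character so that the centers of odd-length and even-length palindromes are handled uniformly, which gives 2n + 1 centers; the function names are hypothetical.

```python
def manacher(t: str) -> list:
    """p[c] = length of the maximal palindrome of t centered at center c,
    where centers are indexed over the separator-interleaved string."""
    s = "#" + "#".join(t) + "#"
    n = len(s)
    p = [0] * n
    center = right = 0                  # rightmost palindrome found so far
    for i in range(n):
        if i < right:
            # Reuse the value of the mirror center, clipped at the boundary.
            p[i] = min(right - i, p[2 * center - i])
        while (i - p[i] - 1 >= 0 and i + p[i] + 1 < n
               and s[i - p[i] - 1] == s[i + p[i] + 1]):
            p[i] += 1
        if i + p[i] > right:
            center, right = i, i + p[i]
    return p


def maximal_palindromes(t: str) -> list:
    """Non-empty maximal palindromes of t as (start, end) pairs, one per center."""
    result = []
    for c, length in enumerate(manacher(t)):
        if length > 0:
            start = (c - length) // 2
            result.append((start, start + length - 1))
    return result
```

For "abba" the array is [0, 1, 0, 1, 4, 1, 0, 1, 0]: the middle entry corresponds to the even-length palindrome "abba" centered between the two b's.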
To improve the space complexity, they build a data structure which is a modified eertree of the input string that works on the run-length encoded string instead of the original one. They prove that this new data structure, the RLE-eertree, has a number of nodes that is linear in the size of the RLE string. Thus, it takes less space than the original eertree. They also bound its construction time. They used this tree to compute all the MUPSs of the string. Using the list of MUPSs, they present their algorithm for SUPS queries. Their technique is similar to Inoue et al.'s work [4], with the difference that the space complexity must stay within the RLE-based bound.
In 2020, Watanabe et al. [12] published an extension of their work by considering a variant of the SUPS problem where the query interval is also given in a run-length encoded representation. Their technique is similar to their previous approach [11]. They used combinatorial properties of maximal palindromes to obtain a space-efficient data structure for answering these queries [12].
7. Range-SUS
Range queries are a classic data structure topic with strong motivation in string processing problems [5,24,25,26]. In 2019, Abedin et al. [5] studied the local shortest unique substring which starts in a specific region or range of the string. Given a range, the problem is to return a shortest substring, of some length h and starting at a position k inside the range, that has exactly one occurrence in the range; i.e., no other position in the range is the start of the same substring, and h is minimal. Abedin et al. presented an $O(n \log n)$-word data structure which answers queries in $O(\log_w n)$ time per query in the word RAM model, where $w = \Omega(\log n)$ is the word size [5]. The Range Shortest Unique Substring problem is formally defined as follows:
Problem 6.
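Range-SUS Queries
Preprocess: A string of length n.
Query: A range of positions in the string.
Output: A pair (k, h) such that the substring of length h starting at position k is a shortest string with exactly one occurrence in the range.
If the range contains a single position, the answer is trivial. Thus, in the rest, we assume that the range contains at least two positions.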
Example 5
([5]). Given and a query , we need to find a shortest substring of with exactly one occurrence in . The output here is since is the shortest substring of with exactly one occurrence in .
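Problem 6 can be made concrete with a brute-force sketch. It assumes that occurrences are identified by their starting positions, so a candidate must start inside the range but may extend past its right end. The function name is hypothetical, positions are 0-based and inclusive, and ties are broken toward the leftmost start, which is an arbitrary choice.

```python
def range_sus(t: str, alpha: int, beta: int):
    """Return (k, h) such that t[k:k+h] is a shortest substring with exactly
    one starting position inside the range [alpha, beta]."""
    n = len(t)
    for h in range(1, n - alpha + 1):
        starts = {}
        # Group the starting positions in the range by the length-h substring they begin.
        for k in range(alpha, min(beta, n - h) + 1):
            starts.setdefault(t[k:k + h], []).append(k)
        unique = [ks[0] for ks in starts.values() if len(ks) == 1]
        if unique:
            return (min(unique), h)
    return None
```

For "abab" and the range [0, 2], the answer is (1, 1): the character "b" also occurs at position 3, but that occurrence lies outside the range, so "b" is unique locally. For the full range [0, 3], no single character is unique and the answer is "ba" at position 1.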
For each position , Abedin et al. [5] provide a data structure to keep track of the last and next occurrence of substring , denoted by and , respectively. They define and as follows:
Their main result is providing an upper bound on the size of all s, which is given in the following lemma.
Lemma 6
([5]).
By the definition, is the length of the shortest substring that starts at position k, and is the set of candidate length for the answer, which starts at position k.
Abedin et al. solved queries by reducing the problem to an top-1 rectangle stabbing query on a set of rectangles with input point . Assume that the answer for Problem 6 is . Given a query range , the answer we are looking for is the pair with the minimum h under the following conditions: , , and . Equivalently, is the pair with the minimum h, such that , , and .
In the preprocessing step, they map each into a weighted rectangle with weight h and defined as follows:
When a query arrives, the answer corresponds to the lowest-weighted rectangle stabbed by the query point. By combining the optimal data structure for top-1 rectangle stabbing presented by Chan et al. [44] and the bound on the number of candidate lengths in Lemma 6, they prove the following result.
Theorem 12.
There exists an $O(n \log n)$-word data structure which answers range-SUS queries in $O(\log_w n)$ time per query in the word RAM model, where $w = \Omega(\log n)$ is the word size.
8. Discussion and Future Work
In this paper, we reviewed several types of shortest unique substring queries and their corresponding solutions. All the problems discussed in this manuscript are well-motivated topics in string processing and computational biology. Although we discussed more than 10 approaches for variants of SUS queries, there still exist related topics which have not been studied, or for which no efficient algorithm is known. In this section, we discuss some of these topics and open questions for future work:
- We discussed the solutions for approximate SUS queries in Section 5. However, there is no efficient in-place algorithm which first finds the minimal unique substrings and then derives the shortest unique substrings from them. Another technique that can be applied to approximate SUS queries is considering the run-length encoded (RLE) representation of the input string. Section 4.2 shows this technique for solving interval-SUS queries. To our knowledge, an RLE-based approach for solving approximate SUS queries has not been studied. In addition, there is no work considering the standard external memory model for solving the approximate SUS problem. As I/O-efficient constructions of the suffix array and the LCP array exist [45,46,47,48], it seems possible to replace the RAM model algorithm for the construction of these arrays with an external memory one.
- In Section 4.2, one term in the query time of Theorem 6 is the time for performing dynamic predecessor/successor queries within the space used by the data structure [8]. To make the query time faster using the same space, the question is whether there exists a data structure of the same size that can efficiently answer Problem 3 without using predecessor/successor queries.
- As we discussed in Section 6, palindromic substrings are well motivated in computational biology. All the reviewed works are on finding exact SUPSs. Similar to the approximate SUS problem, the approximate SUPS query is also worth studying in order to account for errors and mutations. Besides the definition in Problem 5, the following definition is well motivated in bioinformatics: a nucleotide sequence is considered a palindrome if the reverse of its complementary strand is equal to the original sequence [49]. The question is whether the methods discussed in Section 6 can be applied to efficiently solve this problem.
- The last topic that we discussed was the range-SUS problem. According to Theorem 12, range-SUS queries can be solved in $O(\log_w n)$ time using a data structure of size $O(n \log n)$ words. The question is whether we can design an efficient data structure that uses asymptotically fewer words for the range-SUS problem. In addition, the approximate version of range-SUS queries has not been studied. It may be possible to combine the technique discussed in Section 7 and the framework of Thankachan et al. [50] to provide an efficient algorithm for the approximate range-SUS problem.
- Besides shortest unique substrings, Maximal Unique Matches is an important concept in computational biology for aligning two long genome sequences [51]. Ganguly et al. [18] applied a technique similar to the one discussed in Section 3.4 to find maximal unique matches of two strings. As far as we are aware, the dynamic version of this problem (when mismatches are allowed) has not been studied yet. We believe that, by modifying the techniques for the dynamic longest common substring problem (longest common substring after k mismatches) [52,53,54], the approximate Maximal Unique Matches problem can be solved in subquadratic time.
Author Contributions
Methodology, writing—original draft, P.A.; writing—review and editing, M.O.K.; supervision, writing—review and editing, S.V.T. All authors have read and agreed to the published version of the manuscript.
Funding
This work is supported in part by the U.S. National Science Foundation (NSF) under CCF-1703489.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Pei, J.; Wu, W.C.H.; Yeh, M.Y. On Shortest Unique Substring Queries. In Proceedings of the 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, Australia, 8–11 April 2013; pp. 937–948.
2. Hu, X.; Pei, J.; Tao, Y. Shortest Unique Queries on Strings. In Proceedings of the String Processing and Information Retrieval-21st International Symposium, SPIRE 2014, Ouro Preto, Brazil, 20–22 October 2014; Lecture Notes in Computer Science. de Moura, E.S., Crochemore, M., Eds.; Springer: Cham, Switzerland, 2014; Volume 8799, pp. 161–172.
3. Hon, W.; Thankachan, S.V.; Xu, B. In-place algorithms for exact and approximate shortest unique substring problems. Theor. Comput. Sci. 2017, 690, 12–25.
4. Inoue, H.; Nakashima, Y.; Mieno, T.; Inenaga, S.; Bannai, H.; Takeda, M. Algorithms and combinatorial properties on shortest unique palindromic substrings. J. Discrete Algorithms 2018, 52, 122–132.
5. Abedin, P.; Ganguly, A.; Pissis, S.P.; Thankachan, S.V. Range Shortest Unique Substring Queries. In Proceedings of the International Symposium on String Processing and Information Retrieval, Segovia, Spain, 7–9 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 258–266.
6. Ileri, A.M.; Külekci, M.O.; Xu, B. Shortest unique substring query revisited. In Symposium on Combinatorial Pattern Matching; Springer: Berlin/Heidelberg, Germany, 2014; pp. 172–181.
7. Ileri, A.M.; Külekci, M.O.; Xu, B. A simple yet time-optimal and linear-space algorithm for shortest unique substring queries. Theor. Comput. Sci. 2015, 562, 621–633.
8. Mieno, T.; Inenaga, S.; Bannai, H.; Takeda, M. Shortest Unique Substring Queries on Run-Length Encoded Strings. In Proceedings of the 41st International Symposium on Mathematical Foundations of Computer Science, MFCS 2016, Kraków, Poland, 22–26 August 2016; LIPIcs, Faliszewski, P., Muscholl, A., Niedermeier, R., Eds.; Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik: Dagstuhl, Germany, 2016; Volume 58, pp. 69:1–69:11.
9. Allen, D.R.; Thankachan, S.V.; Xu, B. A Practical and Efficient Algorithm for the k-mismatch Shortest Unique Substring Finding Problem. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, BCB 2018, Washington, DC, USA, 29 August–1 September 2018; Shehu, A., Wu, C.H., Boucher, C., Li, J., Liu, H., Pop, M., Eds.; ACM: New York, NY, USA, 2018; pp. 428–437.
10. Allen, D.R.; Thankachan, S.V.; Xu, B. An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020.
11. Watanabe, K.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Shortest Unique Palindromic Substring Queries on Run-Length Encoded Strings. In Proceedings of the Combinatorial Algorithms-30th International Workshop, IWOCA 2019, Pisa, Italy, 23–25 July 2019; pp. 430–441.
12. Watanabe, K.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Fast Algorithms for the Shortest Unique Palindromic Substring Problem on Run-Length Encoded Strings. Theory Comput. Syst. 2020.
13. Tsuruta, K.; Inenaga, S.; Bannai, H.; Takeda, M. Shortest Unique Substrings Queries in Optimal Time. In Proceedings of the SOFSEM 2014: Theory and Practice of Computer Science-40th International Conference on Current Trends in Theory and Practice of Computer Science, Nový Smokovec, Slovakia, 26–29 January 2014; Lecture Notes in Computer Science. Geffert, V., Preneel, B., Rovan, B., Stuller, J., Tjoa, A.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2014; Volume 8327, pp. 503–513.
14. Mieno, T.; Köppl, D.; Nakashima, Y.; Inenaga, S.; Bannai, H.; Takeda, M. Compact Data Structures for Shortest Unique Substring Queries. In Proceedings of the International Symposium on String Processing and Information Retrieval, Segovia, Spain, 7–9 October 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 107–123.
15. Schultz, D.W.; Xu, B. On k-Mismatch Shortest Unique Substring Queries Using GPU. In Proceedings of the Bioinformatics Research and Applications-14th International Symposium, ISBRA 2018, Beijing, China, 8–11 June 2018; pp. 193–204.
16. Schultz, D.W.; Xu, B. Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019.
17. Hon, W.; Thankachan, S.V.; Xu, B. An In-place Framework for Exact and Approximate Shortest Unique Substring Queries. In Proceedings of the Algorithms and Computation-26th International Symposium, ISAAC 2015, Nagoya, Japan, 9–11 December 2015; pp. 755–767.
18. Ganguly, A.; Hon, W.K.; Shah, R.; Thankachan, S.V. Space-time trade-offs for the shortest unique substring problem. In Proceedings of the 27th International Symposium on Algorithms and Computation (ISAAC 2016), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Sydney, Australia, 12–14 December 2016.
19. Haubold, B.; Pierstorff, N.; Möller, F.; Wiehe, T. Genome comparison without alignment using shortest unique substrings. BMC Bioinform. 2005, 6, 123.
20. Tarhio, J.; Peltola, H. String matching in the DNA alphabet. Software Pract. Exp. 1997, 27, 851–861.
21. Adas, B.; Bayraktar, E.; Faro, S.; Moustafa, I.E.; Külekci, M.O. Nucleotide Sequence Alignment and Compression via Shortest Unique Substring. In Proceedings of the Bioinformatics and Biomedical Engineering-Third International Conference, IWBBIO 2015, Granada, Spain, 15–17 April 2015; Lecture Notes in Computer Science. Guzman, F.M.O., Rojas, I., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9044, Part II. pp. 363–374.
22. Kim, H.; Han, Y.S. OMPPM: Online multiple palindrome pattern matching. Bioinformatics 2016, 32, 1151–1157.
23. Kolpakov, R.; Kucherov, G. Searching for gapped palindromes. Theor. Comput. Sci. 2009, 410, 5365–5373.
24. Amir, A.; Apostolico, A.; Landau, G.M.; Levy, A.; Lewenstein, M.; Porat, E. Range LCP. J. Comput. Syst. Sci. 2014, 80, 1245–1253.
25. Abedin, P.; Ganguly, A.; Hon, W.K.; Matsuda, K.; Nekrich, Y.; Sadakane, K.; Shah, R.; Thankachan, S.V. A linear-space data structure for range-LCP queries in poly-logarithmic time. Theor. Comput. Sci. 2020, 163, 245–251.
26. Kociumaka, T.; Radoszewski, J.; Rytter, W.; Waleń, T. Internal pattern matching queries in a text and applications. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, Portland, OR, USA, 5–7 January 2014; SIAM: Philadelphia, PA, USA, 2014; pp. 532–551.
27. Weiner, P. Linear Pattern Matching Algorithms. In Proceedings of the 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), Iowa City, IA, USA, 15–17 October 1973; IEEE Computer Society: Washington, DC, USA, 1973; pp. 1–11.
28. Manber, U.; Myers, G. Suffix arrays: A new method for online string searches. SIAM J. Comput. 1993, 22, 935–948.
29. Kärkkäinen, J.; Sanders, P. Simple linear work suffix array construction. In Proceedings of the International Colloquium on Automata, Languages, and Programming, Eindhoven, The Netherlands, 30 June–4 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 943–955.
30. Fischer, J.; Heun, V. Space-Efficient Preprocessing Schemes for Range Minimum Queries on Static Arrays. SIAM J. Comput. 2011, 40, 465–492.
31. Willard, D.E. Log-Logarithmic Worst-Case Range Queries are Possible in Space Theta(N). Inf. Process. Lett. 1983, 17, 81–84.
32. Rubinchik, M.; Shur, A.M. EERTREE: An efficient data structure for processing palindromes in strings. In International Workshop on Combinatorial Algorithms; Springer: Berlin/Heidelberg, Germany, 2015; pp. 321–333.
33. Ukkonen, E. On-line construction of suffix trees. Algorithmica 1995, 14, 249–260.
34. Pei, J.; Wu, W.C.; Yeh, M. On shortest unique substring queries. In Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013; Jensen, C.S., Jermaine, C.M., Zhou, X., Eds.; IEEE Computer Society: Washington, DC, USA, 2013; pp. 937–948.
35. Aggarwal, A.; Vitter, J.S. The input/output complexity of sorting and related problems. Commun. ACM 1988, 31, 1116–1127.
36. Tamakoshi, Y.; Goto, K.; Inenaga, S.; Bannai, H.; Takeda, M. An opportunistic text indexing structure based on run length encoding. In Proceedings of the International Conference on Algorithms and Complexity, Paris, France, 20–22 May 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 390–402.
37. Ulitsky, I.; Burstein, D.; Tuller, T.; Chor, B. The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 2006, 13, 336–350.
38. Hooshmand, S.; Tavakoli, N.; Abedin, P.; Thankachan, S.V. On computing average common substring over run length encoded sequences. Fundam. Informaticae 2018, 163, 267–273.
39. Thankachan, S.V.; Chockalingam, S.P.; Liu, Y.; Apostolico, A.; Aluru, S. ALFRED: A practical method for alignment-free distance computation. J. Comput. Biol. 2016, 23, 452–460.
40. Bannai, H.; Gagie, T.; Inenaga, S.; Kärkkäinen, J.; Kempa, D.; Piątkowski, M.; Puglisi, S.J.; Sugimoto, S. Diverse palindromic factorization is NP-complete. In Proceedings of the International Conference on Developments in Language Theory, Liverpool, UK, 27–30 July 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 85–96.
41. Borozdin, K.; Kosolobov, D.; Rubinchik, M.; Shur, A.M. Palindromic length in linear time. In Proceedings of the 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Warsaw, Poland, 4–6 July 2017.
42. Mali, P.; Esvelt, K.M.; Church, G.M. Cas9 as a versatile tool for engineering biology. Nat. Methods 2013, 10, 957–963.
43. Manacher, G. A New Linear-Time "On-Line" Algorithm for Finding the Smallest Initial Palindrome of a String. J. ACM (JACM) 1975, 22, 346–351.
44. Chan, T.M.; Larsen, K.G.; Patrascu, M. Orthogonal Range Searching on the RAM, Revisited. In Proceedings of the 27th Annual Symposium on Computational Geometry 2011, Paris, France, 13–15 June 2011; pp. 1–10.
45. Kärkkäinen, J.; Kempa, D.; Puglisi, S.J. Parallel external memory suffix sorting. In Annual Symposium on Combinatorial Pattern Matching; Springer: Berlin/Heidelberg, Germany, 2015; pp. 329–342.
46. Kärkkäinen, J.; Kempa, D.; Puglisi, S.J.; Zhukova, B. Engineering external memory induced suffix sorting. In Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX), Barcelona, Spain, 17–18 January 2017; pp. 98–108.
47. Kärkkäinen, J.; Kempa, D. Faster external memory LCP array construction. In Proceedings of the 24th Annual European Symposium on Algorithms (ESA 2016), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Aarhus, Denmark, 22–24 August 2016.
48. Kärkkäinen, J.; Kempa, D. LCP array construction using O(sort(n)) (or less) I/Os. In Proceedings of the International Symposium on String Processing and Information Retrieval, Beppu, Japan, 18–20 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 204–217.
49. Anjana, R.; Shankar, M.; Vaishnavi, M.K.; Sekar, K. A method to find palindromes in nucleic acid sequences. Bioinformation 2013, 9, 255.
50. Thankachan, S.V.; Aluru, C.; Chockalingam, S.P.; Aluru, S. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In Proceedings of the International Conference on Research in Computational Molecular Biology, Paris, France, 21–24 April 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 211–224.
51. Delcher, A.L.; Kasif, S.; Fleischmann, R.D.; Peterson, J.; White, O.; Salzberg, S.L. Alignment of whole genomes. Nucleic Acids Res. 1999, 27, 2369–2376.
52. Kociumaka, T.; Radoszewski, J.; Starikovskaya, T. Longest common substring with approximately k mismatches. Algorithmica 2019, 81, 2633–2652.
53. Abedin, P.; Hooshmand, S.; Ganguly, A.; Thankachan, S.V. The heaviest induced ancestors problem revisited. In Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Qingdao, China, 2–4 July 2018.
54. Flouri, T.; Giaquinta, E.; Kobert, K.; Ukkonen, E. Longest common substrings with k mismatches. Inf. Process. Lett. 2015, 115, 643–647.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).