Quantum Algorithms for Some Strings Problems Based on Quantum String Comparator

We study algorithms for solving three problems on strings. These are sorting n strings of length k, “the Most Frequent String Search Problem”, and “searching intersection of two sequences of strings”. We construct quantum algorithms that are faster than classical (randomized or deterministic) counterparts for each of these problems. The quantum algorithms are based on the quantum procedure for comparing two strings of length k in O(√k) queries. The first problem is sorting n strings of length k. We show that the classical complexity of the problem is Θ(nk) for a constant-size alphabet, but our quantum algorithm has Õ(n√k) complexity. The second one is searching for the most frequent string among n strings of length k. We show that the classical complexity of the problem is Θ(nk), but our quantum algorithm has Õ(n√k) complexity. The third problem is searching for an intersection of two sequences of strings. All strings have the same length k. The size of the first sequence is n, and the size of the second sequence is m. We show that the classical complexity of the problem is Θ((n+m)k), but our quantum algorithm has Õ((n+m)√k) complexity.

In this paper, we consider three problems:
• Strings Sorting problem;
• the Most Frequent String Search problem;
• Intersection of Two String Sequences problem.
Our algorithms use some quantum algorithms as a subroutine, and the remaining part is classical. We investigate the problems in terms of query complexity. The query model is one of the most popular in the case of quantum algorithms. Such algorithms can do a query to a black box that has access to the sequence of strings. As a running time of an algorithm, we mean a number of queries to the black box.
In the paper, we suggest a quantum comparison procedure for two strings. We show that its quantum complexity is Θ(√k), where k is the length of strings. The classical complexity is Θ(k). Thus, the quantum algorithm has a quadratic speed-up compared to classical algorithms. We propose a quantum algorithm that is based on the algorithm for "the first one search" problem (finding the minimal element satisfying a condition) from [23][24][25][26]. This algorithm is a modification of Grover's search algorithm [27,28]. Another important algorithm for the search is described in [29]. Using this idea, we obtain quantum algorithms for several problems.
The first problem is the String Sorting problem. Assume that we have n strings of length k. It is known [30] that sorting arbitrary comparable objects requires Ω(n log n) queries even for quantum algorithms. At the same time, several researchers tried to improve the hidden constant [31,32]. Other researchers investigated the space bounded case [33]. We focus on sorting strings. In the classical case, we can use an algorithm that is better than sorting algorithms for arbitrary comparable objects. It is radix sort, which has O(nk) query complexity [34] for a finite size alphabet. This matches the lower bound Ω(nk) for classical (randomized or deterministic) algorithms. Our quantum algorithm for the string sorting problem has query complexity O(n(log n) · √k) = Õ(n√k), where Õ does not consider log factors. It is based on standard sorting algorithms such as Merge sort [34] or Heapsort [34,35] and the quantum algorithm for comparing strings. Additionally, we use the idea of a noisy comparison procedure for sorting [36].
The second problem is the following. We have n strings of length k. We can assume that string symbols are letters from any constant size alphabet, for example, binary, Latin alphabet, or Unicode. The problem is finding the string that occurs in the sequence most often. The problem [37] is one of the most well-studied ones in the area of data streams [38][39][40][41]. Many applications in packet routing, telecommunication logging, and tracking keyword queries in search engines are critically based on such routines. The best-known classical (randomized or deterministic) algorithms require Ω(nk) queries because an algorithm should at least test all symbols of all strings. The deterministic solution can use the radix sort algorithm [34] or the Trie (prefix tree) [42][43][44][45], which allow achieving the required complexity.
We propose a quantum algorithm that is based on the sorting algorithm from the first problem. Our algorithm for the most frequent string search problem has query complexity O(n(log n) · √k) = Õ(n√k). If log₂ n = o(√k), then our algorithm is better than classical counterparts. Note that this setup makes sense in practical cases.
The third problem is the Intersection of Two String Sequences problem. Assume that we have two sequences of strings of length k. The size of the first sequence is n, and the size of the second one is m. The first sequence is given, and the second one is given in an online fashion, one by one. After each requested string from the second sequence, we want to check whether this string belongs to the first sequence. We propose a quantum algorithm for the problem with quantum query complexity O((n + m(log m + log log n)) · log n · √k) = Õ((n + m)√k). The algorithm uses a quantum algorithm for sorting strings. At the same time, the best-known classical (randomized or deterministic) algorithm requires Ω((n + m)k) queries, and this bound is achieved using the radix sort algorithm or the Trie data structure.
The paper is an extended version of a conference paper [46]. The structure of the paper is the following. Discussion on the computation model is situated in Section 2. We present the quantum subroutine that compares two strings in Section 3. Then, we discuss three problems: Strings Sorting problem in Section 4, the Most Frequent String Search problem in Section 5, and Intersection of Two String Sequences problem in Section 6. Section 7 contains the conclusions.

Preliminaries
We use the standard form of the quantum query model. Let f : D → {0, 1}, D ⊆ {0, 1}^N be an N-variable function. An input for the function is x ∈ D. We are given oracle access to the input x, i.e., it is realized by a specific unitary transformation usually defined as $|i\rangle|z\rangle|w\rangle \to |i\rangle|z + x_i \pmod 2\rangle|w\rangle$, where the $|i\rangle$ register indicates the index of the variable we are querying, $|z\rangle$ is the output register, and $|w\rangle$ is some auxiliary workspace. Note that we use Dirac notation for vectors. An algorithm in the query model consists of alternating applications of arbitrary unitaries independent of the input and the query unitary, followed by a measurement at the end.
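The action of the query unitary on computational basis states can be illustrated with a small classical sketch. This is only a bookkeeping aid under stated assumptions: the class name `QueryOracle` and its `queries` counter are hypothetical, and the sketch does not model superpositions.

```python
class QueryOracle:
    """Acts on basis states |i>|z>|w> and counts queries (illustrative only)."""

    def __init__(self, x):
        self.x = list(x)      # hidden input bits x_1, ..., x_N
        self.queries = 0      # number of oracle calls made so far

    def apply(self, i, z, w):
        """Map |i>|z>|w> to |i>|z + x_i (mod 2)>|w>; the index i is 1-based."""
        self.queries += 1
        return i, (z + self.x[i - 1]) % 2, w


oracle = QueryOracle([0, 1, 1])
state = oracle.apply(2, 0, 7)      # the output register now holds x_2
restored = oracle.apply(*state)    # a second query undoes the first one
print(state, restored, oracle.queries)
```

Applying the query twice restores the output register, which is the property used later when auxiliary registers are cleared by repeating the same queries.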
In the case of non-binary input, we present the input variables in binary form. Using alternating unitaries independent of the input, we can store bits in the auxiliary work-space $|w\rangle$ and use the obtained variable in an algorithm. In the case of computing a complex function f and additionally non-binary input, we can consider a block of alternating unitaries independent of the input and the query unitary that stores required variables in the auxiliary work-space $|w\rangle$. Then, we compute the Boolean value of the function f on the arguments and store it in the auxiliary work-space $|w\rangle$. After that, we can use the value of the function f in our algorithms.
The smallest number of queries for an algorithm that outputs f(x) with a probability that is at least 2/3 on all x is called the quantum query complexity of the function f and is denoted by Q(f). We refer the readers to [1][2][3] for more details on quantum computing.
In the quantum algorithms in this article, we discuss quantum query complexity. We use modifications of Grover's search algorithm [27,28] as quantum subroutines. For these subroutines, time complexity (number of gates in a circuit) exceeds query complexity by an additional log factor. Note that the query can be implemented using the CNOT gate.

The Quantum Algorithm for Comparing Two Strings
Firstly, we discuss a quantum subroutine that compares two strings of length k. Assume that this subroutine is COMPARE_STRINGS(s, t, k), and it compares s and t in the lexicographical order. It returns a value that indicates whether s precedes t, s equals t, or s follows t.
As a base for our algorithm, we use the algorithm of finding the minimal argument with 1-result of a Boolean-valued function (Lemma 1): for a function $f : \{1, \dots, k\} \to \{0, 1\}$, it searches for $j_0 = \min\{j \in \{1, \dots, k\} : f(j) = 1\}$.
Let us choose the function $f(j) = (s_j \neq t_j)$. Thus, we search for $j_0$ that is the index of the first unequal symbol of the strings. Then, we can claim that s precedes t in the lexicographical order if the symbol $s_{j_0}$ precedes the symbol $t_{j_0}$. The claim is right by the definition of the lexicographical order. If there are no unequal symbols, then the strings are equal.
If we discuss the implementation of f, then we can say that for computing the value f(j), we store the binary representation of $s_j$ and $t_j$ in the auxiliary work-space, for example, in $|\psi_s\rangle$ and $|\psi_t\rangle$. Then, we compute the value of f(j) and store it in a qubit $|\phi\rangle$. After that, we can use this value in the algorithm. The last step is clearing $|\phi\rangle$ using the values of $|\psi_s\rangle$ and $|\psi_t\rangle$ and the CNOT gate; then, clearing $|\psi_s\rangle$ and $|\psi_t\rangle$ by repeating the same queries (that use CNOT gates). All these manipulations take a constant number of queries because of the constant size of the input alphabet.
We use THE_FIRST_ONE_SEARCH(f, k) as a subroutine from Lemma 1, where $f(j) = (s_j \neq t_j)$. Assume that this subroutine returns k + 1 if it does not find any solution or the found argument j′ is such that f(j′) = 0.
We use the standard technique of boosting success probability. Thus, we repeat the subroutine $\log_2(\delta^{-1})$ times and return the minimal answer.
Suppose the subroutine has an error. There are two cases. The first one is finding the index of unequal symbols that is not the minimal one. In the second case, the algorithm does not find unequal symbols. Then, we assume that it returns k + 1. Thus, in a case of an error, the subroutine returns a value that is bigger than the correct answer.
Therefore, if at least one subroutine invocation has no error, then the whole algorithm succeeds. All error events are independent. The error probability of the whole algorithm is the probability of error for all invocations of the subroutine, that is, $O\left(2^{-\log_2(\delta^{-1})}\right) = O(\delta)$ when each invocation fails with probability at most 1/2.
Let us present Algorithm 1.
Algorithm 1 COMPARE_STRINGS(s, t, k). The Quantum Algorithm for Comparing Two Strings.
The next property follows from the previous discussion.
Lemma 2. Algorithm 1 compares two strings of length k in the lexicographical order with query complexity $O(\sqrt{k} \log \delta^{-1})$ and error probability O(δ) for some integer k and 0 < δ < 1.
The algorithm finds the minimal index of unequal symbols $j_0$. We can say that $j_0 − 1$ is the length of the longest common prefix for these strings.
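The control flow of COMPARE_STRINGS can be illustrated with a purely classical simulation. This is a minimal sketch under stated assumptions, not the quantum procedure: `first_one_search_noisy` is a hypothetical stand-in for THE_FIRST_ONE_SEARCH that fails with probability `p_fail` by returning a value larger than the correct answer, and the return convention −1, 0, 1 for s < t, s = t, s > t is assumed here for illustration.

```python
import math
import random


def first_one_search_noisy(f, k, rng, p_fail=0.5):
    """Classical stand-in for THE_FIRST_ONE_SEARCH(f, k).

    Returns the minimal j in 1..k with f(j) true, or k + 1 if there is none.
    With probability p_fail it errs as described in the text: the result is
    larger than the correct answer (a later unequal index, or k + 1).
    """
    correct = next((j for j in range(1, k + 1) if f(j)), k + 1)
    if correct <= k and rng.random() < p_fail:
        later = [j for j in range(correct + 1, k + 1) if f(j)]
        return rng.choice(later + [k + 1])
    return correct


def compare_strings(s, t, k, delta=0.01, rng=None, p_fail=0.5):
    """Return -1, 0 or 1 for s < t, s == t, s > t (assumed convention)."""
    rng = rng or random.Random()
    repetitions = max(1, math.ceil(math.log2(1 / delta)))

    def f(j):
        return s[j - 1] != t[j - 1]   # j is 1-based, as in the text

    # Errors only overshoot, so the minimum over all repetitions is kept.
    j0 = min(first_one_search_noisy(f, k, rng, p_fail) for _ in range(repetitions))
    if j0 == k + 1:
        return 0
    return -1 if s[j0 - 1] < t[j0 - 1] else 1


print(compare_strings("abca", "abda", 4, rng=random.Random(1)))
```

Because every erroneous run returns an index that is too large, taking the minimum is safe: one correct run among the repetitions is enough to recover $j_0$.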
We can show that the lower bound for the problem is Ω(√k).
Lemma 3. Any quantum algorithm for the Comparing Two Strings problem has Ω(√k) query complexity.
If z < k/2 , then t > s. If z = k/2 , then t = s. If z > k/2 , then t < s. Therefore, the problem is at least as hard as the search for 1 among the first k/2 variables in the string t.
It is known [14] that the quantum query complexity of the unstructured search among k/2 variables is Ω( √ k).
At the same time, the classical complexity of the problem is Θ(k).

Lemma 4. Randomized query complexity for Comparing Two Strings problem is Θ(k).
Proof. Due to the proof of Lemma 3, the problem is at least as hard as the search for 1 among the first k/2 variables in the string t. It is known [14] that the randomized query complexity of the unstructured search among k/2 variables is Ω(k).
At the same time, we can check all symbols sequentially to search the first unequal symbol. This algorithm has O(k) query complexity.
Additionally, we can compute the complexity of any algorithm based on the two strings comparison procedure.
Lemma 5. Suppose we have some integer n, integer A = A(n) and ε such that $\lim_{n \to \infty} \varepsilon/A = 0$. Then, if a quantum algorithm does A(n) comparisons of strings of length k and has O(ε) error probability, then it does at most $O(A\sqrt{k}\log(A/\varepsilon))$ queries.
Proof. As a strings comparison procedure, we use the COMPARE_STRINGS subroutine with δ = ε/A. Because of Lemma 2, the complexity of the subroutine is $O(\sqrt{k}\log(A/\varepsilon))$, and the error probability is O(ε/A). Because of A comparison operations, the total complexity of the algorithm is $O(A\sqrt{k}\log(A/\varepsilon))$. Let us discuss the error probability. Events of error in the algorithm are independent, and the algorithm is correct when all comparisons are correct. The error probability for one event is $1 - (1 - \varepsilon/A) = \varepsilon/A$. Hence, the probability that at least one of the A events is erroneous is $1 - (1 - \varepsilon/A)^A$. Note that $(1 - \varepsilon/A)^A \geq 1 - \varepsilon$ by Bernoulli's inequality. Hence, the total error probability is at most O(ε).
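The error accounting in this proof can be checked numerically. The following sketch assumes that each of the A comparisons fails independently with probability exactly ε/A; the function name `total_error` is hypothetical.

```python
def total_error(A, eps):
    """Probability that at least one of A independent comparisons fails,
    when each comparison fails with probability eps / A."""
    return 1.0 - (1.0 - eps / A) ** A


for A in (1, 10, 1000, 10**6):
    # The total error stays below eps for every number of comparisons A.
    print(A, total_error(A, 0.1))
```

The printed values approach 1 − e^(−ε) as A grows, which is below ε, so splitting the error budget evenly across comparisons keeps the total error within O(ε).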

Strings Sorting Problem
Let us consider the following problem.
Problem. For some positive integers n and k, we have the sequence of strings $s = (s^1, \dots, s^n)$. Each $s^i = (s^i_1, \dots, s^i_k) \in \Sigma^k$ for some finite size alphabet Σ. We search for an order ORDER = $(i_1, \dots, i_n)$ such that for any j ∈ {1, …, n − 1}, we have $s^{i_j} \leq s^{i_{j+1}}$ in the lexicographical order.
We use one of the existing sorting algorithms (for example, the Heapsort algorithm [34,35] or the Merge sort algorithm [34]) as a base and the quantum algorithm for string comparison from Section 3. In fact, our comparison function can have errors. That is why we use the result for "noisy computation" from [36]. The result is presented in the following lemma.
Lemma 6 ([36], Theorem 3.5). Suppose we have a comparison procedure that works with error probability ε. Then there is a sorting algorithm with query complexity O(n log(n/ε)) and error probability at most ε.
The complexity of the algorithm is presented in the following theorem.
Theorem 1. The algorithm sorts $s = (s^1, \dots, s^n)$ with query complexity O(n(log n) · √k) = Õ(n√k) and constant error probability.
Proof. The correctness of the algorithm follows from the description. Let ε = 0.1. Then, we apply the result from Lemma 6 and use the quantum comparison procedure that has ε error probability and O(√k) query complexity. Therefore, the query complexity of the algorithm is O(n(log(n/ε)) · √k) = O(n(log n) · √k) = Õ(n√k), and the error probability is ε.
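The reduction from sorting to string comparison can be sketched classically. This is an illustration under stated assumptions: Python's built-in `sorted` (Timsort) stands in for Heapsort or Merge sort, `exact_compare` is a hypothetical error-free replacement for COMPARE_STRINGS, and the noisy-comparison handling of [36] is not modelled.

```python
from functools import cmp_to_key


def exact_compare(s, t, k):
    """Error-free stand-in for COMPARE_STRINGS: returns -1, 0 or 1."""
    for j in range(k):
        if s[j] != t[j]:
            return -1 if s[j] < t[j] else 1
    return 0


def sort_strings(strings, compare=exact_compare):
    """Return ORDER = (i_1, ..., i_n), 1-based, with s^{i_j} <= s^{i_{j+1}}."""
    k = len(strings[0]) if strings else 0
    key = cmp_to_key(lambda a, b: compare(strings[a - 1], strings[b - 1], k))
    # The sort touches the strings only through the comparator.
    return sorted(range(1, len(strings) + 1), key=key)


print(sort_strings(["ba", "ab", "aa", "ab"]))
```

Since the sorting routine sees the strings only through the comparator, swapping in a comparator with O(√k) query cost changes the per-comparison cost without touching the sorting logic.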
We can show the lower bound for the problem. In that case, the answer is ORDER = $(i_1, \dots, i_{n-1}, u)$, where $(i_1, \dots, i_{n-1})$ is a permutation of integers from {1, …, u − 1, u + 1, …, n}. The search for the required index u is at least as hard as the search for the 1-value variable $s^u_v$. It is known [14] that the quantum complexity of the unstructured search among nk variables is Ω(√(nk)).
The lower bound for classical complexity can be proven in the same way as in Theorem 2.
Theorem 3. The randomized query complexity of the Sorting problem is Θ(nk).
Proof. Due to the proof of Theorem 2, the problem is at least as hard as the search for 1 among nk variables in the strings $s^1, \dots, s^n$.
It is known [14] that the randomized query complexity of the unstructured search among nk variables is Ω(nk).
The Radix sort [34] algorithm reaches this bound and has O(nk) complexity in a case of a finite alphabet.
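For comparison, the classical O(nk) upper bound can be illustrated with a least-significant-digit radix sort. This is a minimal sketch, assuming strings of equal length and an `alphabet` argument (a hypothetical parameter) that lists the symbols in increasing order.

```python
def radix_sort_order(strings, alphabet):
    """LSD radix sort: returns the 1-based ORDER, reading each symbol once."""
    rank = {symbol: r for r, symbol in enumerate(alphabet)}
    k = len(strings[0]) if strings else 0
    order = list(range(1, len(strings) + 1))
    for pos in range(k - 1, -1, -1):            # last symbol first
        buckets = [[] for _ in alphabet]
        for idx in order:                       # one symbol read per string
            buckets[rank[strings[idx - 1][pos]]].append(idx)
        # Concatenating buckets in alphabet order keeps each pass stable.
        order = [idx for bucket in buckets for idx in bucket]
    return order


print(radix_sort_order(["ba", "ab", "aa", "ab"], "ab"))
```

Each of the k passes reads one symbol from each of the n strings, so the sketch makes exactly nk symbol reads for a constant-size alphabet.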

The Most Frequent String Search Problem
Let us formally present the problem.
Problem. For some positive integers n and k, we have a sequence of strings $s = (s^1, \dots, s^n)$. Each $s^i = (s^i_1, \dots, s^i_k) \in \Sigma^k$ for some finite size alphabet Σ. Let $\#(t) = |\{i \in \{1, \dots, n\} : s^i = t\}|$ be the number of occurrences of a string t. We search for $i = \mathrm{argmax}_{i \in \{1, \dots, n\}} \#(s^i)$. If several strings satisfy the condition, then the answer is the index of the string with the minimal index in the sequence s. Formally, i is such that $\#(s^i) = \max_{j \in \{1, \dots, n\}} \#(s^j)$ and $i = \min\{j : \#(s^j) = \#(s^i)\}$.
Firstly, we present an idea of the algorithm. The algorithm contains two steps. The first step is sorting the sequence of strings and obtaining ORDER = $(i_1, \dots, i_n)$ such that for any j ∈ {1, …, n − 1}, we have $s^{i_j} \leq s^{i_{j+1}}$ in the lexicographical order. In that case, equal strings are situated sequentially. In the second step, we find each segment $[i_\ell, i_r]$ of indexes for equal strings, i.e., $s^j = s^{i_\ell}$ for $j \in \{i_\ell, \dots, i_r\}$, and $s^{i_{\ell-1}} \neq s^{i_\ell}$ or ℓ = 1, and $s^{i_{r+1}} \neq s^{i_r}$ or r = n. We check segments for different strings one by one. We store the longest segment's length as $c_{max}$ and the minimal index of the string that corresponds to this segment in $j_{max}$. As in the sorting algorithm, in the second step of the algorithm, we apply the COMPARE_STRINGS subroutine for checking the equality of strings. Assume that we have the SORT_STRINGS(s) subroutine that implements the algorithm from Section 4.
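The two steps described above can be sketched classically as follows. This is an illustration under stated assumptions: the hypothetical `compare` function is an error-free stand-in for COMPARE_STRINGS, and Python's `sorted` replaces SORT_STRINGS. The variables `c_max` and `j_max` play the roles of $c_{max}$ and $j_{max}$.

```python
from functools import cmp_to_key


def compare(s, t):
    """Error-free stand-in for COMPARE_STRINGS: returns -1, 0 or 1."""
    return (s > t) - (s < t)


def most_frequent_string(strings):
    """Return the 1-based index of the most frequent string.

    Ties are broken by the smallest index in the input; 0 means empty input.
    """
    n = len(strings)
    by_string = cmp_to_key(lambda a, b: compare(strings[a - 1], strings[b - 1]))
    order = sorted(range(1, n + 1), key=by_string)      # step 1: sorting
    c_max, j_max = 0, 0      # longest segment so far and its minimal index
    c, j_min = 0, 0          # current segment length and its minimal index
    for pos, idx in enumerate(order):                   # step 2: scan segments
        if pos > 0 and compare(strings[order[pos - 1] - 1], strings[idx - 1]) == 0:
            c, j_min = c + 1, min(j_min, idx)           # segment continues
        else:
            c, j_min = 1, idx                           # a new segment starts
        if c > c_max or (c == c_max and j_min < j_max):
            c_max, j_max = c, j_min
    return j_max


print(most_frequent_string(["ab", "cd", "ab", "cd", "ef"]))
```

The scan uses one equality comparison per adjacent pair in the sorted order, so the second step needs at most n − 1 string comparisons.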
Let us present the algorithm formally in Algorithm 2.
Algorithm 2
$(i_1, \dots, i_n)$ = ORDER ← SORT_STRINGS(s) ▷ We sort $s = (s^1, \dots, s^n)$.
…
if $c > c_{max}$ then ▷ If the current segment is longer than the current longest one
…
Let us discuss the complexity of the algorithm.
Theorem 4. Algorithm 2 solves the Most Frequent String Search problem with query complexity O(n(log n) · √k) = Õ(n√k) and constant error probability.
Proof. Because of Theorem 1, the sorting step takes O(n(log n) · √k) queries, and the error probability is constant. The second step does O(n) comparisons of strings; because of Lemma 5, it takes O(n(log n) · √k) queries, and the error probability is constant. The total complexity is O(n(log n)√k + n(log n)√k) = O(n(log n)√k). Error events of the two steps are independent. Therefore, the error probability of the whole algorithm is also constant. We can achieve any required constant error probability by repetition. The technique is standard for both one-sided and two-sided errors. It can be seen, for example, in [16].
Theorem 5. Suppose we have a constant ε such that 0 < ε < 3/4. If the length of the strings k ≥ log₂ n, then any quantum algorithm for the Most Frequent String Search problem has $\Omega(\sqrt{nk} + n^{3/4-\varepsilon})$ query complexity. If k < log₂ n, then any quantum algorithm for the Most Frequent String Search problem has Ω(√(nk)) query complexity.
Proof. Let us show that the problem is at least as hard as the unstructured search problem. Assume that n = 2t and k > 1 for some integer t. Then, let $s^{t+1}, \dots, s^{2t} = 0^k$, where $0^k$ is a string of k zeros. Other strings can be $s^1, \dots, s^t = 1^k$, or there are z ∈ {1, …, t} and u ∈ {1, …, k} such that $s^z_u = 0$ and $s^z_{u'} = 1$ for all u′ ∈ {1, …, u − 1, u + 1, …, k}. In the first case, the answer is $1^k$. In the second case, the answer is $0^k$. Therefore, solving the problem for this instance is equivalent to the search for 0 among the first tk = nk/2 variables.
According to [14], the quantum complexity of the unstructured search among nk/2 variables is Ω(√(nk)).
In the case of odd n, we assign $s^n = 1^{k/2} 0^{k/2}$, and it is not used in the search. Then, we can consider only n − 1 strings. Thus, n − 1 is even.
Let us consider the case of k = 1. If n is odd, then s n = 2. Let s i = 0 for i ≥ t + 1, and t = n/2 . Let us consider two cases. The first one is s i = 1 for all i ∈ {1, . . . , t}. The second case is s i = 1 for all i ∈ {1, . . . , t}\{i 1 } and s i 1 = 0 for some i 1 ∈ {1, . . . , t}. In the first case, the answer is 1. In the second case, the answer is 0. Therefore, solving the problem for this instance is equivalent to the search for 0 among the first t = n/2 = nk/2 variables.
Let us show that the problem is at least as hard as the d-distinctness problem [47]. Let d be such that $\frac{1}{4d} = \varepsilon/2$. Let b be the maximal integer that satisfies n ≥ b · (d − 1) + 1. Let $u_j$ be a binary representation of j for j ∈ {0, …, b}.
Assume that $s^1 = u_1$. For other strings, we have two cases:
• Case 1. The sequence s contains d − 1 copies of each $u_j$, where j ≥ 1, and other strings are $u_0$.
• Case 2. The sequence s contains d − 1 copies of each $u_j$, where j ≥ 1, except some $j_m \in \{2, \dots, b\}$; d copies of $u_{j_m}$; and other strings are $u_0$.
In the first case, $\#(u_j) = d − 1$ for j ∈ {1, …, b}, $\#(u_0) \leq d − 1$ and $s^1 = u_1$. Therefore, the answer is 1. In the second case, $\#(u_j) = d − 1$ for $j \in \{1, \dots, b\} \setminus \{j_m\}$, $\#(u_0) \leq d − 1$ and $\#(u_{j_m}) = d$. Therefore, the answer is $i_m = \min\{i : s^i = u_{j_m}\}$. Note that $i_m \neq 1$ because $j_m \geq 2$ and $s^1 = u_1 \neq u_{j_m}$.
Hence, solving the problem for this instance is equivalent to checking whether there is a string that occurs in the input at least d times. It is the d-distinctness problem from [47]. It is known that the complexity of the problem is $\tilde{\Omega}\left(n^{3/4 - \frac{1}{4d}}\right)$. In our case, the complexity is $\Omega(n^{3/4-\varepsilon})$.
Secondly, let us discuss the classical complexity of the problem. Theorem 6. Any randomized algorithm for the Most Frequent String Search problem has Θ(nk) query complexity.
Proof. The best-known classical algorithm uses the radix sort algorithm and does steps similar to the steps of the quantum algorithm.
The running time of this algorithm is O(nk). At the same time, we can show that it is also a lower bound.
As it was shown in the proof of Theorem 5, the problem is at least as hard as the unstructured search problem among nk/2 variables. It is known [14] that the randomized complexity of the unstructured search among nk/2 variables is Ω(nk).

Intersection of Two Sequences of Strings Problem
Let us consider the following problem.
Problem. For some positive integers n, m and k, we have the sequence of strings $s = (s^1, \dots, s^n)$. Each $s^i = (s^i_1, \dots, s^i_k) \in \Sigma^k$ for some finite size alphabet Σ. Then, we obtain m requests $t = (t^1, \dots, t^m)$, where $t^i = (t^i_1, \dots, t^i_k) \in \Sigma^k$. The answer for a request $t^i$ is 1 if there is j ∈ {1, …, n} such that $t^i = s^j$, and 0 otherwise. We should answer 0 or 1 to each of the m requests.
Let us present the algorithm that is based on the sorting algorithm from Section 4. We sort strings from s. Then, we answer each request using a binary search in the sorted sequence of strings [34] and COMPARE_STRINGS quantum subroutine for strings comparison during the binary search.
Let us present Algorithm 3. Assume that the sorting algorithm from Section 4 is the subroutine SORT_STRINGS(s), and it returns the order ORDER = $(i_1, \dots, i_n)$. The subroutine BINARY_SEARCH_FOR_STRINGS($t^i$, s, ORDER) is the binary search algorithm with the COMPARE_STRINGS subroutine as a comparator, and it searches for $t^i$ in the ordered sequence $(s^{i_1}, \dots, s^{i_n})$. Suppose that the subroutine BINARY_SEARCH_FOR_STRINGS returns 1 if it finds $t^i$ and 0 otherwise.
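The sort-then-binary-search scheme can be sketched classically as follows. This is an illustration under stated assumptions: the hypothetical `compare` function is an error-free stand-in for COMPARE_STRINGS, Python's `sorted` replaces SORT_STRINGS, and `answer_requests` is a hypothetical name for the outer loop that consumes requests one by one.

```python
from functools import cmp_to_key


def compare(s, t):
    """Error-free stand-in for COMPARE_STRINGS: returns -1, 0 or 1."""
    return (s > t) - (s < t)


def binary_search_for_strings(t, strings, order):
    """Return 1 if t occurs among strings (visited in sorted ORDER), else 0."""
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = compare(strings[order[mid] - 1], t)
        if c == 0:
            return 1
        if c < 0:
            lo = mid + 1      # t can only be to the right of mid
        else:
            hi = mid - 1      # t can only be to the left of mid
    return 0


def answer_requests(strings, requests):
    """Sort once, then answer each request as it arrives (online)."""
    by_string = cmp_to_key(lambda a, b: compare(strings[a - 1], strings[b - 1]))
    order = sorted(range(1, len(strings) + 1), key=by_string)
    for t in requests:        # requests may be any iterable, e.g. a generator
        yield binary_search_for_strings(t, strings, order)


print(list(answer_requests(["ba", "ab", "cc"], ["ab", "bb", "cc"])))
```

Each request costs O(log n) string comparisons after a single sorting pass, which is where the m · log n factor in the query complexity comes from.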

Algorithm 3 The Quantum Algorithm for Intersection of Two Sequences of Strings Problem using sorting algorithm.
…
return ans
end for
The algorithm has the following query complexity.

Theorem 7. Algorithm 3 solves Intersection of Two Sequences of Strings Problem with query complexity O((n + m)√k · log n · log(n + m)) = Õ((n + m)√k) and error probability $O\left(\frac{1}{n+m}\right)$.
Proof. The correctness of the algorithm follows from the description. Because of Theorem 1, the sorting algorithm has complexity O(n log n · √k) and constant error probability.
The second step does m binary searches with O(log n) string comparisons each, that is, A = O(m log n) comparisons in total. Because of Lemma 5, it takes O(m log n · √k · log(m log n)) = O(m(log m + log log n) · √k log n) queries, and the error probability is constant. Thus, the total complexity is O((n + m(log m + log log n))√k log n). Error events of the two steps are independent. Therefore, the error probability of the whole algorithm is also constant. We can achieve any required constant error probability by repetition.
The lower bound for the classical case can be proven using a result stated in [48] (Lemma 7, Section 5.1).

Theorem 8. The randomized query complexity of Intersection of Two Sequences of Strings Problem is Θ((n + m)k).
For checking $s^i = t^1$, it is enough to check $\neg \bigvee_{j=1}^{k} (s^i_j = 1)$ because this implies $s^i_j = 0$ for all j ∈ {1, …, k}. In that case, checking for the existence of $t^1$ among the strings $s^i$ is the same as checking the following condition: $\neg \bigwedge_{i=1}^{n} \bigvee_{j=1}^{k} (s^i_j = 1)$. This condition means that not every string $s^i$ contains at least one 1.
The randomized complexity of computing $\neg \bigvee_{j=1}^{k} (s^i_j = 1)$ is the same as the complexity of the unstructured search for 1 among k variables, which is Ω(k). According to [48] (Lemma 7, Section 5.1), the total complexity of the function is Ω(nk).
Assume that m > n. Let us consider $s^i = 0^k$ for all i ∈ {1, …, n}. Checking the existence of $t^j$ among $s^1, \dots, s^n$ is at least as hard as the search for 1 among $t^j_1, \dots, t^j_k$, which requires Ω(k) queries. It is true for all j ∈ {1, …, m}. Therefore, the total randomized complexity is Ω(mk).
Hence, if we join both cases, the randomized complexity of solving the problem is Ω(max (n, m) · k) = Ω((n + m) · k).
This complexity O((n + m)k) can be reached if we use the radix sort algorithm and perform the same operations as in the quantum algorithm.
Note that we can use the quantum algorithm for element distinctness [49,50] for this problem. The algorithm solves the problem of finding two identical elements in the sequence. The query complexity of the algorithm is $O(D^{2/3})$, where D is the number of elements in the sequence. The complexity is tight because of [51]. The algorithm can be the following. On the j-th request, we can add the string $t^j$ to the sequence $s^1, \dots, s^n$ and invoke the element distinctness algorithm that finds a collision of $t^j$ with other strings. Such an approach requires $\Omega(n^{2/3}\sqrt{k})$ queries for each request and $\Omega(mn^{2/3}\sqrt{k})$ for processing all requests. Note that the online nature of requests does not allow us to access all $t^1, \dots, t^m$. Thus, each request should be processed separately.
In a case of n ≫ m, we can use the Grover search algorithm for searching $t^j$ among $(s^1, \dots, s^n)$. The complexity is Õ(m√(nk)) in that case. Because of the probabilistic behavior of the Oracle, we should use an approach similar to [52] that uses ideas of Amplitude Amplification [53].

Conclusions
In the paper, we propose a quantum algorithm for a comparison of strings and a general idea for any algorithm that does A string comparison operations. Then, using these results, we construct a quantum strings sorting algorithm that works faster than the radix sort algorithm, which is the best known deterministic algorithm for sorting a sequence of strings.
We propose quantum algorithms for two problems using the sorting algorithm: the Most Frequent String Search and Intersection of Two String Sequences. These quantum algorithms are more efficient than classical (deterministic or randomized) counterparts in a case of log₂ n = o(√k), where k is the length of strings and n is the number of strings. In a case of the Intersection of Two String Sequences problem, the condition is log₂ n · (log₂ m + log₂ log₂ n) = o(√k), where n and m are the numbers of strings in the two sequences. Note that these assumptions are reasonable.
We discussed quantum and classical lower bounds for these problems. Classical lower bounds are tight, and at the same time, there is room to improve the quantum lower bounds.

Author Contributions:
The main idea and algorithms, K.K. and A.I.; lower bounds, J.V. and K.K.; constructions and concepts, K.K., A.I. and J.V. All authors have read and agreed to the published version of the manuscript.