Text Indexing for Regular Expression Matching

Finding substrings of a text $T$ that match a regular expression $p$ is a fundamental problem. Despite being the subject of extensive research, no solution with a time complexity significantly better than $O(|T||p|)$ has been found. Backurs and Indyk (FOCS 2016) established conditional lower bounds for the algorithmic problem, based on the Strong Exponential Time Hypothesis, that help explain this difficulty. A natural question is whether we can improve the time complexity of matching the regular expression by preprocessing the text $T$. We show that, conditioned on the Online Matrix–Vector Multiplication (OMv) conjecture, even with arbitrary polynomial preprocessing time, a regular expression query on a text cannot be answered in strongly sublinear time, i.e., $O(|T|^{1-\varepsilon})$ for any $\varepsilon > 0$. Furthermore, if we extend the OMv conjecture to a plausible conjecture regarding Boolean matrix multiplication with polynomial preprocessing time, which we call Online Matrix–Matrix Multiplication (OMM), we can strengthen this hardness result: there is no solution with a query time that is $O(|T|^{3/2-\varepsilon})$. These results hold for alphabets of size three or greater. We then provide data structures that answer queries in $O(|T||p|/\tau)$ time, where $\tau \in [1, |T|]$ is fixed at construction. These include a solution that works for all regular expressions with $\mathrm{Exp}(\tau \cdot |T|)$ preprocessing time and space. For patterns containing only 'concatenation' and 'or' operators (the same type used in the hardness result), we provide (1) a deterministic solution which requires $\mathrm{Exp}(\tau \cdot |T| / \log |T|)$ preprocessing time and space, and (2) when $|p| \le |T|^z$ for $z = 2^{o(\sqrt{\log |T|})}$, a randomized solution with amortized query time which answers queries correctly with high probability, requiring $\mathrm{Exp}(\tau \cdot |T| / 2^{\Omega(\sqrt{\log |T|})})$ preprocessing time and space.


Introduction
The ability to search for substrings matching a regular expression in preprocessed text is useful in countless applications. This is evident from the multitude of popular regular expression engines that exist to facilitate this task. This list includes regular expression engines built into software packages and programming languages [1][2][3][4][5][6], those used within search engines for code repositories [7][8][9], and more generally, engines used for searching through string fields in database systems like SQL and non-relational databases [10][11][12]. In many of these cases, the text that we wish to search over is available long before any regular expression is provided. Based on this, one could hope to do much better than an algorithmic solution that does not take advantage of preprocessing the text. Despite a substantial effort, however, there has been little progress in finding solutions with good theoretical worst-case guarantees, and most often heuristic solutions are used, some of which are briefly described in Section 1.1. Let us now formalize the problem that we wish to solve.

Problem 1. You are given a text $T$ for polynomial-time preprocessing. Following preprocessing, queries are given in the form of a regular expression $p$. The response to this query should be the set $\{(i, j) \mid T[i, j] \text{ is matched by regular expression } p\}$. The existential version of this problem asks whether this set is empty or not.
This paper approaches the problem from both sides of computational complexity. It provides a data structure that takes advantage of our ability to preprocess the text. It also establishes conditional lower bounds that help to explain the difficulty in deriving better solutions. We will show that, conditioned on a popular conjecture, there does not exist an $O(|T|^{1-\varepsilon})$ query time solution for any $\varepsilon > 0$. Under a slightly stronger assumption, there does not exist an $O(|T|^{3/2-\varepsilon})$ query time solution for any $\varepsilon > 0$. The proofs are presented in Section 2.
We next discuss some of the problems used to prove the results stated above, along with the related conjectures and background. These reductions will all use a similar theme, that is, a connection between matching a regular expression to a text and the multiplication of two Boolean vectors. This connection is simply that for the inner product of two Boolean vectors to be 0, a 0 can be multiplied with a 0 or a 1, but a 1 cannot be multiplied with another 1. We will see in Section 2 that this behavior can be easily modeled using the or operator of a regular expression. These observations are evident in the original fine-grained hardness results for regular expression pattern matching appearing in [13]. Two problems based on the orthogonality of Boolean vectors are used in this work to manipulate the size of the query input and obtain different hardness results.
The first of these problems is the Online Boolean Matrix-Vector Multiplication problem. In this problem, matrix multiplication is over the Boolean semiring, where the product is defined as $(Mv)_i = \bigvee_{j=1}^{n} (M_{ij} \wedge v_j)$. The formal definition is as follows:

Problem 2 (Online Boolean Matrix-Vector Multiplication Problem (OMv)). You are given an $n \times n$ Boolean matrix $M$ for polynomial-time preprocessing. Following preprocessing, $n$ vectors $v_1, \ldots, v_n$, each of dimension $n \times 1$, are given in an online fashion. After each vector $v_i$ is given, the vector $M v_i$ (over the Boolean semiring) must be reported.
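To make the online setting concrete, the following is a minimal Python sketch of the naive OMv procedure, which spends $O(n^2)$ time per vector and does no preprocessing. The function names `bool_mat_vec` and `omv` are illustrative and do not come from the paper.

```python
from typing import Iterable, Iterator, List


def bool_mat_vec(M: List[List[int]], v: List[int]) -> List[int]:
    """Boolean product: (Mv)_i = OR over j of (M[i][j] AND v[j])."""
    return [int(any(m_ij and v_j for m_ij, v_j in zip(row, v))) for row in M]


def omv(M: List[List[int]], vectors: Iterable[List[int]]) -> Iterator[List[int]]:
    """Answer each vector before the next one is read (the online requirement)."""
    for v in vectors:
        yield bool_mat_vec(M, v)


M = [[1, 0, 0],
     [0, 1, 1],
     [0, 0, 0]]
answers = list(omv(M, [[1, 0, 0], [0, 0, 1], [0, 0, 0]]))
```

Over $n$ vectors this naive approach takes $O(n^3)$ time in total, which is the bound the OMv conjecture says cannot be improved by a polynomial factor.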
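The following conjecture is used frequently in the field of fine-grained complexity.

Conjecture 1 (OMv Conjecture). The Online Boolean Matrix-Vector Multiplication problem cannot be solved in strongly subcubic time, $O(n^{3-\varepsilon})$ for any $\varepsilon > 0$, even with arbitrary polynomial-time preprocessing.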
This conjecture was first introduced in [14] and has grown in popularity in recent years. It has been used as the basis for results in dynamic graph problems among other works [15][16][17][18][19]. The best-known algorithms that the authors are aware of for OMv can be found in [20]. We now introduce a natural extension of the OMv problem and a stronger conjecture.

Problem 3 (Online Boolean Matrix-Matrix Multiplication problem (OMM)). You are given an $n \times n$ Boolean matrix $A$ for polynomial-time preprocessing. Following preprocessing, an $n \times n$ Boolean matrix $B$ is given and the Boolean matrix $AB$ must be reported.

Conjecture 2 (OMM Conjecture). The Online Boolean Matrix-Matrix Multiplication problem cannot be solved in strongly subcubic time, $O(n^{3-\varepsilon})$ for any $\varepsilon > 0$, even with arbitrary polynomial-time preprocessing, using purely combinatorial methods.
Although OMM is very similar to Boolean Matrix Multiplication (BMM), where one has n × n matrices A and B and seeks their product over the Boolean semiring, it differs in that it allows for arbitrary polynomial-time preprocessing of one of the matrices. It, therefore, forms something of a combination of BMM and OMv.
Note that a strongly subcubic time algorithm for OMM does not imply a strongly subcubic time algorithm for OMv. Hence, the OMM conjecture being proven false would not tell us the validity of the OMv conjecture. However, a combinatorial subcubic time algorithm for OMv would imply a subcubic time algorithm for OMM. Hence the OMv conjecture being proven false (with a combinatorial subcubic time algorithm) would prove the OMM conjecture false as well. This makes the OMv conjecture a favorable assumption when it is possible to base results on it rather than OMM. In the case of the problems addressed in [14], a single vector suffices in their reduction where individual updates to the structure of interest are made, such as adding an edge or modifying an edge weight in a graph. Dynamic problems with batch updates are less frequently addressed, but they may be problems where the OMM conjecture is a better-suited conjecture. In particular, the OMM conjecture may be useful when the input size of a query needs to be manipulated to obtain stronger hardness results. To the best of our knowledge, research related to solving OMM focuses on utilizing the sparsity of one of the matrices [21,22].
Returning to the problem of indexing a text for regular expression queries: on the upper-bound side, simple approaches like storing all precomputed solutions do not work, as the space required cannot be bounded in terms of $|T|$. This is because $|p|$ can be larger than $|T|$. We therefore present three solutions to the existential version of the problem for constant-sized alphabets. The first is a general solution with $O(|T||p|/\tau)$ query time and $\mathrm{Exp}(\tau|T|)$ space and preprocessing time (where $\mathrm{Exp}(f(n))$ means $2^{O(f(n))}$), and where $\tau \in [1, |T|]$ is a parameter fixed at construction time. To handle queries containing only concatenation and or operators, we provide (i) a deterministic solution and (ii) a randomized solution that answers queries correctly with high probability, each with $O(|T||p|/\tau)$ query time (amortized in the randomized case) and smaller exponential space and preprocessing time than the general solution.

Our solutions also provide all of the starting and ending locations of matches for $p$. However, they do not give the correspondences between the two (which ending positions belong to a given starting position). Our approach is based on constructing graph-based representations that map starting indices of substrings matched by the regular expression to ending indices of those substrings. We preprocess solutions to small regular expressions and provide a way in which these preprocessed solutions can be merged to form the final query response.

Background and Related Work
A regular expression $p$ as defined in this paper is one of the following: the or operator '|' applied to two regular expressions $p_1$ and $p_2$, the concatenation operator '•' applied to two regular expressions $p_1$ and $p_2$, the star operator '*' applied to a regular expression $p_1$, a symbol from the alphabet, the empty string, or the empty set. We call $p_1$ and $p_2$ subexpressions of $p$. We will not consider more advanced operators such as bracketed expressions that allow for a range of characters to be denoted within a few symbols, or captured groups that allow matched substrings to be recalled, or other added features that can make the problem NP-complete [23]. The length of $p$, denoted $|p|$, can be viewed as the number of vertices in the parse tree for $p$, as seen in Figure 1. The precedence-setting braces in the regular expression do not contribute to its length, as they do not appear as vertices in the parse tree. To avoid confusion, we will always refer to the length of the text $T$ as $|T|$ and the length of the pattern as $|p|$, thereby reserving $n$ for the $n \times n$ matrices in Section 2. We will also always consider $T$ as indexed from 1 to $|T|$.
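As an illustration of this length measure, here is a small Python sketch that represents a parse tree as nested tuples and counts its vertices. The tuple encoding and the tag names (`'sym'`, `'or'`, `'cat'`, `'star'`) are assumptions made for this example only, and the sample expression is not the one shown in Figure 1.

```python
def length(p):
    """Number of vertices in the parse tree of p.

    A vertex is a tuple whose first entry is a tag; children are the
    remaining entries that are themselves tuples.
    """
    return 1 + sum(length(child) for child in p[1:] if isinstance(child, tuple))


# Parse tree of (a | b) . c*  (parentheses add no vertices).
example = ("cat",
           ("or", ("sym", "a"), ("sym", "b")),
           ("star", ("sym", "c")))
example_length = length(example)
```

Here the tree has one concatenation vertex, one or vertex, one star vertex, and three leaves, so the pattern has length 6.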
As mentioned in the introduction, this problem has to be solved frequently in applications. As such, different heuristics have been proposed [24,25]. Most of these heuristics, and the applications that use them in practice, rely on multi-grams: small portions of text that will match some part of a substring in the input expression. Furthermore, to the authors' best knowledge, there are no solutions that preprocess $T$ and have a guaranteed time complexity lower than that of the algorithmic solutions. In the case where the pattern is given for preprocessing (rather than the text), work by Bille shows that after $O(|p| \log |p|)$ preprocessing time, each character in the text can be processed in $O\!\left(|p| \frac{\log w}{w} + \log |p|\right)$ time, where $w$ is the word size [26].
For the algorithmic problem without any preprocessing, there exists a large body of research. The oldest and most fundamental solution is Thompson's method from 1968 [27]. This method locates substrings matching the regular expression $p$ in $O(|T||p|)$ time via the simulation of a nondeterministic finite automaton (NFA) equivalent to $p$. Results by Myers improve this to $O\!\left(\frac{|T||p|}{\log |T|} + (|T| + |p|) \log |T|\right)$ time [28]. This was further improved to $O\!\left(\frac{|T||p| \log \log |T|}{\log^{1.5} |T|} + |T| + |p|\right)$ time in [29]. Additionally, a result by Bille and Thorup says that if $p$ consists of $k$ strings then the algorithmic problem can be solved in time O(|T|k + |p| log k) [30]. In practice, for programming languages like Perl and Python, the simulation of this NFA is typically done using a method called backtracking, which can lead to exponential time in the worst case. The main reason for this choice of implementation appears to be its simplicity [31].

To help answer why there have not been more significant advancements on the algorithmic problem, Backurs and Indyk established fine-grained lower bounds [13]. These lower bounds use the Strong Exponential Time Hypothesis (SETH). The proofs use a reduction from SETH to the problem of finding two orthogonal vectors, each from its own set (the Orthogonal Vectors problem). This is then reduced to pattern matching on regular expressions. The main idea in the final step is to use regular expressions to detect orthogonality between two vectors. The same technique is used in this paper in Section 2, although conditioned on a different conjecture. Both the work [13] and its extensions [32] focus on classifying regular expressions based on which types make pattern matching more difficult. We do not make such distinctions here.

Hardness of Creating an Index for Regular Expression Queries
The solution described in Section 3 requires exponential preprocessing and storage to see significant improvements in the query complexity over complete recomputation. In this section, we show that it is unlikely we will obtain a solution with polynomial preprocessing time and query time significantly better than $O(|T|^{3/2})$, and even more unlikely that we obtain one with query time significantly better than $O(|T|)$. All of the hardness results in this section hold for strings over an alphabet of size 3.

Hardness Based on OMv
The reduction here differs from the one in [13] in that the reduction here is from OMv (see Section 1 for a description of OMv) rather than the Orthogonal Vectors problem. This is done to add the notion of preprocessing. We adopt similar notation as was used there and consider this as the warm-up to the reduction used in Section 2.2.
Suppose the matrix $M$ has the rows $r_1, \ldots, r_n$ from top to bottom, where each $r_i$ is a $1 \times n$ binary vector. We set the text $T$ as the concatenation of the corresponding binary strings using the character 2 as a delimiter, that is,

$T = 2 \bullet r_1 \bullet 2 \bullet r_2 \bullet \cdots \bullet 2 \bullet r_n.$

We now show how to process an input vector $v$ into a regular expression $p$. The component gadget is the same as the one from [13]. It is

$CG(0) = (0 \mid 1), \qquad CG(1) = 0.$

The definition $CG(0) = (0 \mid 1)$ is motivated by the fact that in the inner product of two vectors, 0 can be multiplied by either a 0 or a 1 while maintaining that the inner product is 0. The definition $CG(1) = 0$ is motivated by the fact that in the inner product of two vectors, 1 can only be multiplied by 0 while maintaining that the inner product is 0.

Next, we define the vector gadget as $VG(v) = CG(v_1) \bullet CG(v_2) \bullet \cdots \bullet CG(v_n)$ and set $p = 2 \bullet VG(v)$. Observe that our regular expression query pattern $p$ has length $O(n)$. For example, $VG(011) = (0 \mid 1) \bullet 0 \bullet 0$.

Lemma 1. With $T$ and $p$ constructed as above, the pattern $p$ matches a substring starting at index $1 + (i - 1)(n + 1)$ in $T$ if and only if $r_i \cdot v = 0$.

Proof. First assume $r_i \cdot v = 0$. If a component $r_{ij} = 1$, we must have $v_j = 0$, so $CG(v_j) = (0 \mid 1)$, and choosing the 1 branch matches this particular character. For every character $r_{ik} = 0$, the gadget $CG(v_k)$ can match 0 regardless of $v_k$; thus all $n$ characters in $r_i$ can be matched. This implies the substring $2 \bullet r_i$ that starts at index $1 + (i - 1)(n + 1)$ is matched by $p$.

To prove the converse, assume $r_i \cdot v \neq 0$. Then there exists a component where $r_{ij} = 1$ and $v_j = 1$. Because $CG(v_j) = 0$, $CG(v_j)$ cannot be made to match the 1 character in $T$ corresponding to $r_{ij}$. Additionally, because of the leading 2 symbol, this is the only character in the substring $2 \bullet r_i$ with which $CG(v_j)$ could be matched. The two observations combined imply the substring $2 \bullet r_i$ that starts at $1 + (i - 1)(n + 1)$ cannot be matched with $p$.
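The reduction can be tried out on small inputs. The sketch below assumes the construction suggested by the surrounding text: the text is each row of $M$ preceded by the delimiter 2, and the pattern is a 2 followed by one component gadget per entry of $v$. Python's backtracking `re` module stands in for the hypothetical index, and the function names are illustrative only.

```python
import re


def text_from_matrix(M):
    """T = 2 r_1 2 r_2 ... 2 r_n, with '2' as the delimiter."""
    return "".join("2" + "".join(map(str, row)) for row in M)


def component_gadget(bit):
    """CG(0) = (0|1) and CG(1) = 0."""
    return "(0|1)" if bit == 0 else "0"


def pattern_from_vector(v):
    """p = 2 followed by the concatenated component gadgets of v."""
    return "2" + "".join(component_gadget(b) for b in v)


def orthogonal_rows(M, v):
    """Entry i is True iff p matches at the start of block i, i.e. r_i . v = 0."""
    n = len(M)
    T = text_from_matrix(M)
    p = re.compile(pattern_from_vector(v))
    # 1-indexed position 1 + (i - 1)(n + 1) is 0-indexed offset i * (n + 1).
    return [p.match(T, i * (n + 1)) is not None for i in range(n)]


M = [[1, 0, 1],
     [0, 1, 0],
     [0, 0, 0]]
result = orthogonal_rows(M, [0, 1, 1])
```

The Boolean product follows by negation: $(Mv)_i = 1$ exactly when row $i$ is not orthogonal to $v$.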
Theorem 1. For all $\varepsilon > 0$, conditioned on the OMv conjecture, there does not exist an index that can answer regular expression queries in $O(|T|^{1-\varepsilon} + (|p| + occ)^{2-\varepsilon})$ time, where $occ$ is the size of the output. This holds even with arbitrary polynomial-time preprocessing.

Proof. The OMv conjecture is that for an $n \times n$ matrix $M$ the OMv problem cannot be solved in $O(n^{3-\varepsilon})$ time, even with polynomial preprocessing time. The input string $T$ is of length $\Theta(n^2)$, and we receive a total of $n$ vectors that correspond to $n$ regular expression queries. If each of the regular expression queries could be answered in $O(|T|^{1-\varepsilon} + (|p| + occ)^{2-\varepsilon})$ time for some $\varepsilon > 0$ (note that $|p|$ and the output size $occ$ are both $O(n) = O(\sqrt{|T|})$), then by Lemma 1 we can solve OMv in time $O(n(|T|^{1-\varepsilon} + (|p| + occ)^{2-\varepsilon}))$, which is $O(n^{3-\varepsilon})$.
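For readers who want the arithmetic behind the last step, it can be spelled out as follows, using only $|T| = \Theta(n^2)$ and $|p| + occ = O(n)$ from the proof:

```latex
n\left(|T|^{1-\varepsilon} + (|p| + occ)^{2-\varepsilon}\right)
  = O\!\left(n \cdot n^{2(1-\varepsilon)} + n \cdot n^{2-\varepsilon}\right)
  = O\!\left(n^{3-2\varepsilon} + n^{3-\varepsilon}\right)
  = O\!\left(n^{3-\varepsilon}\right).
```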
It is natural to ask whether we are using the full power of our hypothetical solution for regular expression queries. After all, our solution can potentially report matching substrings starting at every index. In the above reduction, if we removed the 2s acting as delimiters, it would compute the orthogonality of the input vector $v$ with every starting position in the linearized $M$. This can be used to compute the cross-correlation between two vectors with non-negative entries if one is only concerned with whether a resulting entry is zero or non-zero. By reversing one of the vectors, the convolution can also be computed. Specifically, we can modify our reduction to have $T$ represent a $1 \times n$ binary vector $v_1$ and $p$ the bottom-to-top reversal of a second $n \times 1$ binary vector $v_2$ (both padded with at least $n$ zeros on either side). This allows us to compute the convolution $v_1 * v_2$.
By restricting the preprocessing time, we can obtain corollaries directly from the above reduction and the 3SUM hardness results established for the partial convolution indexing problem in [33]. Suppose along with the regular expression p we are given a set S of indices where we wish to know if a match starts. Call this the Partial Regular Expression Query problem. The following corollary is immediate from [33].

Hardness Based on OMM
We will now utilize the OMM conjecture to strengthen our results. The reduction in this section requires only a single query pattern of size Θ(n 2 ). This allows us to obtain a lower bound stronger than those obtained under OMv or 3SUM, but still conditioned on a plausible conjecture. See Section 1 for a description of the OMM conjecture.
Again, let $r_1, \ldots, r_n$ be the rows of the input matrix $A$ that we can preprocess. We make

$T = 2 \bullet r_1 \bullet 1^n \bullet 2 \bullet r_2 \bullet 1^n \bullet \cdots \bullet 2 \bullet r_n \bullet 1^n.$

The substrings $1^n$ will allow us to deduce from each ending index of a match the vector $v_j$ used in the construction of $p$ that is responsible for that match. As we will see, this is because a subexpression in $p$ built from a vector $v_j$ has the suffix $1^j$. Note that $|T| = \Theta(n^2)$.

From the matrix $B$ supplied at run time, we construct a regular expression $p$. Let $B$ have the columns $v_1, \ldots, v_n$ from left to right. Using the same vector gadgets from Section 2.1, we set

$p = (2 \bullet VG(v_1) \bullet 1^1) \mid (2 \bullet VG(v_2) \bullet 1^2) \mid \cdots \mid (2 \bullet VG(v_n) \bullet 1^n).$

Lemma 2. Let $C = AB$. For $i, j \in [1, n]$ there exists a substring matched by regular expression $p$ starting in $T$ at index $(i - 1)(2n + 1) + 1$ and a substring matched ending at index $(i - 1)(2n + 1) + 1 + n + j$ if and only if $C_{ij} = 0$.

Proof. First assume $C_{ij} = 0$. This implies that $r_i \cdot v_j = 0$. By the argument given in the proof of Lemma 1, this implies the subexpression $2 \bullet VG(v_j)$ matches the substring $2 \bullet r_i$ of $T$, which now starts at index $(i - 1)(2n + 1) + 1$. Moreover, we can match the subexpression $1^j$ of $2 \bullet VG(v_j) \bullet 1^j$ with the first $j$ 1's that follow $2 \bullet r_i$ in $T$, giving the final index of the match as $(i - 1)(2n + 1) + 1 + n + j$.
The output size of the query is bounded by $n^2$, since each of the $n$ subexpressions of $p$ of the form $2 \bullet VG(v_j) \bullet 1^j$ matches $T$ in at most $n$ places. Because the output size does not exceed our desired lower bound, we can apply Lemma 2 and the fact that $|p| = \Theta(n^2)$ to obtain Theorem 2.
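The following Python sketch illustrates how match positions encode the zero entries of the product. It assumes the layout implied by the index arithmetic in Lemma 2: block $i$ of the text is a 2, then row $r_i$, then $n$ ones, and the pattern is the or of the $n$ subexpressions $2 \bullet VG(v_j) \bullet 1^j$. Instead of a real index that reports every match, it probes the $n^2$ relevant (start, end) pairs with `re`; all function names are illustrative.

```python
import re


def omm_text(A):
    """Block i is '2', then row r_i, then n ones."""
    n = len(A)
    return "".join("2" + "".join(map(str, row)) + "1" * n for row in A)


def omm_pattern(B):
    """Or of the subexpressions 2 . VG(v_j) . 1^j over the columns v_j of B."""
    n = len(B)
    alternatives = []
    for j in range(n):
        column = [B[i][j] for i in range(n)]
        vg = "".join("(0|1)" if bit == 0 else "0" for bit in column)
        alternatives.append("2" + vg + "1" * (j + 1))
    return "|".join(alternatives)


def product_via_matching(A, B):
    """Boolean product AB, read off from which (start, end) pairs are matched."""
    n = len(A)
    T = omm_text(A)
    p = re.compile(omm_pattern(B))
    C = [[1] * n for _ in range(n)]
    for i in range(n):
        start = i * (2 * n + 1)          # 0-indexed start of block i
        for j in range(1, n + 1):
            end = start + 1 + n + j      # exclusive end: '2', r_i, then j ones
            if p.fullmatch(T, start, end):
                C[i][j - 1] = 0          # a match means r_i . v_j = 0
    return C


A = [[1, 0], [0, 1]]
B = [[1, 1], [0, 0]]
C = product_via_matching(A, B)
```

Only the alternative for $v_j$ has length $n + 1 + j$, which is why the ending index of a match identifies the column responsible for it.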

Theorem 2. For all $\varepsilon > 0$, conditioned on the OMM conjecture, there does not exist an index that can answer regular expression queries in $O((|T| + |p| + occ)^{3/2-\varepsilon})$ time, where $occ$ is the size of the output. This holds even with arbitrary polynomial-time preprocessing.

A Regular Expression Index
Before discussing our graph-based representations for query solutions we make a brief observation regarding the parse tree constructed from the regular expression. Within this rooted tree, each vertex represents the regular expression obtained by applying the operator labeling the vertex to the subexpressions represented by its children. The order in which these operators are applied is defined by the structure of the tree (see Figure 1). The leaves of the tree are either the empty string, the empty set, or symbols from the string's alphabet. The ideas we present next are based on being able to quickly merge the solutions for each vertex of the tree.

Solution Graphs: Pattern Matching via Reachability
We take a slightly unorthodox view of regular expressions in this section. This viewpoint will make it easier to precompute solutions to small regular expression queries and then merge them with other solutions to answer the query. Essentially, we will view the solution for a regular expression query with regular expression $p$ as a directed graph $G$ where we are mainly concerned with the reachability from vertices in a set $V_1$ to vertices in another set $V_2$. A solution graph for a given text $T$ and regular expression $p$ is defined as follows:

Definition 1 (Solution Graph). For a text $T$ and regular expression $p$, a solution graph $G$ is a directed graph that contains two distinguished sets of vertices $V_1 = \{v^1_1, \ldots, v^1_{|T|+1}\}$ and $V_2 = \{v^2_1, \ldots, v^2_{|T|+1}\}$, called the start and end vertices, respectively. In $G$ there exists a path from $v^1_i \in V_1$ to a vertex $v^2_{j+1} \in V_2$ if and only if the substring $T[i, j]$ matches the regular expression $p$. Correspondingly, we will say that the regular expression given by the empty string matches the substring $T[i, i - 1]$.
Solution graphs for a text and pattern are not unique. Moreover, there will typically be many additional vertices lying on the paths between the V 1 and V 2 , but we ultimately only care about the reachability between these two sets. Note also that this representation may require much more information than our final solution, which itself must only provide a list of starting positions and ending positions. We will demonstrate how these graphs can be merged together at every step while only using O(|T|) time for each vertex of the parse tree.

Merging Solution Graphs
We now describe how solution graphs for subexpressions can be merged based on the operators used. We use this procedure both in the construction and in the query phase. It forms the key component of our technique. Illustrations are provided in Figures 2 and 3. If the regular expression is only a single symbol, the solution graph is easily obtained as a bipartite graph, e.g., the top row of Figure 2.
First we consider the or operator '|'. Let $p_1$ be one regular expression with solution graph $G_1$ that has start and end vertex sets $V_1$ and $V_2$. Let $p_2$ be a regular expression with solution graph $G_2$ that has start and end vertex sets $U_1$ and $U_2$. To compute the graph $G$ corresponding to $p_1 \mid p_2$, do the following:

1. Graph $G$ is initially equal to the two disconnected graphs $G_1$ and $G_2$.

2. Add two new vertex sets $W_1 = \{w^1_1, \ldots, w^1_{|T|+1}\}$ and $W_2 = \{w^2_1, \ldots, w^2_{|T|+1}\}$, and make $W_1$ the start vertex set of $G$ and $W_2$ the end vertex set of $G$.

3. Add the directed edges $(w^1_i, v^1_i)$, $(w^1_i, u^1_i)$, $(v^2_i, w^2_i)$, and $(u^2_i, w^2_i)$ for $1 \le i \le |T| + 1$.

For the concatenation operator '•' we again take $G_1$ as the solution graph for pattern $p_1$ with start and end vertex sets $V_1$ and $V_2$, respectively, and $G_2$ as the solution graph for pattern $p_2$ with start and end vertex sets $U_1 = \{u^1_1, \ldots, u^1_{|T|+1}\}$ and $U_2 = \{u^2_1, \ldots, u^2_{|T|+1}\}$. To create the solution graph $G$ for $p_1 \bullet p_2$, do the following:

1. Graph $G$ is initially equal to the two disconnected graphs $G_1$ and $G_2$.

2. Make $V_1$ the start vertex set of $G$ and $U_2$ the end vertex set of $G$.

3. Add the directed edges $(v^2_k, u^1_k)$ for $1 \le k \le |T| + 1$.

Lastly, we consider the star operator '*'. Again taking $V_1$, $V_2$ as the start and end vertex sets of $G_1$ for $p_1$, construct the solution graph $G$ for $p_1^*$ as follows:

1. Graph $G$ is initially equal to $G_1$.

2. Add two new vertex sets $W_1 = \{w^1_1, \ldots, w^1_{|T|+1}\}$ and $W_2 = \{w^2_1, \ldots, w^2_{|T|+1}\}$, and make $W_1$ the start vertex set of $G$ and $W_2$ the end vertex set of $G$.

3. Add the directed edges $(w^1_i, w^2_i)$, $(w^1_i, v^1_i)$, $(v^2_i, w^2_i)$, and $(v^2_i, v^1_i)$ for $1 \le i \le |T| + 1$.

In each step above, $\Theta(|T|)$ edges are added to the resulting solution graph. Furthermore, any solution graphs that arise from matching a pattern consisting of only symbols (the leaves of the parse tree) have only $O(|T|)$ edges. Therefore, a regular expression of length $|p|$ results in a solution graph having $O(|T||p|)$ edges.
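The merging procedure can be sketched in Python as below. The exact edge sets are an assumption reconstructed from the edges named in the proof of Lemma 3: fresh start and end sets $W_1$, $W_2$ for or and star, edges $(v^2_k, u^1_k)$ for concatenation, and additional edges $(w^1_i, w^2_i)$ and $(v^2_i, v^1_i)$ for star. The class and method names are hypothetical. Matches are read off by a breadth-first search from every start vertex, which is only meant to validate the construction on small inputs, not to reproduce the query procedure described later.

```python
import re
from collections import deque


class SolutionGraphs:
    """Shared vertex pool; a solution graph is a pair (start_ids, end_ids)."""

    def __init__(self, text):
        self.text = text
        self.m = len(text) + 1      # |T| + 1 vertices per distinguished set
        self.adj = []               # adjacency lists indexed by vertex id

    def new_set(self):
        ids = list(range(len(self.adj), len(self.adj) + self.m))
        self.adj.extend([] for _ in ids)
        return ids

    def symbol(self, c):
        V1, V2 = self.new_set(), self.new_set()
        for i, ch in enumerate(self.text):
            if ch == c:             # edge (v1_i, v2_{i+1}) when T[i] = c
                self.adj[V1[i]].append(V2[i + 1])
        return V1, V2

    def union(self, g1, g2):
        (V1, V2), (U1, U2) = g1, g2
        W1, W2 = self.new_set(), self.new_set()
        for i in range(self.m):
            self.adj[W1[i]] += [V1[i], U1[i]]
            self.adj[V2[i]].append(W2[i])
            self.adj[U2[i]].append(W2[i])
        return W1, W2

    def concat(self, g1, g2):
        (V1, V2), (U1, U2) = g1, g2
        for k in range(self.m):     # edge (v2_k, u1_k)
            self.adj[V2[k]].append(U1[k])
        return V1, U2

    def star(self, g1):
        V1, V2 = g1
        W1, W2 = self.new_set(), self.new_set()
        for i in range(self.m):
            self.adj[W1[i]] += [W2[i], V1[i]]   # empty match, or enter G_1
            self.adj[V2[i]] += [W2[i], V1[i]]   # leave G_1, or repeat p_1
        return W1, W2

    def matches(self, g):
        """All 1-indexed (i, j) with T[i, j] matched; j = i - 1 is the empty match."""
        starts, ends = g
        end_index = {v: idx for idx, v in enumerate(ends)}
        found = set()
        for i, s in enumerate(starts):
            seen, queue = {s}, deque([s])
            while queue:
                u = queue.popleft()
                if u in end_index:
                    found.add((i + 1, end_index[u]))
                for w in self.adj[u]:
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
        return found


text = "abaab"
G = SolutionGraphs(text)
# (a | b.a)* . b ; every leaf occurrence gets its own fresh graph.
g = G.concat(G.star(G.union(G.symbol("a"), G.concat(G.symbol("b"), G.symbol("a")))),
             G.symbol("b"))
found = G.matches(g)
```

Because the operators add edges to the graphs they receive, each subexpression graph should be used exactly once, as in a parse tree. The result can be compared against `re.fullmatch("(a|ba)*b", text[i - 1:j])` over all pairs $(i, j)$.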

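Lemma 3. The constructions of $G$ for the or ($p_1 \mid p_2$), concatenation ($p_1 \bullet p_2$), and star ($p_1^*$) operators result in $G$ being a solution graph for the corresponding regular expression.

Proof. The first two of these are straightforward, but we include them for completeness.

(i) For the or operator, first consider when there exists a substring $T[i, j]$ matched by $p_1 \mid p_2$. Then the substring $T[i, j]$ must be matched by $p_1$ or $p_2$. Say WLOG it is matched by $p_1$; then there exists a path from $v^1_i \in V_1$ to $v^2_{j+1} \in V_2$, which implies a path from $w^1_i \in W_1$ to $w^2_{j+1} \in W_2$ in $G$ due to the edges $(w^1_i, v^1_i)$ and $(v^2_{j+1}, w^2_{j+1})$. In the other direction, if there exists a path from $w^1_i$ to $w^2_{j+1}$, it must start with either the edge $(w^1_i, v^1_i)$ or the edge $(w^1_i, u^1_i)$. Since $G_1$ and $G_2$ are joined only through $W_1$ and $W_2$, the remainder of the path stays within $G_1$ or within $G_2$ until it reaches $w^2_{j+1}$, so $T[i, j]$ is matched by $p_1$ or by $p_2$.

(ii) For the concatenation operator, first consider when there exists a substring $T[i, j]$ matched by $p_1 \bullet p_2$. Then there exists an index $k$ such that $T[i, k - 1]$ is matched by $p_1$ and $T[k, j]$ is matched by $p_2$. Hence, there exists a path from $v^1_i$ to $v^2_k$ in $G_1$ and a path from $u^1_k$ to $u^2_{j+1}$ in $G_2$. Connecting these paths using the edge $(v^2_k, u^1_k)$ we obtain the desired path.

In the other direction, if there exists a path from $v^1_i$ to $u^2_{j+1}$ then there must exist a path from $v^1_i$ to some $v^2_k$ and then a path from $u^1_k$ to $u^2_{j+1}$, which implies $T[i, k - 1]$ is matched by $p_1$ and $T[k, j]$ is matched by $p_2$.

(iii) For the star operator, Figure 2 is particularly helpful in understanding the argument.

We will use induction on $x$ to show that all substrings $T[i, j]$ matched by $x \ge 1$ concatenations of $p_1$ correspond to a path from $v^1_i$ to $v^2_{j+1}$. For $x = 1$ the path exists in $G_1$ by definition. For $x > 1$, the substring splits into $T[i, k - 1]$, matched by $x - 1$ concatenations of $p_1$, and $T[k, j]$, matched by $p_1$; the inductive hypothesis gives a path from $v^1_i$ to $v^2_k$, the edge $(v^2_k, v^1_k)$ leads to $v^1_k$, and a path in $G_1$ leads from $v^1_k$ to $v^2_{j+1}$. Now suppose $T[i, j]$ is matched by $p_1^*$ using $x$ concatenations of $p_1$. If $x = 0$, the substring is the empty string $T[i, i - 1]$ and the edge $(w^1_i, w^2_i)$ provides a path. If $x > 0$, then by the preceding paragraph we can use the edge $(w^1_i, v^1_i)$, the path from $v^1_i$ to $v^2_{j+1}$, and the edge $(v^2_{j+1}, w^2_{j+1})$ to obtain the desired path. In the other direction, if there exists a path from $w^1_i$ to $w^2_{j+1}$, and $j + 1 = i$, then the corresponding substring is $T[i, i - 1]$, and so the substring is the empty string, which is matched by $p_1^*$. Otherwise $j + 1 > i$, in which case the design of $G$ implies there must be a path from $v^1_i$ to $v^2_{j+1}$. Along this path any edges from $V_1$ to $V_2$ correspond to a substring of $T$ matched by $p_1$, and any edges from $V_2$ to $V_1$ correspond to the concatenation of two of these substrings. Putting these together we get that the substring $T[i, j]$ is the concatenation of substrings matched by $p_1$, and is therefore matched by $p_1^*$.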

Precomputed Solutions
Having established how the solutions to sub-problems can be merged and pushed up to the root of our regular expression parse tree, we next describe how to use this fact to create solutions with the desired preprocessing/query time trade-off.
Solution Graphs for the Leaves. Each leaf represents either the empty set, the empty string, or an alphabet symbol. The empty set graph will have no edges, whereas the empty string has the edges $(v^1_i, v^2_i)$ for $1 \le i \le |T| + 1$. For leaves with alphabet symbols, the solution graphs have the edges $(v^1_i, v^2_{i+1})$ whenever $T[i]$ matches the character labeling that leaf vertex.

Sampling Subtrees. We start with the observation that the number of vertices in the parse tree can be made linear in the number of leaves, despite vertices with the '*' label having only a single child. This is thanks to the property $(p^*)^* = p^*$, which means that we can contract one of the star operator vertices if it appears as a child of another star operator vertex.

For a given $\tau$ value we can consider all regular expressions of length $\tau$. With constant alphabet sizes, there are $2^{O(\tau)}$ such regular expressions. For each of these, we will compute the solution graph for the root of its parse tree. For smaller patterns, this means that the solution graph for the whole parse tree will already be computed and the query response can be given immediately upon seeing the query. We store additional structures to deal with larger patterns. We store a bipartite graph that captures the reachability from the start to the end vertex sets in the solution graph and contains $O(|T|^2)$ edges. An example of this can be seen in Figure 3. The bipartite graph is stored in the form of a Boolean adjacency matrix.

In more detail, suppose one has a bipartite graph $G$ with vertex partitions $W_1$ and $W_2$, each of size $|T|$. Then given a subset $S$ of vertices in $W_1$, determining which vertices in $W_2$ are reachable from $S$ can be done by multiplying a Boolean matrix $M$ of size $|T|^2$ with a vector $s$ of size $|T| \times 1$. The matrix $M$ has a 1 in entry $(i, j)$ iff $u_i \in W_2$ is reachable from $v_j \in W_1$. The vector $s$ has a 1 for every entry in $S$ and a 0 otherwise. The resulting product $Ms$ shows the reachable vertices in $W_2$. The space required for storing all of these matrices is $O(|T|^2 \cdot 2^{O(\tau)})$.
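As a small illustration of reachability via a Boolean matrix-vector product, the sketch below packs each row of $M$ into a Python integer so that one bitwise AND tests a whole row against $s$. The bit-packing and the helper names `pack` and `reachable` are choices made for this example, not part of the construction above.

```python
def pack(bits):
    """Pack a 0/1 list into an integer; bit j of the result is bits[j]."""
    return sum(b << j for j, b in enumerate(bits))


def reachable(rows, S):
    """Indices i of W_2 with (M s)_i = 1, where s is the indicator vector of S.

    rows[i] is row i of M packed with pack(); M[i][j] = 1 iff u_i in W_2
    is reachable from v_j in W_1.
    """
    s = sum(1 << j for j in S)
    return {i for i, row in enumerate(rows) if row & s}


M = [[1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]
rows = [pack(r) for r in M]
from_v2 = reachable(rows, {2})
```

With $S = \{v_2\}$ (0-indexed), the reachable end vertices are $u_1$ and $u_2$, the rows that have a 1 in column 2.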

Constructing the Solution Graph at Query Time
A solution to a query is now a regular expression parse tree where every subtree of size τ or less has been replaced with a single vertex whose solution graph has been precomputed and is represented with a Boolean matrix. To obtain this final solution graph we proceed as described in the last section, recursively computing the solution until we obtain the graph at the root of the full parse tree. In doing so, we treat the precomputed solution graphs as merely a set of start and end vertices.
The total time to do this merging and obtain the final solution graph is O(

Querying Matches via Graph Traversal
Adding an additional start vertex $s$ and edges $\{(s, v^1_i) \mid 1 \le i \le |T| + 1\}$ (see Figure 4), the query from $s$ to detect reachability to the final set of ending vertices can be done by a graph traversal. During a traversal, when a subgraph represented by a Boolean matrix is arrived at, matrix multiplication can be performed (when beneficial) to traverse the subgraph more efficiently.

Figure 4. On the left, we have the querying strategy to find all ending vertices corresponding to pattern matches. On the right, we reverse the edges in $G$ to form $G^R$. The graph $G^R$ can be used to find starting vertices that correspond to the starting indices of pattern matches.

For general regular expressions, the star operator allows for the same subgraph to be visited multiple times. In the worst case, we return up to $|T|$ times to a subgraph for which we may have the Boolean matrix representation. Rather than use Boolean matrix multiplication, we take the adjacencies of each starting vertex. This results in a query time of $O(|T|^2 |p| / \tau)$. We reparameterize by setting $\tau = \Theta(\tau' \cdot |T|)$ for a new parameter $\tau'$. This makes the query time complexity $O(|T| \cdot |p| / \tau')$ and changes the space from $|T|^2 2^{O(\tau)}$ to $\mathrm{Exp}(\tau' |T|)$.
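The effect of the reparameterization can be checked directly. In the sketch below, $\tau'$ denotes the new parameter and $c$ the constant hidden in the $\Theta$; both symbols are notation introduced here for illustration.

```latex
\tau = c\,\tau'\,|T|
\;\Longrightarrow\;
\frac{|T|^{2}\,|p|}{\tau} = \frac{|T|\,|p|}{c\,\tau'} = O\!\left(\frac{|T|\,|p|}{\tau'}\right),
\qquad
|T|^{2}\cdot 2^{O(\tau)} = 2^{\,2\log|T| + O(\tau'|T|)} = 2^{O(\tau'|T|)} = \mathrm{Exp}(\tau'|T|).
```

The last step uses $2\log|T| = O(\tau'|T|)$, which holds because $\tau' \ge 1$.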

Theorem 3. There exists a solution to answer existential regular expression queries with query time $O(|T| \cdot |p| / \tau)$, requiring $\mathrm{Exp}(\tau|T|)$ space and preprocessing time.

Restricted Regular Expressions
If we restrict ourselves to patterns with only concatenation and or operators (restricted regular expressions), then using breadth-first search, all 'start' vertices of a subgraph are reached on the same level of the search. Hence for restricted regular expressions, we can better utilize the Boolean matrix representations of subgraphs, taking advantage of techniques developed for performing this multiplication more efficiently. Based on whether we desire a deterministic solution or a solution that is correct whp, we obtain different space and preprocessing times.

Lemma 4 (Williams [34]). For all $\varepsilon \in (0, 1/2)$, every $n \times n$ Boolean matrix $A$ can be preprocessed in $O(n^{2+\varepsilon})$ time such that every subsequent multiplication of $A$ with an arbitrary Boolean $n$-vector $x$ can be performed in $O(n^2 / (\varepsilon \log n)^2)$ time, on a pointer machine or a $(\log n)$-word RAM.
Applying Lemma 4 with $\varepsilon = 1/4$, the query time complexity becomes We again reparameterize, setting τ = Θ τ 1 + |T|

The added time complexity of this operation is $O(|T|^3)$ in total per matrix. This step is necessary to ensure the amortization techniques used in [20] will work. The traversal is done in the same way as in the deterministic solution, except that each matrix-vector multiplication is repeated multiple times, and the most frequent of these solutions is taken as the output vector. We will show that under the condition $|p| \le |T|^z$ for some $z = 2^{o(\sqrt{\log |T|})}$, to maintain a high probability of success of the overall algorithm it is sufficient to perform $z \cdot \log |T|$ repetitions of each matrix multiplication. Therefore, each matrix during a query requires $(z \cdot |T|^2 \log |T|) / 2^{\Omega(\sqrt{\log |T|})}$ time. That yields an amortized query time of where queries succeed whp. We reparameterize, setting τ = Θ τ · 1 +

Preserving High Probability of Success
We know that our matrix-vector multiplication algorithm from [20] succeeds with probability $1 - \frac{1}{|T|^{\varepsilon}}$ for some $\varepsilon > 0$. Our algorithm performs matrix-vector multiplication up to $\frac{|p|}{\tau} \le |p|$ times. We assume the correctness of these matrix multiplications is independent. In the worst case, each of these has to be correct for our final answer to be correct. The probability that these are all correct is given by

$$\left(1 - \frac{1}{|T|^{\varepsilon}}\right)^{|p|/\tau} \ge 1 - \frac{|p|/\tau}{|T|^{\varepsilon}}$$

for $\frac{|p|}{\tau} > 1$ and $\frac{1}{|T|^{\varepsilon}} < 1$.

Using the assumption that $|p| \le |T|^z$, this is bounded below by $1 - \frac{1}{|T|^{\varepsilon - z}}$. We will aim to preserve success whp by 'amplifying' $\varepsilon$ to some $\varepsilon' > z$. The technique we will use to do this is repeating each matrix multiplication a sufficient number of times (denoted by $k$) and using the most frequently resulting vector (the mode) as the solution.

Let $X_i$ be the random variable which is 1 if the multiplication on trial $i$ is correct and 0 otherwise. Let $1 - \frac{1}{|T|^{\varepsilon}} =: q > \frac{1}{2}$ be the probability with which the original matrix-vector multiplication algorithm is correct. The probability that the mode is correct is greater than or equal to the probability that the correct solution is output at least $\frac{k}{2}$ times. We write the latter quantity as one minus the probability that the correct answer is output less than half the time. Applying the Chernoff bound, we get

$$\Pr\left[\sum_{i=1}^{k} X_i \ge \frac{k}{2}\right] = 1 - \Pr\left[\sum_{i=1}^{k} X_i < \frac{k}{2}\right] \ge 1 - \exp\left(-\frac{k}{2q}\left(q - \frac{1}{2}\right)^2\right).$$

Recall that we want our probability to be at least $1 - \frac{1}{|T|^{\varepsilon'}}$ for some $\varepsilon' > z$. Therefore we set the right-hand side to $1 - \frac{1}{|T|^{\varepsilon'}}$ and solve for $\varepsilon'$ to obtain $\varepsilon' = \frac{k}{2q \ln |T|}\left(q - \frac{1}{2}\right)^2$.

Since we want $\varepsilon' > z$, it suffices to find the number of trials $k$ such that $\frac{k}{2q \ln |T|}\left(q - \frac{1}{2}\right)^2 > z$ or, equivalently, $k > \frac{2qz \ln |T|}{\left(q - \frac{1}{2}\right)^2}$.
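To get a feel for the magnitudes involved, the following Python sketch evaluates the bound on the number of trials and the resulting amplified exponent. It assumes the closed forms $k > 2qz \ln |T| / (q - 1/2)^2$ and exponent $= k (q - 1/2)^2 / (2q \ln |T|)$ from the derivation above; the function names and the sample values of $|T|$, $\varepsilon$, and $z$ are illustrative.

```python
import math


def trials_needed(q, z, text_len):
    """Smallest integer k with k > 2 q z ln|T| / (q - 1/2)^2."""
    bound = 2 * q * z * math.log(text_len) / (q - 0.5) ** 2
    return math.floor(bound) + 1


def amplified_exponent(k, q, text_len):
    """Exponent k (q - 1/2)^2 / (2 q ln|T|) obtained after k repetitions."""
    return k * (q - 0.5) ** 2 / (2 * q * math.log(text_len))


text_len = 1024
eps = 1.0                       # per-multiplication success 1 - 1/|T|^eps
q = 1 - 1 / text_len ** eps
z = 2.0                         # patterns of length at most |T|^z
k = trials_needed(q, z, text_len)
new_eps = amplified_exponent(k, q, text_len)
```

For fixed $q$ bounded away from $1/2$, the required number of repetitions grows as $\Theta(z \ln |T|)$, in line with the $z \cdot \log |T|$ repetitions used in the randomized solution.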
Obtaining Starting and Ending Positions. For each precomputed solution, we consider the edges reversed and the corresponding adjacency matrix as being precomputed. After constructing the final solution graph, we reverse all of its edges and add a single source vertex $t$ and edges $\{(v^2_i, t) \mid 1 \le i \le |T| + 1\}$ (see Figure 4). We call this graph $G^R$. Using the same traversal techniques on $G^R$ as used on $G$ yields the set of reachable starting vertices in $V_1$.

Discussion
This work contributes to the growing number of fine-grained hardness results for string-related problems. The hardness results presented here belong to a narrower set of hardness results for indexing problems. We also provide here a solution to the indexing problem by utilizing recent innovations in Boolean matrix multiplication. Although requiring an exponential amount of space in terms of the length of the text, this solution allows for faster query times. A future research direction is to more elegantly incorporate the star operator into the above ideas, allowing for more space-efficient indexes that answer queries containing this operator.