Subpath Queries on Compressed Graphs: a Survey

Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text $T$, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in $T$ in time proportional to the query's length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index can be stored in space proportional to the text's entropy. These contributions had an enormous impact in bioinformatics: nowadays, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing us to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today's compressed indexes for labeled graphs and regular languages.


Introduction
Consider the classic algorithmic problem of finding the occurrences of a particular string P (a pattern) in a text T. Classic algorithms such as Karp-Rabin's [1], Boyer-Moore's [2], and Knuth-Morris-Pratt's [3] are optimal when both the text and the pattern are part of the query: those algorithms scan the text and find all occurrences of the pattern in linear time. What if the text is known beforehand and only the patterns to be found are part of the query? In this case, it is conceivable that preprocessing T off-line into a fast and small data structure (an index) might be preferable to the above on-line solutions. As it turns out, this is exactly the case. The full-text indexing problem (where full refers to the fact that we index the full set of T's substrings) has been studied for over forty years and has reached a very mature and exciting point: modern algorithmic techniques allow us to build text indexes taking a space close to that of the compressed text and able to count/locate occurrences of a pattern inside it in time proportional to the pattern's length. Note that we have emphasized two fundamental features of these data structures: query time and index space. As shown by decades of research on the topic, these two dimensions are, in fact, strictly correlated: the structure exploited to compress text is often the same that can be used to support fast search queries on it. Recent research has taken a step forward, motivated by the increasingly complex structure of modern massive datasets: texts can be viewed as directed labeled path graphs, and compressed indexing techniques can actually be generalized to more complex graph topologies. While compressed text indexing has already been covered in the literature in excellent books [4,5] and surveys [6][7][8], the generalizations of these advanced techniques to labeled graphs, dating back two decades, lack a single point of reference despite having reached a mature state of the art.
The goal of this survey is to introduce the (non-expert) reader to the fascinating field that studies compressed indexes for labeled graphs.
We start in Section 2 with a quick overview of classical compressed text indexing techniques: compressed suffix arrays and indexes for repetitive collections. This section serves as a self-contained warm-up to fully understand the concepts contained in the next sections. Section 3 starts with an overview of the problem of compressing graphs, with a discussion of known lower bounds for the graph indexing problem, and with preliminary solutions based on hypertext indexing. We then introduce prefix sorting and discuss its extensions to increasingly complex graph topologies. Subsection 3.5 is devoted to the problem of indexing trees, the first natural generalization of classical labeled paths (strings). Most of the discussion in this section is spent on the eXtended Burrows-Wheeler Transform (XBWT), a tree transformation reflecting the co-lexicographic order of the root-to-node paths of the tree. We discuss how this transformation naturally supports subpath queries and compression by generalizing the ideas of Section 2. Section 3.6 adds further diversity to the set of indexable graph topologies by discussing the cases of sets of disjoint cycles and de Bruijn graphs. All these cases are finally generalized in Section 3.7 with the introduction of Wheeler graphs, a notion capturing the idea of a totally-sortable labeled graph. This section discusses the problem of compressing and indexing Wheeler graphs, as well as recognizing and sorting them. We also spend a paragraph on the fascinating bridge that this technique builds between compressed indexing and formal language theory by briefly discussing the elegant properties of Wheeler languages: regular languages recognized by finite automata whose state transition function is a Wheeler graph.
In fact, we argue that a useful variant of graph indexing is regular language indexing: in many applications (for example, computational pan-genomics [9]) one is interested in indexing the set of strings read on the paths of a labeled graph, rather than indexing a fixed graph topology. As a matter of fact, we will see that the complexities of these two problems are quite different. Finally, Section 3.8 further generalizes prefix sorting to arbitrary labeled graphs, allowing us to index any regular language. The key idea of this generalization is to abandon total orders (of prefix-sorted states) in favor of partial ones. We conclude our survey in Section 4 with a list of open challenges in the field.

Terminology
A string S of length n over alphabet Σ is a sequence of n elements from Σ. We use the notation S[i] to indicate the i-th element of S, for 1 ≤ i ≤ n. Let a ∈ Σ and S ∈ Σ*. We write Sa to indicate the concatenation of S and a. A string can be interpreted as an edge-labeled graph (a path) with n + 1 nodes connected by n labeled edges. In general, a labeled graph G = (V, E, Σ, λ) is a directed graph with set of nodes V, set of edges E ⊆ V × V, alphabet Σ, and labeling function λ : E → Σ. Quantities n = |V| and e = |E| will indicate the number of nodes and edges, respectively. We assume to work with an effective alphabet Σ of size σ, that is, every c ∈ Σ is the label of some edge. In particular, σ ≤ e. We moreover assume the alphabet to be totally ordered by an order we denote with ≤, and write a < b when a ≤ b and a ≠ b. In this survey we consider two extensions of ≤ to strings (and, later, to labeled graphs). The lexicographic order of two strings aS and a′S′ is defined as a′S′ < aS if and only if either (i) a′ < a or (ii) a′ = a and S′ < S hold. The empty string is always smaller than any non-empty string. Symmetrically, the co-lexicographic order of two strings Sa and S′a′ is defined as S′a′ < Sa if and only if either (i) a′ < a or (ii) a′ = a and S′ < S hold.
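The two orders can be made concrete with a short Python sketch (the function names are ours): since Python compares strings lexicographically, the co-lexicographic order is obtained by simply comparing the reversed strings.

```python
def lex_less(a: str, b: str) -> bool:
    """True iff a is lexicographically smaller than b
    (Python's built-in string order is lexicographic)."""
    return a < b

def colex_less(a: str, b: str) -> bool:
    """True iff a is co-lexicographically smaller than b:
    compare the two strings right-to-left."""
    return a[::-1] < b[::-1]

# "ba" comes before "ab" co-lexicographically, since, reversed,
# "ab" < "ba" holds lexicographically.
```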
This paper deals with the indexed pattern matching problem: preprocess a text so that, later, all text occurrences of any query pattern Π ∈ Σ^m of length m can be efficiently counted and located. These queries can be generalized to labeled graphs; we postpone the exact definition of this generalization to Section 3.

The Labeled Path Case: Indexing Compressed Text
Consider the text reported in Figure 1. For reasons that will be clear later, we append a special symbol $ at the end of the text, and assume that $ is lexicographically smaller than all other characters. Note that we assign a different color to each distinct letter. The question is: how can we design a data structure that allows us to efficiently find the exact occurrences of (short) strings inside T? In particular, one may wish to count the occurrences of a pattern (for example, count(T, "AT") = 3) or to locate them (for example, locate(T, "AT") = 1, 3, 7). As we will see, there are two conceptually different ways to solve this problem. The first, sketched in Subsection 2.1, is to realize that every occurrence of Π = "AT" in T is a prefix of a text suffix (for example: the occurrence at position 7 of "AT" is a prefix of the text suffix "AT$"). The second, sketched in Subsection 2.2, is to partition the text into non-overlapping phrases appearing elsewhere in the text and divide the problem into the two sub-problems of (i) finding occurrences that overlap two adjacent phrases and (ii) finding occurrences entirely contained in a phrase. Albeit conceptually different, both approaches ultimately resort to suffix sorting: sorting lexicographically a subset of the text's suffixes. The curious reader can refer to the excellent reviews of Mäkinen and Navarro [6] and Navarro [7,8] for a much deeper coverage of the state of the art relative to both approaches.

The Entropy Model: Compressed Suffix Arrays
The sentence every occurrence of Π is the prefix of a suffix of T leads very quickly to a simple and time-efficient solution to the full-text indexing problem. Note that a suffix can be identified by a text position: for example, suffix "GAT$" corresponds to position 6. Let us sort text suffixes in lexicographic order. We call the resulting integer array the suffix array (SA) of T, see Figure 2. This time, we color text positions i according to the color of letter T[i]. Note that colors (letters) get clustered, since suffixes are sorted lexicographically. The reason for appending $ at the end of the text is that, in this way, no suffix prefixes another suffix. Suffix arrays were independently discovered by Udi Manber and Gene Myers in 1990 [10] and (under the name of PAT array) by Gonnet, Baeza-Yates and Snider in 1992 [11,12]. An earlier solution, the suffix tree [13], dates back to 1973 and is more time-efficient, albeit much less space efficient (by a large constant factor). The idea behind the suffix tree is to build the trie of all text's suffixes, replacing unary paths with pairs of pointers to the text.
The suffix array SA, when used together with the text T, is a full-text index: in order to count/locate occurrences of a pattern Π, it is sufficient to binary search SA, extracting characters from T to compare Π with the corresponding suffixes during search. This solution permits counting the occ occurrences of a pattern Π of length m in a text T of length n in O(m log n) time, and reporting them in additional optimal O(occ) time. Denoting the alphabet size by σ, our index requires n log σ + n log n bits to be stored (the first component for the text and the second for the suffix array).
Note that the suffix array cannot be considered a small data structure. On a constant-sized alphabet, the term n log n is asymptotically larger than the text. The need for processing larger and larger texts made this issue relevant by the end of the millennium. As an instructive example, consider indexing the Human genome (≈ 3 billion DNA bases). Since DNA is a string over an alphabet of size four ({A, C, G, T}), the Human genome takes less than 750 MiB of space to be stored using two bits per letter. The suffix array of the Human genome requires, on the other hand, about 11 GiB.
Although the study of compressed text indexes had begun before the year 2000 [14], the turn of the millennium represented a crucial turning point for the field: two independent works by Ferragina and Manzini [15] and Grossi and Vitter [16] showed how to compress the suffix array.
Consider the integer array ψ of length n defined as follows. Given a suffix array position i containing value (text position) j = SA[i], ψ(i) is defined as the suffix array position containing value j + 1 (or the position containing 1, if j = n): in other words, ψ(i) tells us where, in the suffix array, we can find the text position that immediately follows SA[i]. Note that, since suffixes starting with the same character are sorted by the remaining part, ψ is increasing within the suffix array region associated with each first character. Let us now store just the differences between consecutive values in each increasing subsequence of ψ. We denote this new array of differences as ∆(ψ). See Figure 4 for an example. Two remarkable properties of ∆(ψ) are that, using an opportune encoding for its elements, this sequence: 1. supports accessing any ψ[i] in constant time [16], and 2. can be stored in nH_0 + O(n) bits of space, where H_0 = ∑_{c∈Σ} (n_c/n) log(n/n_c) is the zero-order empirical entropy of T and n_c denotes the number of occurrences of c ∈ Σ in T.
The fact that this strategy achieves compression is actually not hard to prove. Consider any integer encoding (for example, Elias' delta or gamma) capable of representing any integer x in O(1 + log x) bits. By the concavity of the logarithm function, the inequality ∑_{i=1}^{m} log x_i ≤ m log((∑_{i=1}^{m} x_i)/m) holds for any integer sequence x_1, . . . , x_m. Now, note that the sub-sequence x_1, . . . , x_{n_c} of ∆(ψ) corresponding to letter c ∈ Σ has two properties: it has exactly n_c terms and ∑_{i=1}^{n_c} x_i ≤ n. It follows that the encoded sub-sequence takes (asymptotically) ∑_{i=1}^{n_c} (1 + log x_i) ≤ n_c · log(n/n_c) + n_c bits. By summing this quantity over all the alphabet's characters, we obtain precisely O(n(H_0 + 1)) bits. Other encodings (in particular, Elias-Fano dictionaries [17,18]) can achieve the claimed nH_0 + O(n) bits while supporting constant-time random access.
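To make this concrete, here is a small Python sketch (again assuming the running example T = ATATAGAT$): ψ is obtained from the suffix array through its inverse, and ∆(ψ) keeps only the small differences within each increasing subsequence.

```python
text = "ATATAGAT$"                       # assumed running example
sa = sorted(range(len(text)), key=lambda i: text[i:])
n = len(sa)

isa = [0] * n                            # inverse suffix array
for pos, j in enumerate(sa):
    isa[j] = pos

# psi[i] = suffix array position holding text position sa[i] + 1 (mod n)
psi = [isa[(j + 1) % n] for j in sa]

# Delta(psi): inside each run of suffixes sharing the same first
# character, psi is increasing, so we keep only the differences
delta = [psi[i] - psi[i - 1]
         if i > 0 and text[sa[i]] == text[sa[i - 1]] else psi[i]
         for i in range(n)]
# psi == [4, 5, 6, 7, 8, 2, 0, 1, 3]; delta == [4, 5, 1, 1, 1, 2, 0, 1, 2]
```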
The final step is to recognize that ψ moves us forward (by one position at a time) in the text. This allows us to extract suffixes without using the text. To achieve this, it is sufficient to store in one array F = $AAAAGTTT (using our running example) the first character of each suffix. This can be done in O(n) bits of space (using a bitvector) since those characters are sorted and we assume the alphabet to be effective. The i-th text suffix (in lexicographic order) is then F[i] · F[ψ(i)] · F[ψ^2(i)] · F[ψ^3(i)] · · · , where ψ^ℓ indicates function ψ applied ℓ times to its argument. See Figure 5. Extracting suffixes makes it possible to implement the binary search algorithm discussed in the previous section: the Compressed Suffix Array (CSA) takes compressed space and enables us to find all pattern occurrences in time proportional to the pattern length. By adding a small sampling of the suffix array, one can use the same solution to compute any value SA[i] in polylogarithmic time without asymptotically affecting the space usage.
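A short sketch of this mechanism (same assumed running example): the array F and the function ψ alone suffice to read any suffix, one character at a time, without the text.

```python
text = "ATATAGAT$"                      # assumed running example
sa = sorted(range(len(text)), key=lambda i: text[i:])
n = len(sa)
isa = [0] * n
for pos, j in enumerate(sa):
    isa[j] = pos
psi = [isa[(j + 1) % n] for j in sa]

F = "".join(text[j] for j in sa)        # first character of each suffix

def extract(i: int, length: int) -> str:
    """Read `length` characters of the i-th suffix (in lexicographic
    order) using only F and psi, never touching `text`."""
    out = []
    for _ in range(length):
        out.append(F[i])
        i = psi[i]                      # move forward by one text position
    return "".join(out)

# F == "$AAAAGTTT"; extract(2, 3) == "AT$" (the third smallest suffix)
```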
Another (symmetric) philosophy is to exploit the inverse of function ψ. These indexes are based on the Burrows-Wheeler transform (BWT) [19] and achieve high-order compression [15]. We do not discuss BWT-based indexes here since their generalizations to labeled graphs will be covered in Section 3. For more details on entropy-compressed text indexes, we redirect the curious reader to the excellent survey of Mäkinen and Navarro [6].

The Repetitive Model
Since their introduction in the year 2000, entropy-compressed text indexes have had a dramatic impact in domains such as bioinformatics: the most widely used DNA aligners, Bowtie [20] and BWA [21], are based on compressed suffix arrays and can align thousands of short DNA fragments per second on large genomes while using only compressed space in RAM during execution. As seen in the previous section, these indexes operate within a space bounded by the text's empirical entropy [15,16]. Entropy, however, is insensitive to long repetitions: the entropy-compressed version of T · T (that is, text T concatenated with itself) takes at least twice the space of the entropy-compressed T [22]. There is a simple reason for this fact: by its very definition (see Section 2.1), the quantity H_0 depends only on the characters' relative frequencies in the text. Since characters in T and T · T have the same relative frequencies, it follows that their entropies are the same. The claim follows, since T · T is twice as long as T.
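This argument can be verified directly (a minimal sketch; H_0 is the zero-order entropy defined in Section 2.1):

```python
from collections import Counter
from math import log2

def H0(text: str) -> float:
    """Zero-order empirical entropy: sum_c (n_c/n) * log(n/n_c)."""
    n = len(text)
    return sum((nc / n) * log2(n / nc) for nc in Counter(text).values())

T = "ATATAGAT"          # any text works: only relative frequencies matter
assert abs(H0(T) - H0(T + T)) < 1e-12
# hence |T.T| * H0(T.T) = 2 * |T| * H0(T): the space bound doubles
```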
While the above reasoning might seem artificial, the same problem arises on texts composed of large repetitions. Nowadays, most large data sources such as DNA sequencers and the web follow this new model of highly repetitive data. As a consequence, entropy-compressed text indexes are no longer able to keep pace with the exponentially-increasing rate at which this data is produced, as very often the mere index size exceeds the RAM limit. For this reason, in recent years more powerful compressed indexes have emerged; these are based on the Lempel-Ziv factorization [22], the run-length Burrows-Wheeler Transform (BWT) [23][24][25], context-free grammars [26], string attractors [27,28] (combinatorial objects generalizing those compressors), and more abstract measures of repetitiveness [29]. The core idea behind these indexes (with the exception of the run-length BWT, read Section 3) is to partition the text into phrases that occur also elsewhere and to use geometric data structures to locate pattern occurrences.
To get a more detailed intuition of how these indexes work, consider the Lempel-Ziv '78 (LZ78) compression scheme. The LZ78 factorization breaks the text into phrases with the property that each phrase extends by one character a previous phrase, see the example in Figure 6 (top). The mechanism we are going to describe works for any factorization-based compressor (also called dictionary compressors). A factorization-based index keeps a two-dimensional geometric data structure storing a labeled point (x, y, ℓ) for each phrase y, where x is the reversed text prefix preceding the phrase and ℓ is the first text position of the phrase. Efficient techniques not discussed here (in particular, mapping to rank space) exist to opportunely reduce x and y to small integers preserving the lexicographic order of the corresponding strings (see, for example, Kreft and Navarro [22]). For example, such a point in Figure 6 is (GCA, CG, 3), corresponding to the phrase break between positions 3 and 4. To understand how locate queries are answered, consider the pattern Π = CAC. All occurrences of Π crossing a phrase boundary can be split into a prefix and a suffix, aligned on the phrase boundary. Let us consider the split CA|C (all possible splits have to be considered for the following algorithm to be complete). If Π occurs in T with this split, then C will be a prefix of some phrase y, and CA will be a suffix of the text prefix preceding y. We can find all phrases y with this property by issuing a four-sided range query on our grid as shown in Figure 6 (bottom left). This procedure works since C and AC (that is, CA reversed) define ranges on the horizontal and vertical axes: these ranges contain all phrases prefixed by C and all reversed prefixes prefixed by AC (equivalently, all prefixes suffixed by CA), respectively. In Figure 6 (bottom left), the queried rectangle contains text positions 13 and 14.
By subtracting from those numbers the length of the pattern's prefix (in this example, 2 = |CA|), we discover that Π = CAC occurs at positions 11 and 9 crossing a phrase with split CA|C.
Note that the pattern defines the query, while each phrase generates a point that is stored permanently in the geometric structure.
With a similar idea (that is, resorting again to geometric data structures), one can recursively track occurrences completely contained inside a phrase, see the caption of Figure 6 (bottom right). Let b be the number of phrases in the parse. Note that our geometric structures store overall O(b) points. On repetitive texts, the smallest possible number b of phrases of such a parse can be constant. In practice, on repetitive texts b (or any of its popular approximations [30]) is orders of magnitude smaller than the entropy-compressed text [22,23]. The fastest existing factorization-based indexes to date take O(b polylog n) bits of space and locate all occ occurrences of a pattern Π ∈ Σ^m in optimal O(m + occ) time [31]. For more details on indexes for repetitive text collections, the curious reader can refer to the excellent recent surveys of Navarro [7,8].
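As a concrete reference point, the LZ78 parse and the points fed to the geometric structure can be sketched as follows (a simplified illustration on our assumed running example ATATAGAT$; mapping the string coordinates to rank space is omitted):

```python
def lz78_parse(text: str) -> list[str]:
    """Greedy LZ78: each phrase is a previously seen phrase
    (possibly empty) extended by one character."""
    seen = {""}
    phrases, cur = [], ""
    for c in text:
        if cur + c in seen:
            cur += c
        else:
            seen.add(cur + c)
            phrases.append(cur + c)
            cur = ""
    if cur:                              # a final, repeated phrase
        phrases.append(cur)
    return phrases

text = "ATATAGAT$"
phrases = lz78_parse(text)               # ['A', 'T', 'AT', 'AG', 'AT$']

# one labeled point per phrase: (reversed prefix, phrase, 1-based start)
points, pos = [], 0
for ph in phrases:
    points.append((text[:pos][::-1], ph, pos + 1))
    pos += len(ph)
```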

Indexing Labeled Graphs and Regular Languages
Interestingly enough, most techniques seen in the previous section extend to more structured data. In particular, in this section we will work with labeled graphs, and extend these results to regular languages by interpreting finite automata as labeled graphs. We recall that a labeled graph is a quadruple (V, E, Σ, λ) where V is a set of n = |V| nodes, E ⊆ V × V is a set of e = |E| directed edges, Σ is the alphabet and λ : E → Σ is a labeling function assigning a label to each edge.
A path is a sequence of k edges such that adjacent edges share a node; we extend function λ to paths as λ(e_1, . . . , e_k) = λ(e_1) · λ(e_2) · · · λ(e_k), that is, the concatenation of the edges' labels. The starting point is to observe that texts are nothing but labeled path graphs. As it turns out, there is nothing special about paths that we cannot generalize to more complex topologies. We adopt the following natural generalizations of count and locate queries:
• Count: given a pattern (a string) Π ∈ Σ^m, return the number of nodes reached by a path labeled with Π.
• Locate: given a pattern Π, return a representation of all nodes reached by a path labeled with Π.
We will refer to the set of these two queries with the term subpath queries. The node representation returned by locate queries could be an arbitrary numbering. This solution, however, has the disadvantage of requiring at least n log n bits of space just for the labels, n being the number of nodes. In the following subsections we will discuss more space-efficient solutions based on the idea of returning a more regular labeling (e.g. the DFS order of the nodes).
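Operationally, both queries can be answered on-line by a naive algorithm that scans the pattern once, keeping the set of nodes reached so far (a sketch on a hypothetical toy graph; each character costs O(e), giving the O(m · e) behavior discussed below):

```python
def locate_nodes(edges: list[tuple[int, int, str]],
                 pattern: str) -> set[int]:
    """Nodes reached by a path labeled `pattern`.
    `edges` is a list of (source, target, label) triples."""
    nodes = {u for u, v, c in edges} | {v for u, v, c in edges}
    reached = nodes                 # matching paths may start anywhere
    for c in pattern:
        reached = {v for u, v, lab in edges if lab == c and u in reached}
    return reached

# hypothetical toy graph
edges = [(1, 2, "a"), (2, 3, "a"), (3, 4, "c"), (2, 5, "c")]
# locate_nodes(edges, "ac") == {4, 5}, so count is 2
```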

Graph Compression
We start with a discussion of existing techniques for compressing labeled graphs. Lossless graph compression is a vast topic that has been treated more in detail in other surveys [32,33]. Since this survey deals with subpath queries on compressed graphs, in this paragraph we only discuss compression techniques that have been shown to support these queries on special cases of graphs (mainly paths and trees).
Note that, differently from the string domain, the goal is now to compress two components: the labels and the graph's topology. Labels can be compressed by extending the familiar entropy model to graphs. The idea here is to simply count the frequencies of each character in the multi-set of all the edges' labels. By using an encoding such as Huffman's, the zero-order empirical entropy of the labels can be approached. One can take a step further and compress the labels to their high-order entropy. As noted by Ferragina et al. [34], if the topology is a tree then we can use a different zero-order model for each context of length k (that is, for each distinct labeled path of length k) preceding a given edge. This is the same observation used to compress texts: it is much easier to predict a particular character if we know the k letters (or the path of length k) that precede it. The entropy model is already very effective in compressing the labels component. The most prominent example of an entropy-compressed tree index is the eXtended Burrows-Wheeler transform [34], covered in more detail in Section 3.5. As far as the topology is concerned, things get more complicated. The topology of a tree with n nodes can be represented in 2n bits via its balanced-parentheses representation. Ferres et al. [35] used this representation and the idea that a planar graph is fully specified by a spanning tree of the graph and a spanning tree of its dual to represent planar graphs in just 4n bits. Recently, Chakraborty et al. proposed succinct representations for finite automata [36]. General graph topologies are compressed well, but without worst-case guarantees, by techniques such as K²-trees [37]. Further compression can be achieved by extending the notion of entropy to the graph's topology, as shown by Jansson et al. [38], Hucke et al. [39], and Ganczorz [40] for the particular case of trees.
As it happens with text entropy, their measures work well under the assumption that the tree topology is not extremely repetitive. See Hucke et al. [41] for a systematic comparison of tree entropy measures. In the repetitive case, more advanced notions of compression need to be considered. Straight-line grammars [42,43], that is, context-free grammars generating a single graph, are the gold standard for capturing such repetitions. These representations can be augmented to support constant-time traversal of the compressed graph [44], but do not support indexing queries on topologies more complex than paths. Other techniques (defined only for trees) include the Lempel-Ziv factorization of trees [45] and top trees [46]. Despite being well suited for compressing repetitive tree topologies, these representations too are not (yet) able to support efficient indexing queries on topologies more complex than paths, so we do not discuss them further. The recent tunneling [47] and run-length XBW Transform [48] compression schemes are the first compressed representations of repetitive topologies supporting count queries (the latter technique supports also locate queries, but it requires additional linear space). These techniques can be applied to Wheeler graphs [49] as well, and are covered in more detail in Sections 3.5 and 3.7.

Conditional Lower Bounds
Before diving into compressed indexes for labeled graphs, we spend a paragraph on the complexity of matching strings on labeled graphs. The main results in this direction are due to Backurs and Indyk [50] and to Equi et al. [51][52][53], and considered the on-line version of the problem: both pre-processing and query time are counted towards the total running time. The former work [50] proved that, unless the Strong Exponential Time Hypothesis (SETH) [54] is false, in the worst case no algorithm can match a regular expression of size e against a string of length m in time O((m · e)^(1−δ)), for any constant δ > 0. Since regular expressions can be converted into NFAs of the same asymptotic size, this result implies a quadratic conditional lower bound for the graph pattern matching problem. Equi et al. [53] improved this result to include graphs of maximum node degree equal to two and deterministic directed acyclic graphs. Another recent paper, by Potechin and Shallit [55], established a similar hardness result for the problem of determining whether an NFA accepts a given word: they showed that, provided that the NFA is sparse, a sub-quadratic algorithm for the problem would falsify SETH. All these results are based on reductions from the orthogonal vectors problem: find two orthogonal vectors in two given sets of binary d-dimensional vectors (one can easily see the similarity between this problem and string matching). The orthogonal vectors hypothesis (OV) [56] states that the orthogonal vectors problem cannot be solved in strongly subquadratic time. A further improvement has recently been made by Gibney [57], who proved that even shaving logarithmic factors from the running time O(m · e) would yield surprising new results in complexity theory, as well as falsifying a hypothesis recently made by Abboud and Bringmann [58] on the fine-grained complexity of SAT solvers.
As noted above, in the context of our survey these lower bounds imply that the sum of the construction and query times of a graph index cannot be sub-quadratic unless important conjectures in complexity theory fail. These lower bounds, however, do not rule out the possibility that a graph index could support efficient (say, subquadratic) queries at the cost of an expensive (quadratic or more) index construction algorithm. The more recent work of Equi et al. [51,52] addressed precisely this issue: they proved that no index that can be built in polynomial time (O(e^α) for any constant α ≥ 1) can guarantee strongly sub-quadratic query times, that is, O(e^δ m^β) query time for any constants δ < 1 or β < 1. This essentially settles the complexity of the problem: assuming a reasonably fast (polynomial-time) index construction algorithm, subpath-query times need to be at least quadratic (O(m · e)) in the worst case. Since these bounds are matched by existing on-line algorithms [59], one may wonder what is the point of studying graph indexes at all. As we will see, the answer lies in parameterized complexity: even though pattern matching on graphs/regular expressions is hard in the worst case, it is indeed possible to solve the problem efficiently in particular cases (for example, trees [34] and particular regular expressions [50]). Ultimately, in Section 3.8 we will introduce a complete hierarchy of labeled graphs capturing the hardness of the problem.

Hypertext Indexing
Prior to the predominant prefix-sorting approach that we are going to discuss in detail in the next subsections, the problem of solving indexed path queries on labeled graphs was tackled in the literature by resorting to geometric data structures [60,61]. These solutions work in the hypertext model, where the graph's nodes are long strings and directed edges between those strings indicate how they are connected according to an arbitrarily complicated topology. In these solutions, a geometric structure is used to "glue" nodes connected by an edge. Pattern occurrences spanning one edge are found by issuing a four-sided geometric query for each possible pattern split as seen in Section 2.2. Pattern occurrences entirely contained in a single node are instead matched using a standard compressed index like the ones discussed in Section 2.1. The main issue with these solutions is that they cannot efficiently locate pattern occurrences spanning two or more edges; the solutions proposed in [60,61], based on seed-and-extend, require visiting the whole graph in the worst case (even though on realistic datasets they do work well). In practice, the problem is mitigated by the fact that the strings stored in each node are assumed to be very long. However, this assumption is often not realistic: for instance, in pan-genomics applications a single-point mutation in the genome of an individual (that is, a substitution or a small insertion/deletion) may introduce very short edges deviating from the population's reference genome. Note that here we have only discussed indexed solutions. The on-line case (pattern matching on hypertext without pre-processing) has been thoroughly studied in several other works, see [59,62,63].

Prefix Sorting: Model and Terminology
A very natural approach to solve efficiently the graph indexing problem on arbitrary labeled graphs is to generalize suffix sorting from strings to labeled graphs. In the following, we will make a slight simplification and work with topologies corresponding to the transition function of nondeterministic finite automata (NFAs). In particular, we will assume that: 1. there is only one node, deemed the source state (or start state), without incoming edges, and 2. any state but the source is reachable from the source state.
We will use the terms node and state interchangeably. For reasons that will become clear later, we will moreover assume that the set of characters labeling the incoming edges of any node is a singleton. These assumptions are not too restrictive. It is not hard to see that these NFAs can, in fact, recognize any regular language [64].
We make a little twist and replace the lexicographic order of suffixes with the symmetric co-lexicographic order of prefixes. This turns out to be much more natural when dealing with NFAs, as we will sort states according to the co-lexicographic order of the strings labeling paths connecting the source with each state. As additional benefits of these choices: 1. we will be able to search strings forward (that is, left-to-right). In particular, this will enable testing membership of words in the language recognized by an indexed NFA in a natural on-line (character-by-character) fashion. 2. We will be able to transfer powerful language-theoretic results to the field of compressed indexing.
For example, a co-lexicographic variant of the Myhill-Nerode theorem [65] will allow us to minimize the number of states of indexable DFAs.

Indexing Labeled Trees
Kosaraju [66] was the first to extend prefix sorting to a very simple class of labeled graphs: labeled trees. To do so, he defined the suffix tree of a (reversed) trie. To understand this idea, we now describe a simplified data structure: the prefix array of a labeled tree.

The Prefix Array of a Labeled Tree
Consider the list containing the tree's nodes sorted by the co-lexicographic order of the paths connecting them to the root. We call this array the prefix array (PA) of the tree. See Figure 7 for an example: node 5 comes before node 8 in the ordering because the path connecting the root to 5, "aac", is co-lexicographically smaller than the path connecting the root to 8, "bc". This node ordering is uniquely defined if the tree is a trie (that is, if each node has at most one outgoing edge per label). Otherwise, the order of nodes reached by the same root-to-node path can be chosen arbitrarily. A nice property of the prefix array of a tree is that, together with the tree itself, it is already an index; in fact, it is a straightforward generalization of the suffix array SA introduced in Section 2 (in this case we call it "prefix array" since we are sorting prefixes of root-to-leaf paths). Since nodes are sorted co-lexicographically, the list of nodes reached by a path labeled with a given pattern Π can be found by binary search. At each search step, we jump on the corresponding node on the tree and compare Π with the labels extracted on the path connecting the node with the root. This procedure allows counting all nodes reached by Π (as well as subsequently reporting them in optimal constant time each) in O(|Π| log n) time, n being the number of nodes.
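To make the mechanism concrete, here is a small Python sketch that builds the prefix array of a trie and answers queries by binary search. The trie below is a made-up example (not the paper's Figure 7), and for simplicity we materialize all root-to-node path labels up front; a real implementation would instead extract labels on the fly at each binary-search step, giving the O(|Π| log n) bound stated above.

```python
from bisect import bisect_left

# A made-up example trie: node -> {outgoing label: child}.
tree = {1: {'a': 2, 'b': 6}, 2: {'a': 3, 'c': 4},
        3: {}, 4: {}, 6: {'a': 7, 'c': 8}, 7: {}, 8: {}}

def root_paths(tree, root):
    """Label of the root-to-u path, for every node u."""
    paths, stack = {root: ""}, [root]
    while stack:
        u = stack.pop()
        for c, v in tree[u].items():
            paths[v] = paths[u] + c
            stack.append(v)
    return paths

paths = root_paths(tree, 1)
# Prefix array: nodes sorted co-lexicographically by their root-to-node
# paths, i.e. lexicographically by the reversed paths.
PA = sorted(paths, key=lambda u: paths[u][::-1])

def locate(pattern):
    """Nodes reached by a path labeled `pattern`: their root-to-node paths
    end with `pattern`, so they form a contiguous range of PA."""
    rev = [paths[u][::-1] for u in PA]
    lo = bisect_left(rev, pattern[::-1])
    hi = bisect_left(rev, pattern[::-1] + '\U0010ffff')
    return PA[lo:hi]
```

For instance, `locate("a")` returns the contiguous PA range of the three nodes whose incoming paths end with 'a'.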

The XBW Transform
One disadvantage of the prefix array is that, like the SA seen in Section 2.1, it does not achieve compression. In addition to the n log n bits for the prefix array itself, the above binary-search strategy also requires navigating the tree (whose topology and labels must therefore be kept in memory). This is much more space than the labels, which require n log σ bits when stored in plain format, and the tree topology, which can be stored in just 2n bits (for example, in balanced-parentheses sequence representation). In 2005, Ferragina et al. [34,67] observed that this space overhead is not necessary: an efficient search machinery can be fit into a space proportional to the entropy-compressed edge labels, plus the succinct (2n + o(n) bits) tree topology. Their structure takes the name XBW tree transform (XBWT in the following), and is a compressed tree representation natively supporting subpath search queries. Consider the co-lexicographic node ordering of Figure 7, denote with child(u) the multi-set of outgoing labels of node u (for the leaves, child(u) = ∅), and with λ(u) the incoming label of node u (for the root, λ(u) = $, where as usual $ is lexicographically smaller than all other characters). Let u_1, . . . , u_n be the prefix array of the tree. The XBWT is the pair of sequences IN = λ(u_1) . . . λ(u_n) and OUT = child(u_1), . . . , child(u_n). See Figure 8 for an example based on the tree of Figure 7. For simplicity, the tree of this example is a trie (that is, a deterministic tree). All the following reasoning, however, applies immediately to arbitrary labeled trees.
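A short sketch of how IN and OUT are obtained, using the same made-up trie encoding (node -> {label: child}) as in the earlier prefix-array sketch; this is illustrative code, not the authors' implementation:

```python
def xbwt(tree, root):
    """Return (IN, OUT): the incoming labels and the outgoing label multisets
    of the nodes, listed in co-lexicographic (prefix array) order."""
    paths, lam, stack = {root: ""}, {root: '$'}, [root]
    while stack:
        u = stack.pop()
        for c, v in tree[u].items():
            paths[v] = paths[u] + c   # root-to-v path label
            lam[v] = c                # incoming label of v
            stack.append(v)
    PA = sorted(paths, key=lambda u: paths[u][::-1])
    IN = ''.join(lam[u] for u in PA)
    OUT = [''.join(sorted(tree[u])) for u in PA]
    return IN, OUT

tree = {1: {'a': 2, 'b': 6}, 2: {'a': 3, 'c': 4},
        3: {}, 4: {}, 6: {'a': 7, 'c': 8}, 7: {}, 8: {}}
IN, OUT = xbwt(tree, 1)
```

Note that IN comes out sorted (it lists incoming labels in co-lexicographic node order), a property exploited for compression later on.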

Inverting the XBWT
As it turns out, the tree can be reconstructed from just OUT. To prove this, we show that the XBWT representation can be used to perform a tree visit. First, we introduce the key property at the core of the XBWT: edge labels appear in the same order in IN and OUT, that is, the i-th occurrence (counting from left to right) of character c ∈ Σ in IN corresponds to the same edge as the i-th occurrence of c in OUT (the order of characters inside each multi-set of OUT is not relevant for the following reasoning to work). For example, consider the fourth 'a' in IN, appearing at IN[5]. This label corresponds to the incoming edge of node 7, that is, to edge (6, 7). The fourth occurrence of 'a' in OUT appears in OUT[6], corresponding to the outgoing edges of node 6. By following the edge labeled 'a' from node 6, we reach exactly node 7; that is, this occurrence of 'a' labels edge (6, 7). Why does this property hold? Precisely because we are sorting nodes co-lexicographically. Take two nodes u < v such that λ(u) = λ(v) = a, for example, u = 2 and v = 7 (note that < indicates the co-lexicographic order, not pre-order). Since u < v, the 'a' labeling edge (π(u), u) precedes the 'a' labeling edge (π(v), v) in sequence IN, where π(u) indicates the parent of u in the tree. In the example, these are the two 'a's appearing at IN[2] and IN[5]. Let α_u and α_v be the two strings labeling the two paths from the root to u and v, respectively. In our example, α_2 = $a and α_7 = $ba (note that we prepend the artificial incoming label of the root). By the very definition of our co-lexicographic order, u < v if and only if α_u < α_v. Note that we can write α_u = α_{π(u)} · a and α_v = α_{π(v)} · a. Then, α_u < α_v holds if and only if α_{π(u)} < α_{π(v)}, i.e. if and only if π(u) < π(v). In our example, α_{π(2)} = α_1 = $ < $b = α_6 = α_{π(7)}, thus it must hold that 1 = π(2) < π(7) = 6 (which, in fact, holds in the example).
This means that the 'a' labeling edge (π(u), u) comes before the 'a' labeling edge (π(v), v) also in sequence OUT. In our example, those are the two 'a's contained in OUT[1] and OUT[6]. We finally obtain our claim: equally-labeled edges appear in the same relative order in IN and OUT.
The XBWT is a generalization to labeled trees of a well-known string transform, the Burrows-Wheeler transform (BWT) [19], described for the first time in 1994: the BWT is precisely sequence OUT of a path tree (to be precise, the original BWT used the symmetric lexicographic order of the string's suffixes). Its corresponding index, the FM-index [15], was first described by Ferragina and Manzini in 2000. The property we just described, allowing the mapping of characters from IN to OUT, is the building block of the FM-index and takes the name LF mapping. The LF mapping can be used to perform a visit of the tree using just OUT. First, note that OUT fully specifies IN: the latter is a sorted list of all characters appearing in OUT, plus character $. Start from the virtual incoming edge of the root, which always appears in IN[1]. The outgoing labels of the root appear in OUT[1]. Using the LF property, we map the outgoing labels of the root to sequence IN, obtaining the locations in PA of the successors of the root. The reasoning can be iterated, ultimately leading to a complete visit of the tree.
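The visit just described can be sketched as follows (a toy reconstruction in Python, not an efficient implementation): from OUT alone we derive IN, then map each label of OUT to its position in IN via the LF property, recovering every edge in co-lexicographic position space.

```python
def invert_xbwt(OUT):
    """Recover the tree's edges (as co-lex positions, 0-based) from OUT alone."""
    # IN is '$' followed by the sorted multiset of all labels appearing in OUT.
    IN = '$' + ''.join(sorted(''.join(OUT)))
    first = {}                          # first position of each character in IN
    for i, c in enumerate(IN):
        first.setdefault(c, i)
    # LF property: the k-th occurrence of c in OUT (left to right) labels the
    # edge entering the node at position first[c] + k of the prefix array.
    seen, edges = {}, []
    for i, labels in enumerate(OUT):
        for c in labels:
            j = first[c] + seen.get(c, 0)
            seen[c] = seen.get(c, 0) + 1
            edges.append((i, j, c))     # edge from PA[i] to PA[j], labeled c
    return edges

# OUT of the toy trie from the earlier sketch.
edges = invert_xbwt(["ab", "ac", "", "", "ac", "", ""])
```

Each recovered triple (i, j, c) says that the node at co-lex position i has a child labeled c at co-lex position j, which is exactly the information needed to drive the tree visit.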

Subpath Queries
The LF mapping can also be used to answer counting queries. Suppose we wish to count the number of nodes reached by a path labeled with string Π[1, m]. For example, consider the tree of Figure 7 and let Π = aa. First, find the range PA[ℓ, r] of nodes on PA reached by Π[1]. This is easy, since characters in IN are sorted: [ℓ, r] is the maximal range such that IN[i] = Π[1] for all ℓ ≤ i ≤ r. In our example, ℓ = 2 and r = 5. Now, note that OUT[ℓ, r] contains the outgoing labels of all nodes reached by Π[1]. Then, we can extend our search by one character by following all edges in OUT[ℓ, r] labeled with Π[2]. In our example, Π[2] = a and there are two edges labeled with 'a' to follow: those at positions OUT[2] and OUT[3]. This requires applying the LF mapping to all those edges. Crucially, note that the LF mapping property also guarantees that the nodes we reach by following those edges form a contiguous co-lexicographic range PA[ℓ′, r′]. In our example, we obtain the range PA[3, 4], containing pre-order nodes 3 and 4. These are precisely the occ = r′ − ℓ′ + 1 nodes reached by Π[1, 2] (and occ is the answer to the count query for Π[1, 2]). It is clear that the reasoning can be iterated until finding the range of all nodes reached by a pattern Π of any length.
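The extension loop above can be written in a few lines of Python (0-based positions; linear scans stand in for the rank queries of a real index; IN and OUT are as in the earlier toy sketches):

```python
def count_range(IN, OUT, pattern):
    """Co-lex range (0-based, inclusive) of the nodes reached by `pattern`,
    or None if no node is reached. count = r - l + 1."""
    if pattern[0] not in IN:
        return None
    # Range of the first character: IN is sorted, so it is contiguous.
    l, r = IN.index(pattern[0]), IN.rindex(pattern[0])
    for c in pattern[1:]:
        # Occurrences of c in OUT before position l, and inside OUT[l..r]
        # (these two linear scans play the role of rank queries).
        before = sum(s.count(c) for s in OUT[:l])
        inside = sum(s.count(c) for s in OUT[l:r + 1])
        if inside == 0:
            return None
        l = IN.index(c) + before        # LF-map the first such edge...
        r = l + inside - 1              # ...and the last one
    return (l, r)
```

On the toy trie of the earlier sketches (IN = "$aaabcc"), the pattern "aa" yields a range of size one, matching the single node reached by "aa" in that trie.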
For clarity, in our discussion above we have ignored efficiency details. If implemented as described, each extension step would require O(n) time (a linear scan of IN and OUT). It turns out that, using up-to-date data structures [68], a single character-extension step can be implemented in just O(log log_w σ) time, where w = Ω(log n) is the machine word size! The idea is, given the range PA[ℓ, r] of Π[1], to locate just the first and last occurrence of Π[2] in OUT[ℓ, r]. This can be implemented with a so-called rank query (that is, the number of characters equal to Π[2] before a given position OUT[i]). We redirect the curious reader to the original articles by Ferragina et al. [34,67] (XBWT) and Belazzougui and Navarro [68] (up-to-date rank data structures) for the exquisite data structure details.
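For intuition, rank queries can be made constant-time by precomputing prefix counts. The class below is a toy stand-in (quadratic space, O(n·σ) counters) for the succinct rank structures of [68], which achieve the same query within o(n log σ) bits of redundancy:

```python
class Rank:
    """rank(c, i) = number of occurrences of c in s[0..i-1], in O(1) time,
    via one snapshot of character counts per prefix (toy illustration only)."""

    def __init__(self, s):
        self.pref = [{}]                 # pref[i] = counts of each char in s[:i]
        for ch in s:
            d = dict(self.pref[-1])      # copy the previous snapshot
            d[ch] = d.get(ch, 0) + 1
            self.pref.append(d)

    def rank(self, c, i):
        return self.pref[i].get(c, 0)
```

With such a structure over the concatenation of OUT, the first and last occurrence of a character inside any range can be found with two rank queries, which is exactly the character-extension step described above.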
As far as locate queries are concerned, as previously mentioned a simple strategy could be to explicitly store PA. In the above example, the result of locate for pattern Π = aa would be the pre-order nodes stored in PA[3, 4]. Storing PA explicitly, however, requires n log n bits. Prezza [48] describes a sampling strategy that solves this problem when the nodes' identifiers are DFS numbers: fix a parameter t ≤ n. The idea is to decompose the tree into Θ(n/t) subtrees of size O(t), and explicitly store in O((n/t) log n) bits the pre-order identifiers of the subtrees' roots. Then, the pre-order of any non-sampled node can be retrieved by performing a visit of a subtree using the XBWT navigation primitives. By fixing t = log^{1+ε} n for any constant ε > 0, this strategy can compute any PA[i] (i.e. locate any pattern occurrence) in polylogarithmic time while using just o(n) bits of additional space. A more advanced mechanism [48] allows locating the DFS number of each pattern occurrence in optimal O(1) time within compressed space.

Compression
To conclude, we show how the XBWT (and, similarly, the BWT of a string) can be compressed. In order to efficiently support the LF mapping, sequence IN has to be explicitly stored. However, note that this sequence is sorted. If the alphabet is effective and of the form Σ = [1, σ], then we can represent IN with a simple bitvector of n bits marking with a bit '1' the first occurrence of each new character. If σ is small (for example, polylog(n)), then this bitvector can be compressed down to o(n) bits while supporting efficient queries [70]. Finally, we concatenate all characters of OUT in a single string and use another bitvector of 2n bits to mark the borders between each OUT[i] in the sequence. Still assuming a small alphabet and using up-to-date data structures [68], this representation takes nH_0 + 2n + o(n) bits of space and supports optimal-time count queries. In their original article, Ferragina et al. [34] realized that this space could be further improved: since nodes are sorted in co-lexicographic order, they are also clustered by the paths connecting them to the root. Then, for a sufficiently small k ∈ O(log_σ n), we can partition OUT by the distinct paths of length k reaching the nodes, and use a different zero-order compressor for each class of the partition. This solution achieves high-order compressed space nH_k + 2n + o(n) bits.
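The bitvector representation of IN can be sketched as follows (plain Python lists stand in for compressed bitvectors, and the rank over the bits is a linear scan):

```python
def compress_IN(IN):
    """IN is sorted, so it is fully described by a bitvector marking the
    first occurrence of each distinct character, plus the (small) alphabet."""
    bits = [1 if i == 0 or IN[i] != IN[i - 1] else 0 for i in range(len(IN))]
    alphabet = sorted(set(IN))
    return bits, alphabet

def access_IN(bits, alphabet, i):
    """IN[i] is the (rank of 1s up to position i)-th alphabet character."""
    return alphabet[sum(bits[:i + 1]) - 1]

bits, alphabet = compress_IN("$aaabcc")   # IN of the toy trie used earlier
```

With a compressed bitvector supporting constant-time rank [70], `access_IN` runs in O(1) time within o(n) bits for small alphabets.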
Another method to compress the XBWT is to exploit repetitions of isomorphic subtrees. Alanko et al. [47] show how this can be achieved by a technique they call tunneling, which consists in collapsing isomorphic subtrees that are adjacent in co-lexicographic order. Tunneling works for a more general class of graphs (Wheeler graphs), so we discuss it in more detail in Section 3.7. Similarly, one can observe that a repeated topology will generate long runs of equal sets in OUT [48]. Let r be the number of such runs. Repetitive trees (including repetitive texts) satisfy r ≪ n. In the tree in Figure 7, we have n = 9 and r = 7. For an example of a more repetitive tree, see [48].

Further Generalizations
As we have understood, prefix sorting generalizes quite naturally to labeled trees. There is no reason to stop here: trees are just a particular graph topology. A few years after the successful XBWT was introduced, Mantaci et al. [71][72][73] showed that finite sets of circular strings (that is, finite collections of disjoint labeled cycles) enjoy the same prefix-sortability property: the nodes of these particular labeled graphs can be arranged by the strings labeling their incoming paths, thus speeding up subsequent substring searches on the graph. Seven years later, Bowe et al. [74] added one more topology to the family: de Bruijn graphs. A de Bruijn graph of order k for a string S (or for a collection of strings; the generalization is straightforward) has one node for each distinct substring of length k (a k-mer) appearing in S. Two nodes, representing k-mers s_1 and s_2, are connected if and only if they overlap by k − 1 characters (that is, the suffix of length k − 1 of s_1 equals the prefix of the same length of s_2) and their concatenation of length k + 1 is a substring of S. The topology of a de Bruijn graph can be quite complex; in fact, one could define such a graph starting from the k-mers read on the paths of an arbitrary labeled graph. For large enough k, the sets of strings read on the paths of the original graph and of the derived de Bruijn graph coincide. Sirén et al. [75] exploited this observation to design a tool to index pan-genomes (that is, genome collections represented as labeled graphs). Given an arbitrary input labeled graph, their GCSA (Generalized Compressed Suffix Array) builds a de Bruijn graph that is equivalent to (i.e. whose paths spell the same strings as) the input graph.
While this idea allows indexing arbitrary labeled graphs, in the worst case the conversion could generate an exponentially larger de Bruijn graph (even though the authors observe that in practice, due to the particular distribution of DNA mutations, this exponential explosion does not occur too frequently in bioinformatics applications). A second, more space-efficient version of their tool [76] fixes the maximum order k of the target de Bruijn graph, thus indexing only paths of length at most k of the input graph.
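As a concrete reference, here is a minimal sketch of the order-k de Bruijn graph of a single string, following the definition above (sets of k-mers and (k+1)-mers; no compression or indexing involved):

```python
def de_bruijn(s, k):
    """Order-k de Bruijn graph of s: one node per distinct k-mer, and an edge
    (x, y) iff x and y overlap by k-1 characters and x + y[-1] occurs in s."""
    kmers = {s[i:i + k] for i in range(len(s) - k + 1)}
    k1mers = {s[i:i + k + 1] for i in range(len(s) - k)}
    edges = {(x, y) for x in kmers for y in kmers
             if x[1:] == y[:-1] and x + y[-1] in k1mers}
    return kmers, edges

nodes, edges = de_bruijn("banana", 2)
```

Note how the string "banana" yields only three nodes despite having five 2-mer occurrences: distinct k-mer occurrences are merged, which is the source of both the compression and, on arbitrary graphs, the potential blow-up mentioned above.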
In this survey we do not go into the details of the above generalizations since, as we will see in the next subsection, they are all particular cases of a larger class of labeled graphs: Wheeler graphs. As a result, the search mechanism for Wheeler graphs automatically applies to those particular graph topologies. We redirect the curious reader to the seminal work of Gagie et al. [49] for more details on how these graphs, as well as several other combinatorial objects (the FM-index of an alignment [77,78], the positional BWT [79], wavelet matrices [80], and wavelet trees [81]), can be elegantly described in the Wheeler graph framework. In turn, in Section 3.8 we will see that Wheeler graphs are just the "base case" of a larger family of prefix-sortable graphs, ultimately encompassing all labeled graphs: p-sortable graphs.

Wheeler Graphs
In 2017 Gagie et al. [49] generalized the principle underlying the approaches described in the previous sections to all labeled graphs whose nodes can be sorted co-lexicographically in a total order. They called such objects Wheeler graphs in honor of David J. Wheeler, one of the two inventors of the ubiquitous Burrows-Wheeler transform [19] that today stands at the heart of the most successful compressors and compressed text indexes.
Recall that, for convenience, we treat the special (not too restrictive) case of labeled graphs corresponding to the state transition function of finite automata. Indexing finite automata allows us to generalize the ideas behind suffix arrays to (possibly infinite) sets of strings: all the strings read from the source to any of the automaton's final states. Said otherwise, an index for a finite automaton is capable of recognizing the substring closure of the underlying regular language. Consider the following regular language: L = (ε|aa)b(ab|b)* and consider the automaton recognizing L depicted in Figure 9. Let u be a state, and denote with I_u the (possibly infinite) set of all strings read from the source to u. Following the example reported in Figure 9, we have I_{q_1} = {a}, I_{q_2} = {b, bb, bab, babb, aab, . . . }, and I_{q_3} = {aa, aaba, aabba, . . . }. Intuitively, if the automaton is a DFA then we are going to sort its states in such a way that two states are placed in the order u < v if and only if all the strings in I_u are co-lexicographically smaller than all the strings in I_v (for NFAs the condition is slightly more involved, as those sets can have a nonempty intersection). Note that, at this point of the discussion, it is not yet clear that such an ordering always exists (in fact, we will see that it does not); however, from the previous subsections we know that particular graph topologies (in particular, paths, cycles, trees, and de Bruijn graphs) do admit a solution to this sorting problem. The legitimate question, tackled for the first time in Gagie et al.'s work, is: what is the largest graph family admitting such an ordering?
Let a, a′ be two characters labeling edges (u, u′) and (v, v′), respectively. We require the following three Wheeler properties [49] for our ordering ≤: (i) all states with in-degree zero come first in the ordering, (ii) if a < a′, then u′ < v′, and (iii) if a = a′ and u < v, then u′ ≤ v′.
It is not hard to see that (i)-(iii) generalize the familiar co-lexicographic order among the prefixes of a string to labeled graphs. In this generalized context, we deal with prefixes of the recognized language, i.e., with the strings labeling paths connecting the source state with any other state. Note also that rule (ii) implicitly requires that all labels entering a state must be equal. This is not a severe restriction: at the cost of increasing the number of states by a factor σ, any automaton can be transformed into an equivalent one with this property [64]. Figure 10 depicts the automaton of Figure 9 after having ordered (left-to-right) the states according to the above three Wheeler properties.
Figure 10. The automaton of Figure 9, sorted according to the three Wheeler properties.
An ordering ≤ satisfying Wheeler properties (i)-(iii) is called a Wheeler order. Note that a Wheeler order, when it exists, is always total (this will become important in Section 3.8). An automaton (or a labeled graph) is said to be Wheeler if its states admit at least one Wheeler order.
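The three properties are easy to check for a candidate ordering. The brute-force sketch below (quadratic in the number of edges; the edge triples (u, v, label) are our own encoding) returns whether a given state list is a Wheeler order:

```python
def is_wheeler_order(order, edges):
    """order: candidate list of states, smallest first; edges: (u, v, label)."""
    pos = {u: i for i, u in enumerate(order)}
    indeg = {u: 0 for u in order}
    for _, v, _ in edges:
        indeg[v] += 1
    z = sum(1 for u in order if indeg[u] == 0)
    # (i): the z states with in-degree zero must occupy the first z positions
    if any(indeg[order[i]] != 0 for i in range(z)):
        return False
    for (u, u2, a) in edges:
        for (v, v2, b) in edges:
            if a < b and not pos[u2] < pos[v2]:                        # (ii)
                return False
            if a == b and pos[u] < pos[v] and not pos[u2] <= pos[v2]:  # (iii)
                return False
    return True
```

For example, a path graph sorted along its edges passes the check, while placing the target of a 'b'-edge before the target of an 'a'-edge violates property (ii).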

Subpath Queries
When a Wheeler order exists, all the power of suffix arrays can be lifted to labeled graphs: (1) the nodes reached by a path labeled with a given pattern Π form a consecutive range in Wheeler order, and (2) such a range can be found in time linear in the pattern's length. In fact, the search algorithm generalizes those devised for the particular graphs discussed in the previous sections. Figure 11 depicts the process of searching for all nodes reached by a path labeled with pattern Π = "aba". The algorithm starts with the empty pattern, whose range is the full set of nodes, and right-extends it by one letter at a time, following the edges labeled with the corresponding character of Π from the current set of nodes. Note the following crucial property: at each step, the nodes reached by the current pattern form a consecutive range. This makes it possible to represent the range in constant space (by specifying just the first and last node in the range). Similarly to what we briefly discussed in Section 3.5 (Subpath Queries on the XBWT), by using appropriate compressed data structures supporting rank and select on strings, each character extension can be implemented efficiently (i.e. in log-logarithmic time). The resulting data structure can be stored in entropy-compressed space [49].
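The forward-search loop can be sketched as follows. For clarity this is a set-based simulation with our own state/edge encoding; a real index represents only the two endpoints of the range and advances them with rank/select queries in log-logarithmic time:

```python
def wheeler_search(order, edges, pattern):
    """Wheeler-order range (0-based, inclusive) of the states reached by
    `pattern`, or None if the pattern labels no path."""
    pos = {u: i for i, u in enumerate(order)}
    cur = set(order)                      # reached by the empty pattern
    for c in pattern:
        cur = {v for (u, v, a) in edges if a == c and u in cur}
        if not cur:
            return None
    # By the Wheeler properties the reached states are contiguous in `order`.
    return min(pos[u] for u in cur), max(pos[u] for u in cur)
```

On a small sorted example graph, each extension shrinks (or keeps) the current range, exactly as in the four panels of Figure 11.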
The algorithm we just described allows us to find the number of nodes reached by a given pattern, i.e. to solve count queries on the graph. As far as locate queries are concerned, Sirén et al. in [75] proposed a sampling scheme that returns the labels of all nodes reached by the query pattern, provided that the labeling scheme satisfies a particular monotonicity property (that is, the labels of a node are larger than those of its predecessors). In practical situations such as genome-graph indexing, such labels can be equal to an absolute position on a reference genome representing all the paths of the graph. In the worst case, however, this scheme requires storing Θ(n) labels of O(log n) bits each. As another option, the sampling strategy described by Arroyuelo et al. [69, Sec. 5.1] and Prezza [48] (see Section 3.5.4 for more details) can be directly applied to the DFS spanning forest of any Wheeler graph, thus yielding locate queries in polylogarithmic time and o(n) + O(c log n) bits of additional redundancy, c being the number of connected components of the graph.
Figure 11. Searching nodes reached by a path labeled "aba" in a Wheeler graph. Top left: we begin with the nodes reached by the empty string (full range). Top right: range obtained from the previous one following edges labeled 'a'. Bottom left: range obtained from the previous one following edges labeled 'b'. Bottom right: range obtained from the previous one following edges labeled 'a'. This last range contains all nodes reached by a path labeled "aba".

Compression
The nodes of a Wheeler graph are sorted (clustered) with respect to their incoming paths, so the outgoing labels of adjacent nodes are likely to be similar. This allows applying high-order compression to the labels of a Wheeler graph [49]. It is even possible to compress the graph's topology (together with the labels) if this is highly repetitive (i.e. the graph has large repeated isomorphic subgraphs). By generalizing the tunneling technique originally devised by Baier [82] for the Burrows-Wheeler transform, Alanko et al. [47] showed that isomorphic subtrees adjacent in co-lexicographic order can be collapsed while maintaining the Wheeler properties. In the same paper, they showed that a tunneled Wheeler graph can even be indexed to support path existence queries (a relaxation of counting queries where we can only discover whether or not a query pattern labels some path in the graph). Similarly, one can observe that a repetitive graph topology will generate long runs of equal sets of outgoing labels in Wheeler order. This allows applying run-length compression to the Wheeler graph [48]. Run-length compression of a Wheeler graph is equivalent to collapsing isomorphic subgraphs that are adjacent in Wheeler order, and thus shares many similarities with the tunneling technique. A weaker form of run-length compression of a Wheeler graph has been considered by Bentley et al. in [83]. In this work, they turn a Wheeler graph into a string by concatenating the outgoing labels of the sorted nodes, permuted so as to minimize the number of runs of the resulting string. They show that the (decision version of the) problem of finding the ordering of the Wheeler graph's sources that minimizes the number of runs is NP-complete. In the same work, they show that also the problem of finding the alphabet ordering minimizing the number of BWT runs is NP-complete.

Sorting and Recognizing Wheeler Graphs
Perhaps not surprisingly, not all labeled graphs admit a Wheeler order of their nodes. Intuitively, this happens because of conflicting predecessors: if the predecessors of u and v include a pair u′, v′ with u′ < v′ as well as a pair u″, v″ with v″ < u″, then a Wheeler order cannot exist. As an example, consider any automaton recognizing the language L′ = (ax*b)|(cx*d) (original example by Gagie et al. [49]). Since any string beginning with letter 'a' must necessarily end with letter 'b' and, similarly, any string beginning with letter 'c' must necessarily end with letter 'd', the paths spelling ax^k b and cx^{k′} d must be disjoint for all k, k′ ≥ 0. Said otherwise, the nodes reached by label 'x' and leading to 'b' must be disjoint from those reached by label 'x' and leading to 'd'. Denote with u_α a state reached by string α (read from the source). Then, it must be u_{ax} < u_{cx}: from the above observation, those two states must be distinct and, from the Wheeler properties, they must be in this precise order. For the same reason, it must be the case that u_{cx} < u_{axx} < u_{cxx} < u_{axxx} < . . . . This is an infinite sequence of distinct states: no finite Wheeler automaton can recognize L′.
This motivates the following natural questions: given an automaton A, is it Wheeler? If so, can we efficiently find a corresponding Wheeler order? It turns out that these are hard problems. Gibney and Thankachan showed in [84] that deciding whether an arbitrary automaton admits a Wheeler order is NP-complete. This holds even when each state is allowed to have at most five outgoing edges labeled with the same character (which somehow bounds the amount of nondeterminism). On the positive side, Alanko et al. [64] showed that both problems (recognition and sorting) can be solved in quadratic time when each state is allowed to have at most two outgoing edges labeled with the same character. This includes DFAs, for which, however, a more efficient linear-time solution exists [64]. The complexity of the cases in between (at most three or four equally-labeled outgoing edges) is still open [85].

Wheeler Languages
Since we are talking about finite automata, one question should arise naturally: which languages are recognized by Wheeler NFAs? Let us call this class Wheeler languages. First of all, Wheeler languages are clearly regular since, by definition, they are accepted by finite state automata. Moreover, all finite languages are Wheeler, because they can be recognized by a tree-shaped automaton, which (as seen in Section 3.5) is always prefix-sortable. Additionally, as observed by Gagie et al. [49], not all regular languages are Wheeler: (ax*b)|(cx*d) is an example. Interestingly, Wheeler languages can be defined both in terms of DFAs and NFAs: Alanko et al. proved in [64] that Wheeler NFAs and Wheeler DFAs recognize the same class of languages. Another powerful language-theoretic result that can be transferred from regular to Wheeler languages is their neat algebraic characterization based on Myhill-Nerode equivalence classes [65]. We recall that two strings α and β are Myhill-Nerode equivalent with respect to a given regular language L if and only if, for any string γ, we have that αγ ∈ L ⇔ βγ ∈ L. The Myhill-Nerode theorem states that L is regular if and only if the Myhill-Nerode equivalence relation has finite index (i.e. it has a finite number of equivalence classes). In the Wheeler case, the Myhill-Nerode equivalence relation is slightly modified by requiring that equivalence classes of prefixes of the language are also intervals in co-lexicographic order and contain words ending with the same letter. After this modification, the Myhill-Nerode theorem can be transferred to Wheeler languages: L is Wheeler if and only if the modified Myhill-Nerode equivalence relation has finite index [64,86]. See Alanko et al. [86] for a comprehensive study of the elegant properties of Wheeler languages (including closure properties and complexity of the recognition problem).
From the algorithmic point of view (which is the most relevant to this survey), these results make it possible to define and efficiently build the minimum Wheeler DFA (that is, the one with the smallest number of states) recognizing the same language as a given input Wheeler DFA, thus optimizing the space of the resulting index [64].
After this introduction to Wheeler languages, let us consider the question: can we index Wheeler languages efficiently? Interestingly, Gibney and Thankachan's NP-completeness proof [84] requires that the automaton's topology is fixed, so it does not rule out the possibility of indexing, in polynomial time, an equivalent automaton. After all, in many situations we are actually interested in indexing a language rather than a fixed graph topology. Surprisingly, the answer to the above question is yes: it appears to be easier to index languages, rather than automata. Since recognizing and sorting Wheeler DFAs is an easy problem, a first idea could be to turn the input NFA into an equivalent DFA. While it is well known that in the worst case this conversion (via the powerset algorithm) could result in an exponential blow-up of the number of states, Alanko et al. [64] proved that a Wheeler NFA always admits an equivalent Wheeler DFA of linear size (the number of states doubles at most). This has an interesting consequence: if the input NFA is Wheeler, then we can index its language in polynomial time (in the size of the input NFA). We can actually do better: if the input NFA A is Wheeler, then in polynomial time we can build a Wheeler NFA A′ that (i) is never larger than A, (ii) recognizes the same language as A, and (iii) can be sorted in polynomial time [86]. We remark that there is a subtle reason why the above two procedures for indexing Wheeler NFAs do not break the NP-completeness of the problem of recognizing this class of graphs: it is possible that they generate a Wheeler NFA even if the input is not a Wheeler NFA (thus they cannot be used to solve the recognition problem). To conclude, these strategies can be used to index Wheeler NFAs, but do not tell us anything about indexing Wheeler languages represented as a general (possibly non-Wheeler) NFA. An interesting case is represented by finite languages (always Wheeler) represented as acyclic NFAs.
The lower bounds discussed in Section 3.2 tell us that a quadratic running time is unavoidable for the graph indexing problem, even in the acyclic case. This implies, in particular, that the conversion from arbitrary acyclic NFAs to Wheeler NFAs must incur a quadratic blow-up in the worst case. The situation is actually much worse: Alanko et al. showed in [64] that the blow-up is exponential in the worst case. In the same paper, they provided a fast algorithm to convert any acyclic DFA into the smallest equivalent Wheeler DFA.

p-sortable Automata
Despite its power, the Wheeler graph framework does not allow us to index and compress arbitrary labeled graphs: it is easy to come up with arbitrarily large Wheeler graphs that lose their Wheeler property after the addition of just one edge. Does this mean that such an augmented graph cannot be indexed efficiently? Clearly not, since it would be sufficient to add a small "patch" (keeping track of the extra edge) to the index of the underlying Wheeler subgraph in order to index it. As another "unsatisfying" example, consider the union L = L_1 ∪ L_2 of two Wheeler languages L_1 and L_2. In general, L is not a Wheeler language [86]. However, L can be easily indexed by just keeping two indexes (of two automata recognizing L_1 and L_2), with no asymptotic slowdown in query times! The latter example is on the right path to a solution of the problem. Take two automata A_1 and A_2 recognizing two Wheeler languages L_1 and L_2, respectively, such that L = L_1 ∪ L_2 is not Wheeler. The union automaton A_1 ∪ A_2 (obtained by simply merging the start states of A_1 and A_2) is a nondeterministic automaton recognizing L. Taken individually, the states of A_1 and A_2 can be sorted in two total orders. However, taken as a whole, the two sets of states do not admit a total co-lexicographic order (which would imply that L is Wheeler). The solution to this riddle is to abandon total orders in favor of partial orders [87]. A partial order ≤ on a set V (in our case, the set of the automaton's states) is a reflexive, antisymmetric and transitive relation on V. In a partial order, two elements u, v either are comparable, in which case u ≤ v or v ≤ u hold (both hold only if u = v), or are not comparable, in which case neither u ≤ v nor v ≤ u hold. The latter case is indicated as u ∥ v. Our co-lexicographic partial order is defined as follows. Let a, a′ be two characters labeling edges (u, u′) and (v, v′), respectively.
We require the following properties: (i) all states with in-degree zero come first in the ordering, (ii) if a < a′, then u′ < v′, and (iii) if a = a′ and u′ < v′, then u ≤ v.
Note that, differently from the definition of Wheeler order (Section 3.7), the implication of (iii) follows the edges backwards. As it turns out, ≤ is indeed a partial order and, as we show in the next subsections, it allows generalizing the useful properties of Wheeler graphs to arbitrary topologies. A convenient representation of any partial order is its Hasse diagram: a directed acyclic graph whose nodes are the elements of the order, drawn from the smallest (bottom) to the largest (top), with an edge (u, v) if and only if u < v holds and no third element z satisfies u < z < v. See Figure 12 for an example. The lower bounds discussed in Section 3.2 tell us that indexed pattern matching on graphs cannot be solved faster than quadratic time in the worst case. In particular, this means that our generalization of Wheeler graphs cannot yield an index answering subpath queries in linear time. In fact, there is a catch: the extent to which indexing and compression can be performed efficiently is proportional to how similar the partial order ≤ is to a total one. As it turns out, the correct measure of similarity to consider is the order's width: the minimum number p of totally-ordered chains (subsets of states) into which the set of states can be partitioned. Figure 12 makes it clear that the states of the automaton can be divided into p = 2 totally-ordered subsets. This choice is not necessarily unique (look at the Hasse diagram); one possible partition is s < q1 < q3 (in yellow) and q2 (in red). We call p-sortable the class of automata for which there exists a chain partition of size p. Since the states of any automaton can always be partitioned into n chains, this definition captures all automata.
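By Dilworth's theorem, the width of a partial order equals n minus the size of a maximum matching in the bipartite graph induced by the comparability relation, so a minimum chain partition can be computed via augmenting paths. The following is a minimal sketch of this idea (not the O(e^2 + n^(5/2))-time algorithm of [87]); the four-element order mirrors the running example, with s < q1 < q3 and q2 incomparable to the rest.

```python
# Minimal sketch: width of a partial order (minimum number of chains)
# via Dilworth's theorem: width = n - |maximum bipartite matching|,
# where the bipartite graph pairs each element u with elements v
# strictly above it (u < v). Toy example: s < q1 < q3, q2 incomparable.

def order_width(elems, less):
    # less is a set of pairs (u, v) with u strictly below v.
    match_r = {}  # right-side element -> matched left-side element

    def augment(u, seen):
        # Try to extend the matching with an augmenting path from u.
        for v in elems:
            if (u, v) in less and v not in seen:
                seen.add(v)
                if v not in match_r or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False

    matching = sum(augment(u, set()) for u in elems)
    return len(elems) - matching  # Dilworth: n minus max matching

order = {("s", "q1"), ("s", "q3"), ("q1", "q3")}
p = order_width(["s", "q1", "q2", "q3"], order)  # p = 2
```

On this example the matching pairs s with q1 and q1 with q3, yielding the two chains {s, q1, q3} and {q2}, i.e. p = 2 as in the text.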
The parameter p seems to capture some deep regularity of finite automata: in addition to determining the compressibility of their topology (read the next subsections), it also determines their inherent determinism: a p-sortable NFA with n states always admits an equivalent DFA (which can be obtained via the standard powerset algorithm) with at most 2^p (n − p + 1) − 1 states. This is a fine-grained analysis refining the folklore (tight) bound of 2^n [88] and has deep implications for several algorithms on automata. For example, it implies that the PSPACE-complete NFA equivalence problem is fixed-parameter tractable with respect to p [87]. To conclude, we mention that finding the smallest p for a given labeled graph is NP-complete, though the problem admits an O(e^2 + n^(5/2))-time algorithm for the deterministic case [87].
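The two extremes of the determinization bound are instructive: for p = 1 (Wheeler NFAs) the powerset construction produces only a linear number of states, while for p = n the bound degrades to the classical exponential one. A short sketch, assuming nothing beyond the formula itself:

```python
# Sketch: the fine-grained determinization bound from the survey,
# 2^p * (n - p + 1) - 1 states for an equivalent DFA of a p-sortable
# NFA with n states, evaluated at its two extremes.

def dfa_state_bound(n, p):
    return 2**p * (n - p + 1) - 1

for n in (5, 10, 20):
    # p = 1 (Wheeler NFAs): determinization is linear, 2n - 1 states.
    assert dfa_state_bound(n, 1) == 2 * n - 1
    # p = n (arbitrary NFAs): the bound matches the folklore 2^n - 1.
    assert dfa_state_bound(n, n) == 2**n - 1
```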

Subpath Queries
Subpath queries can be answered on p-sortable automata by generalizing the forward search algorithm of Wheeler graphs. In fact, the following property holds: for any pattern Π, the states reached by a path labeled with Π always form one convex set in the partial order. In turn, any convex set can be expressed as p intervals, one contained in each class of the chain partition [87] (this holds for any chain partition). The generalized forward search algorithm works as follows: start with the full interval on each of the p chains. For each character a of the pattern, follow the edges labeled with a that depart from the current (at most) p intervals. By the above property, the target nodes will still be contained in (at most) p intervals on the chains. Consider the example of Figure 12. In order to find all nodes reached by pattern 'aa', the generalized forward search algorithm first finds the nodes reached by 'a': those are nodes q1 (a range on the yellow chain) and q2 (a range on the red chain). Then, the algorithm follows all edges labeled 'a' leaving those ranges, thereby finding all nodes reached by 'aa': nodes (again) q1 and q2 which, as seen above, form one range on the yellow chain and one on the red chain. See [87] for a slightly more involved example. Using up-to-date data structures, the search algorithm can be implemented to run in O(|Π| · log(σp) · p^2) steps [87]. In the worst case (p = n), this essentially matches the lower bound of Equi et al. [51][52][53] on dense graphs (see Section 3.2). On the other hand, for small values of p this running time can be significantly smaller than the lower bound: in the extreme case p = 1, we obtain the familiar Wheeler graphs (supporting optimal-time subpath queries).
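The steps above can be sketched in a few lines. Since Figure 12 is not reproduced here, the edge set below is hypothetical, chosen only so that both 'a' and 'aa' reach {q1, q2} as described in the text; a real index would maintain the p intervals with succinct rank/select structures rather than explicit sets.

```python
# Minimal sketch of generalized forward search on a p-sortable
# automaton (p = 2). Chains follow the survey's running example:
# {s, q1, q3} (yellow) and {q2} (red). The edges are a hypothetical
# set consistent with the behavior described in the text.

chains = [["s", "q1", "q3"], ["q2"]]        # totally ordered chains
edges = {("s", "a"): {"q1", "q2"},          # (source, label) -> targets
         ("q1", "a"): {"q2"},
         ("q2", "a"): {"q1"}}

def forward_search(pattern):
    # The current match is one half-open interval [lo, hi) per chain;
    # we start from the full interval on each of the p chains.
    intervals = [(0, len(c)) for c in chains]
    for ch in pattern:
        reached = set()
        for (lo, hi), chain in zip(intervals, chains):
            for state in chain[lo:hi]:
                reached |= edges.get((state, ch), set())
        # Re-project the reached states onto each chain; convexity
        # guarantees they are contiguous within every chain.
        intervals = []
        for chain in chains:
            idx = [i for i, s in enumerate(chain) if s in reached]
            intervals.append((idx[0], idx[-1] + 1) if idx else (0, 0))
    return intervals

# Both 'a' and 'aa' end up at q1 (yellow chain) and q2 (red chain).
```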
As far as locate queries are concerned, the sampling strategy described by Arroyuelo et al. [69, Sec. 5.1] and Prezza [48] (see Section 3.5.4 for more details) can be directly applied to the DFS spanning forest of any p-sortable graph (similarly to Wheeler graphs), thus yielding locate queries in polylogarithmic time with o(n) + O(c log n) bits of additional redundancy, c being the number of connected components of the graph. This is possible because the index for p-sortable graphs described in [87] also supports navigation primitives, and can thus be used to navigate the DFS spanning forest of the graph.

Compression
The Burrows-Wheeler transform [19] (see also Section 3.5) can be generalized to p-sortable automata. The main idea is to (i) partition the states into p totally-sorted chains (for example, in Figure 12 one such chain decomposition is {s, q1, q3}, {q2}), (ii) order the states by "gluing" the chains in any order (for example, in Figure 12 one possible such ordering is s, q1, q3, q2), (iii) build the adjacency matrix of the graph using this state ordering and (iv) for each edge in the matrix, store only its label and its endpoint chains (that is, two numbers between 1 and p indicating which chains the edge's two endpoints belong to). It can be shown that this matrix can be linearized in an invertible representation taking log σ + 2 log p + 2 + o(1) bits per edge. On DFAs, the representation takes less space: log σ + log p + 2 + o(1) bits per edge [87]. This is already a compressed representation of the graph, since "sortable" graphs (i.e. those having small p) are compressed to few bits per edge, well below the information-theoretic worst-case lower bound when p ≪ n. Furthermore, within each chain the states are sorted by their incoming paths. It follows that, as seen in Sections 3.5 and 3.7, high-order entropy compression can be applied within each chain.
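To get a feel for these space bounds, the following sketch evaluates the per-edge costs above (ignoring the o(1) terms) for a hypothetical DNA-labeled graph with a small width; the Θ(log n) bits that an explicit adjacency list would need just to name one edge endpoint serve as the point of comparison.

```python
# Sketch: per-edge space of the generalized BWT representation,
# log σ + 2 log p + 2 bits for NFAs and log σ + log p + 2 for DFAs
# (o(1) terms ignored), versus the log n bits needed to name a
# single endpoint in a pointer-based representation.

from math import log2

def nfa_bits_per_edge(sigma, p):
    return log2(sigma) + 2 * log2(p) + 2

def dfa_bits_per_edge(sigma, p):
    return log2(sigma) + log2(p) + 2

# Hypothetical instance: DNA alphabet, a very sortable graph.
sigma, p, n = 4, 4, 10**6
nfa_cost = nfa_bits_per_edge(sigma, p)   # 8 bits per edge
dfa_cost = dfa_bits_per_edge(sigma, p)   # 6 bits per edge
pointer_cost = log2(n)                   # ~19.9 bits per endpoint
```

Since the costs depend on p and σ but not on n, the advantage over pointer-based representations grows with the graph size, as long as p remains small.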

Conclusions and Future Challenges
In this survey, we tried to convey the core ideas developed to date in the field of compressed graph indexing, in the hope of introducing (and attracting) the non-expert reader to this exciting and active area of research. As the reader will have gathered, the field is in rapid expansion and does not lack exciting challenges. On the lower-bound side, one possible improvement would be to provide fine-grained lower bounds for the graph indexing problem, e.g. as a function of the parameter p introduced in Section 3.8. As far as graph compression is concerned, it is plausible that new powerful graph compression schemes will emerge from developments in the field: an extension of the run-length Burrows-Wheeler transform to labeled graphs, for example, could compete with existing grammar-based graph compressors while also supporting efficient path queries. Efficient index construction is also a considerable challenge. As seen in Section 3.7, sorting Wheeler graphs is efficient only in the deterministic case. Even when nondeterminism is limited to just two equally-labeled edges leaving a node, the fastest known sorting algorithm has quadratic complexity. Even worse, in the general (nondeterministic) case the problem is NP-complete. The situation does not improve with the generalization considered in Section 3.8: finding the minimum width p of a deterministic graph takes super-quadratic time with current solutions, and the problem becomes NP-complete on arbitrary graphs. Clearly, practical algorithms for graph indexing will have to somehow sidestep these issues. One possible workaround is approximation: we note that, even if the optimal p cannot be found efficiently (recall that p = 1 for Wheeler graphs), approximating it (for example, up to a poly(n) factor) would still guarantee fast queries. Further challenges include generalizing the labeled graph indexing problem to the alignment of regular expressions against graphs. This could be achieved, for example, by indexing both the graph and an NFA recognizing the query (regular expression) pattern.