Direct Superbubble Detection

Superbubbles are a class of induced subgraphs in digraphs that play an essential role in assembly algorithms for high-throughput sequencing data. They are connected with the remainder of the host digraph by a single entrance and a single exit vertex. Linear-time algorithms for the enumeration of superbubbles have recently become available. Current approaches require the decomposition of the input digraph into strongly-connected components, which are then analyzed separately. In principle, a single depth-first search (DFS) could be used, provided one can guarantee that the root of the DFS-tree is neither located in the interior of a superbubble nor the exit point of one. Here, we describe a linear-time algorithm to determine suitable roots for a DFS-forest that is guaranteed to identify the superbubbles in a digraph correctly. In addition to the advantages of a more straightforward implementation, we observe a nearly three-fold gain in performance on real-world datasets. We present a reference implementation of the new algorithm that accepts many commonly-used input formats for digraphs. It is available as open source from GitHub.


Introduction
Bubble structures in a digraph have become the focus of an increasing body of research because of their role in genome assembly and related topics; see, e.g., [1] and the references therein. Onodera et al. [2] proposed superbubbles as an important class of subgraphs in the de Bruijn and overlap digraphs arising in the context of the assembly of high-throughput sequencing data [3,4]. The algorithm identifying all superbubbles in a digraph G with vertex set V and edge set E had a running time of O(|V|(|V| + |E|)) [2]. An improvement to O(|E| log |E|) was described in [5]. A linear-time algorithm for an acyclic subgraph, together with the construction of auxiliary digraphs along the lines of [5], provided a solution in O(|E| + |V|), i.e., linear, overall time [6]. An alternative linear-time algorithm [7] achieves a substantial speedup and does not require sophisticated data structures. All these approaches rely on the decomposition of the input digraph into strongly-connected components.
For the preorder ρ and the postorder π of an ordered rooted tree with ancestor order ≺, we have:
ρ(x) < ρ(y) iff x precedes y in the sibling order or y ≺ x,
π(x) < π(y) iff x precedes y in the sibling order or x ≺ y. (1)
It follows immediately that preorder and postorder together determine the ancestor and sibling order:
x precedes y in the sibling order iff ρ(x) < ρ(y) and π(x) < π(y),
x ≺ y iff ρ(x) > ρ(y) and π(x) < π(y). (2)
Let G be a digraph. For every vertex r ∈ V(G), denote by V[r] ⊆ V(G) the subset of vertices that are reachable from r, i.e., x ∈ V[r] if and only if there is a directed path from r to x. These paths can be chosen such that every x ∈ V[r] is reachable from r along a unique path, and hence, there is an oriented tree T with V(T) = V[r] that is a subgraph of G. An oriented tree T with root r is a search tree on G if there is no directed edge (x, y) ∈ E(G) with x ∈ V(T) and y ∉ V(T). An ordered tree T is a search tree if and only if V(T) = V[r], because a vertex y ∈ V(G) \ V(T) by definition cannot be reached from anywhere in V(T), and thus also not from the root, while every y ∈ V(T) is by definition reachable from the root r.
Depth-first search (DFS) traverses a digraph G in the following manner: (i) pick a root r ∈ V(G); (ii) recursively, at v ∈ V(G), proceed to the smallest previously unvisited out-neighbor of v; (iii) if v has no more unvisited out-neighbors, return to its "parent", i.e., the vertex par(v) from which v was initially reached [8]. Clearly, DFS generates a rooted tree T with directed edges (par(v), v), which is known as the DFS-tree.
Lemma 1. Let T be the ordered subtree generated by DFS on a digraph G, and let (u, v) ∈ E(G) with u ∈ V(T). Then, v ∈ V(T) and either v ≺ u (including (u, v) ∈ E(T)), u ≺ v, or v precedes u in the sibling order. In particular, T is a search tree on G.
Proof. Consider a DFS reaching u. The search steps up to par(u) only after exhausting all out-neighbors of u; hence, any edge (u, v) either has been visited before by the DFS process or, otherwise, it is included as an edge as DFS steps down to the subtree of u rooted in v. If v has been accessed before, then v is either an ancestor or descendant of u, or u and v are incomparable w.r.t. ≺. In the latter case, there are distinct children x and y of lca(u, v) such that u ∈ T(x) and v ∈ T(y). In a DFS, T(y) is traversed before T(x) if y comes before x in the out-neighbor order of lca(u, v), and thus, v precedes u in the sibling order.
By the construction of DFS, every x ∈ V(T) is reachable from the root r along a path in G; hence, V(T) ⊆ V[r]. Suppose there is x ∈ V[r] \ V(T). Along a path p from r to x, let x′ be the first vertex not contained in V(T), i.e., there is an edge (u, x′) ∈ E(G) with u ∈ V(T) and x′ ∉ V(T), contradicting the first assertion of the lemma.
The DFS process proceeds on V[r] in such a way that the preorder ρ of the DFS-tree T rooted at r records the order in which the vertices are discovered, while the postorder π describes the order in which vertices are completed, i.e., "left", by ascending back to their parent. To see this, denote by ρ′ and π′ the order in which vertices are discovered and completed by DFS started at r. By construction, DFS accesses the out-neighbors of v in the order of the children of v and completes the traversal of a subtree rooted at a child v′ of v before proceeding to the subtree of another child. Thus, if u and v are incomparable w.r.t. ≺ in T, then ρ′(u) < ρ′(v) and π′(u) < π′(v) if and only if u precedes v in the sibling order. It also follows directly from the definition of DFS that we have ρ′(u) < ρ′(v) if v ≺ u and π′(u) < π′(v) if u ≺ v. Hence, ρ′ and π′ indeed coincide with the preorder ρ and the postorder π for the traversal of the DFS-tree T. DFS on a graph G is therefore completely described by the oriented DFS-tree T, i.e., the sibling and ancestor order on V[r], and coincides with DFS on T itself.
Hence, the condition that v has been accessed before u can be expressed simply as ρ(v) < ρ(u). If u and v are comparable on T, their relative order is determined by Equation (2). We therefore obtain the following simple characterization of DFS-trees:

Corollary 1. A search tree T with preorder ρ on G is a DFS-tree if and only if every edge (u, v) ∈ E(G[V(T)]) either (i) is a tree edge, (ii) connects two non-adjacent comparable vertices in T, or (iii) satisfies ρ(v) < ρ(u) whenever u and v are incomparable w.r.t. ≺ in T.
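The relation between DFS, preorder, and postorder discussed above can be illustrated with a short sketch. It is not taken from the reference implementation; the adjacency-list representation (a dict mapping each vertex to an ordered list of out-neighbors) and the function names `dfs_orders`, `is_proper_descendant`, and `precedes_as_sibling` are assumptions made for this example. The two predicates apply Equation (2) directly.

```python
def dfs_orders(graph, root):
    """Return (pre, post, parent) for the DFS-tree rooted at `root`.

    `graph` maps each vertex to an ordered list of out-neighbors; the list
    order plays the role of the out-neighbor order used by the DFS.
    """
    pre, post, parent = {root: 0}, {}, {root: None}
    stack = [(root, iter(graph.get(root, ())))]
    while stack:
        v, neighbors = stack[-1]
        for w in neighbors:
            if w not in pre:              # first visit: w becomes a child of v
                pre[w] = len(pre)
                parent[w] = v
                stack.append((w, iter(graph.get(w, ()))))
                break
        else:                             # all out-neighbors exhausted: v is completed
            post[v] = len(post)
            stack.pop()
    return pre, post, parent


def is_proper_descendant(x, y, pre, post):
    """True if x ≺ y, i.e., y is a proper ancestor of x in the DFS-tree."""
    return pre[x] > pre[y] and post[x] < post[y]


def precedes_as_sibling(x, y, pre, post):
    """True if x and y are incomparable and x comes first in the sibling order."""
    return pre[x] < pre[y] and post[x] < post[y]
```

Only vertices in V[r] receive a preorder and postorder index, so the keys of `pre` also give the reachable set of the chosen root.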

Weak Superbubbloids
Superbubbles [2] are a complex generalization of "bubbles". Comprising two or more isolated paths connecting a source s to a target t, bubbles are the simplest obstacle in sequence assembly problems [9]. We use here the terminology of [7]:
Definition 1. Let G be a digraph, and let (s, t) be an ordered pair of distinct vertices. Denote by U_st the set of vertices reachable from s without passing through t, and write U⁺_ts for the set of vertices from which t is reachable without passing through s. Then, the subgraph G[U_st] induced by U_st is a superbubbloid in G if the following three conditions are satisfied: (S1) t ∈ U_st, i.e., t is reachable from s (reachability condition).
We call s, t, and U_st \ {s, t} the entrance, exit, and interior of the superbubbloid. We denote the induced subgraph G[U_st] by ⟨s, t⟩ if it is a superbubbloid with entrance s and exit t.
The reachability and matching conditions can equivalently be expressed in the following form, which usually is more convenient to use:
Let G be a digraph, U ⊂ V(G), and s, t ∈ U. Then, U equals the set U_st of Definition 1 and satisfies (S1) and (S2) if and only if the following conditions (S.i)-(S.iv) are satisfied. Moreover, U forms a superbubbloid with entrance s and exit t if and only if (S.i)-(S.vi) are satisfied:
(S.i) Every u ∈ U is reachable from s.
(S.ii) t is reachable from every u ∈ U.
(S.iii) If u ∈ U and w ∉ U, then every w → u path contains s.
(S.iv) If u ∈ U and w ∉ U, then every u → w path contains t.
If only (S.i)-(S.v) hold, then ⟨s, t⟩ is a weak superbubbloid. A (weak) superbubble is a (weak) superbubbloid that is minimal in the following sense:
Weak superbubbles differ from superbubbles only by (S.vi), which can be checked in constant time for each candidate (weak) superbubble. The effort to recognize superbubbles and weak superbubbles is therefore essentially the same.
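The two vertex sets of Definition 1 can be computed by a restricted breadth-first search, as in the following sketch. The helper names `reverse_graph`, `reach_avoiding`, and `candidate_sets` are hypothetical, and the graph is assumed to be a dict of out-neighbor lists in which every vertex appears as a key. Comparing the two returned sets is presumably what the matching condition amounts to, but the exact statements of (S2) and (S3) are not reproduced in this section, so that comparison should be read as an assumption.

```python
from collections import deque


def reverse_graph(graph):
    """Return the digraph with every edge reversed."""
    rev = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    return rev


def reach_avoiding(graph, start, blocked):
    """Vertices reachable from `start` without passing through `blocked`.

    `blocked` itself is included when it is reached, but the search never
    continues beyond it.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if u == blocked:
            continue
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def candidate_sets(graph, s, t):
    """Return (U_st, U+_ts) for the ordered pair (s, t)."""
    u_st = reach_avoiding(graph, s, t)
    u_plus_ts = reach_avoiding(reverse_graph(graph), t, s)
    return u_st, u_plus_ts
```

For a simple bubble with two parallel paths from s to t, both sets coincide; an additional edge entering the interior from outside enlarges only the second set.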
The following observation, which summarizes and slightly generalizes our previous analysis [7], forms the basis of the present contribution. As in previous work on the topic [5][6][7], DFS-trees are a key ingredient.
Lemma 3. Let G be a digraph and U_st the vertex set of a weak superbubbloid ⟨s, t⟩ in G, and suppose r is not an interior vertex or the exit of ⟨s, t⟩. Then, either U_st ⊆ V[r] or U_st ∩ V[r] = ∅.
Proof. Every digraph can be decomposed into strongly-connected components and acyclic components. If x ∈ V[r], then every vertex reachable from x is also contained in V[r]. Thus, in particular, every strongly-connected component of G is either contained in V[r] or disjoint from V[r]. Sung's theorem ([5] and [7] (Thm. 1)) ensures that every superbubbloid is either contained in a strongly-connected component C or an acyclic component A of G. Now, suppose V[r] ∩ U_st ≠ ∅, and let x ∈ U_st be the first vertex of the DFS in ⟨s, t⟩. By definition (of weak superbubbloids), x = s, since no other vertex in U_st is reachable from outside U_st, and the DFS, by assumption, does not start at an interior vertex or the exit of ⟨s, t⟩. The reachability condition (S.i) ensures that every u ∈ U_st is reached by the DFS whenever s ∈ V[r], i.e., U_st ⊆ V[r].
Lemma 3 is a variant of the key theorem of [5].
Corollary 2. Let G = (V, E) be a digraph and U_st the vertex set of a weak superbubbloid ⟨s, t⟩ in G. Let r_1, r_2, ..., r_k ∈ V be such that none of the r_i is an interior or an exit vertex of ⟨s, t⟩. Set
Proof. By Lemma 3, U_st is either contained in the intersection of two or more reachable sets V[r_i] or is disjoint from it. As an immediate consequence, it is also either contained in the difference of two reachable sets or disjoint from it.

Lemma 4.
Let G be a digraph; let U_st be the vertex set of a weak superbubbloid ⟨s, t⟩ in G; let T be a DFS-tree on G with root r ∉ U_st \ {s}; and let π be the postorder w.r.t. T. Then:
Proof. (i) The statement is trivial if ⟨s, t⟩ is not contained in T. If ⟨s, t⟩ resides in an acyclic part A of G, there are no back edges by the acyclicity of A. If ⟨s, t⟩ is contained in a strongly-connected component C, the proof of Lemma 9 of [7] also implies Assertion (i), because the DFS-tree T, in particular, contains a DFS-tree of C as a subtree, and back edges in G can only be located within a strongly-connected component.
(ii) Since the DFS generating T enters ⟨s, t⟩ through s and leaves it through t, the preorder ρ satisfies ρ(s) < ρ(t). Since t is reachable from every u ∈ U_st, we conclude that any DFS reaches t before completing any u ∈ U_st; hence, t precedes any other u ∈ U_st in postorder, i.e., π(u) > π(t). Since u ∈ U_st is not reachable without passing through s, every other vertex u ∈ U_st precedes s in postorder, i.e., π(s) > π(u). Now, suppose there is some w ∉ U_st with π(s) > π(w) > π(t). Then, w must be reachable from s along a directed path that does not pass through t, a contradiction to the definition of weak superbubbloids. Hence, the vertices of a superbubbloid form an interval in the postorder of the DFS-tree T.
Statement (ii) rephrases the key result of [6], although we do not need to assume that G is an acyclic digraph. Conceptually, Lemma 4 suggests that it might not be necessary to first identify the strongly-connected components of G [5] or to construct acyclic auxiliary digraphs [6] in order to find all weak superbubbles. Lemma 2 then ensures that a single DFS-forest is sufficient.

Superbubble Detection
We next show how to retrieve all weak superbubbles of a digraph G that are located within the induced subgraph G[V[r]] of G. To this end, we use a slightly modified version of the algorithm DAGsuperbubble described in [7]. It was originally designed to operate on acyclic auxiliary graphs with a single source. Thus, it could be assumed that a DFS-tree rooted on this source reached all vertices. Here, we intend to apply it to the unmodified input graph, which is neither acyclic nor guaranteed to have a single source. It therefore needs to be modified to deal appropriately with back edges within the DFS-tree and the existence of vertices outside the DFS-tree. To this end, vertices in V[r] that cannot be contained in a superbubble have to be identified. By Lemma 4, there are two possible obstructions for a vertex u: (i) u has an edge that is a back edge in the DFS-tree; (ii) u is incident to an edge (x, u) or (u, x) where x ∉ V[r]. The basic idea of DAGsuperbubble is to identify minimal intervals in the reverse postorder π̄(v) := |V[r]| − π(v) − 1 of the DFS-tree T that satisfy conditions equivalent to membership in a superbubbloid. These conditions are expressed in terms of a pair of helper functions defined with the help of the reverse postorder π̄ on T. As in [7], OutParent(v) denotes the first vertex (w.r.t. the reverse postorder π̄) in T from which v can be reached. Similarly, OutChild(v) is the last child vertex reachable from v.
These functions are extended to intervals on π̄ as follows:
In [7], we derived a characterization of weak superbubbloids in terms of OutParent([i, j]) and OutChild([i, j]) for the case of acyclic digraphs. Here, we generalize this condition to general digraphs using the modified definitions of OutParent(v) and OutChild(v). The difference is that back edges and edges connecting to the outside of the DFS-tree are now taken into account. In either case, the corresponding vertices are marked by −1 or ∞, respectively, to indicate that they cannot be part of superbubbloids.
Theorem 1. Let G be a digraph; let T be a DFS-tree on G with a root r that is not an interior vertex or exit of a weak superbubbloid; and denote by π̄ the reverse postorder on T. Then, ⟨s, t⟩ is a weak superbubbloid in G whose vertex set U_st satisfies U_st ∩ V[r] ≠ ∅ if and only if the following conditions are satisfied:
Proof. It was shown in [7] (Theorem 2) that the statement is true for acyclic digraphs. We first note that by Lemma 3, every weak superbubbloid intersecting V[r] is contained in V[r], i.e., in V(T).
Thus, a weak superbubbloid cannot contain the head or tail of a back edge. Only for Condition (i), we also need to consider the case that Hence, it suffices to rule out false positive weak superbubbloids in G by ensuring that every vertex u that violates one of the three conditions also violates (F1) or (F2). This is achieved by setting (3) implements exactly these conditions. Thus, only weak superbubbloids fulfill (F1) and (F2).
Conversely, it suffices to note that by Lemma 4(ii), every weak superbubbloid forms a contiguous interval w.r.t. the postorder π of T and, thus, also w.r.t. the reverse postorder π̄ of T.
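The two obstructions named in this section, back edges and edges to vertices outside V[r], can be located in a single pass over the edges once the reverse postorder is known. The sketch below is an illustration under assumptions, not the modified OutParent and OutChild of Equation (3), which are not reproduced here. It uses the standard fact that an edge (u, v) within a DFS-tree is a back edge exactly when v does not come after u in reverse postorder. The names `reverse_postorder` and `obstructions` are hypothetical, and every vertex is assumed to appear as a key of the adjacency dict.

```python
def reverse_postorder(graph, root):
    """Map each vertex of V[root] to |V[root]| - post(v) - 1."""
    post, seen = {}, {root}
    stack = [(root, iter(graph.get(root, ())))]
    while stack:
        v, neighbors = stack[-1]
        for w in neighbors:
            if w not in seen:
                seen.add(w)
                stack.append((w, iter(graph.get(w, ()))))
                break
        else:                              # v is completed
            post[v] = len(post)
            stack.pop()
    n = len(post)
    return {v: n - p - 1 for v, p in post.items()}


def obstructions(graph, root):
    """Return (rpo, bad_in, bad_out) for the DFS-tree rooted at `root`.

    bad_in:  vertices of V[root] with an incoming back edge or an incoming
             edge from a vertex outside V[root].
    bad_out: vertices of V[root] with an outgoing back edge.
    """
    rpo = reverse_postorder(graph, root)
    bad_in, bad_out = set(), set()
    for u, outs in graph.items():
        for v in outs:
            if u in rpo and v in rpo:
                if rpo[v] <= rpo[u]:       # back edge (self-loops included)
                    bad_out.add(u)
                    bad_in.add(v)
            elif v in rpo:                 # edge entering V[root] from outside
                bad_in.add(v)
            # For a single root, no edge leaves V[root], since V[root]
            # is closed under reachability.
    return rpo, bad_in, bad_out
```

Keeping the incoming and outgoing obstructions apart is a design choice of this sketch; how each of them translates into the marks −1 and ∞ depends on the precise definitions in [7].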
We denote by Superbubble the algorithm DAGsuperbubble with the modified functions OutParent( . ) and OutChild( . ) as described above. By construction, Superbubble identifies minimal intervals of π̄ that satisfy (F1) and (F2); see Figure 1 for an illustration and [7] for full details. Since the modification of OutParent( . ) and OutChild( . ) only amounts to setting additional entries to −1 or ∞, respectively, the performance remains unaffected. According to Theorem 1, the minimal intervals satisfying (F1) and (F2) are exactly the minimal weak superbubbloids and, thus, by definition, the weak superbubbles. Therefore, we have:
Corollary 3. Let G be a digraph, and let T be a DFS-tree on G with a root r that is not an interior vertex or exit of a weak superbubble. Then, Superbubble correctly identifies exactly the weak superbubbles ⟨s, t⟩ in G whose vertex set satisfies U_st ∩ V[r] ≠ ∅.
It is straightforward to extend this result to a DFS-forest that covers V(G) entirely: This forest is constructed by first constructing T_1 with root r_1 covering V[r_1]. Then, T_2 is constructed from a root r_2 searching on V(G) \ V[r_1], and so on; see Lemma 2. This amounts to constructing an auxiliary graph G′ from G by adding an artificial root r_0 and out-edges (r_0, r_1), (r_0, r_2), ..., (r_0, r_k), and defining the sibling order of the roots as r_1, r_2, ..., r_k. The DFS-forest F(r_1, ..., r_k) with given roots r_1, r_2, ..., r_k on G is then equivalent to the DFS-tree T′ rooted at r_0 on G′ if we define the reverse postorder of T′ such that π̄(r_0) = 0. We note, furthermore, that the sibling order, i.e., the order in which the roots are used to seed DFS, is arbitrary.
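A DFS-forest seeded with a given list of roots can be sketched as follows. The function name `dfs_forest_reverse_postorder` and the dict-of-lists graph representation are assumptions of this example. The artificial root r_0 is not materialized; index 0 is merely reserved for it, and all other vertices receive the reverse postorder they would have in the DFS-tree rooted at r_0. Whether this ordering of the trees matches the concatenation intended in Corollary 4 is an assumption.

```python
def dfs_forest_reverse_postorder(graph, roots):
    """Return (rpo, tree_of) for the DFS-forest seeded with `roots` in order.

    rpo[v] is the reverse postorder index of v in the auxiliary tree whose
    artificial root would carry index 0; tree_of[v] is the root of the
    tree that contains v.
    """
    seen, finished, tree_of = set(), [], {}
    for r in roots:
        if r in seen:                      # already covered by an earlier tree
            continue
        seen.add(r)
        tree_of[r] = r
        stack = [(r, iter(graph.get(r, ())))]
        while stack:
            v, neighbors = stack[-1]
            for w in neighbors:
                if w not in seen:
                    seen.add(w)
                    tree_of[w] = r
                    stack.append((w, iter(graph.get(w, ()))))
                    break
            else:
                finished.append(v)         # v is completed
                stack.pop()
    total = len(finished)
    # The artificial root would be completed last and thus receive index 0.
    rpo = {v: total - i for i, v in enumerate(finished)}
    return rpo, tree_of
```

Each vertex is pushed once and each out-edge is inspected once, in line with the linear running time stated for the DFS.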

Corollary 4.
Let G be a digraph, and let F be a DFS-forest on G comprising DFS-trees T_i with roots r_i, 1 ≤ i ≤ k, none of which is an interior vertex or exit of a weak superbubble. Let π̄ be the reverse postorder on F, obtained by concatenating the reverse postorders on the constituent DFS-trees. Then, Superbubble correctly identifies exactly the weak superbubbles ⟨s, t⟩ in G. Furthermore, given the roots r_i, Superbubble has a running time of O(|E| + |V|).
Proof. Correctness follows immediately from Corollary 3, the construction of T′ on the auxiliary digraph G′, and Lemma 2. During the DFS, each out-edge is considered exactly once, and each vertex is traversed twice. The number k of required roots is limited by |V|. For each vertex v, checking whether OutParent(v) = −1 or OutChild(v) = ∞ requires checking all neighbors only; hence, the total effort is no more than O(|E| + |V|). The linear time complexity of DAGsuperbubble, finally, is proven in [7].
It is important to note that the correctness of Superbubble (Corollary 3) crucially depends on the correct choice of the root r of the DFS-tree. The remaining problem thus is to find a suitable sequence of roots r_1, r_2, ..., r_k.

Definition 3.
A vertex r ∈ V is a legitimate root if for every weak superbubble ⟨s, t⟩ in G with vertex set U_st, we have either U_st ⊆ V[r] and t ≺ s (in the ancestor order of a search tree with root r), or U_st ∩ V[r] = ∅.
We can summarize the discussion in the following form:
Corollary 5. The algorithm Superbubble detects all weak superbubbles in G if and only if there is a set {r_1, r_2, ..., r_k} of legitimate roots such that the DFS-forest F(r_1, r_2, ..., r_k) covers V(G).

Corollary 6. A vertex r ∈ V is a legitimate root if and only if r is neither an interior vertex nor the exit of a weak superbubble.
Proof. By Corollary 4, a root is legitimate if it is not the exit or an interior vertex of a weak superbubble. Conversely, if r is an interior vertex or the exit of ⟨s, t⟩, then either a DFS-tree rooted in r does not reach the entrance s at all, or there is no search tree with root r such that t ≺ s, since, by the definition of a weak superbubble, the exit t is found before s along every path from r to s.

Lemma 5.
Let G be a digraph and v ∈ V(G) a source, i.e., a vertex with in-degree zero. Then, v is a legitimate root.

Proof.
Since v is not reachable from any other vertex, it is only reachable by DFS if the traversal starts in v. By the same argument, v is neither an interior vertex, nor the exit of a weak superbubble and, thus, is a legitimate root.
Unfortunately, there is no guarantee that a digraph G has source vertices, and even if they exist, not every vertex of G is necessarily reachable from them. The task is, therefore, to identify legitimate roots located within strongly-connected components.
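Sources are easy to collect, and the same pass shows which vertices remain uncovered when only sources are used as roots. The following sketch uses the hypothetical helper names `sources` and `not_reachable_from` and assumes that every vertex appears as a key of the adjacency dict.

```python
def sources(graph):
    """Vertices with in-degree zero; every vertex must be a key of `graph`."""
    has_in_edge = {v for outs in graph.values() for v in outs}
    return [v for v in graph if v not in has_in_edge]


def not_reachable_from(graph, roots):
    """Vertices that no search started at one of `roots` can reach."""
    seen = set(roots)
    stack = list(roots)
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return set(graph) - seen
```

Any vertex returned by `not_reachable_from(graph, sources(graph))` can only be covered by a root chosen inside a cycle, which is the situation addressed in the following sections.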

Cycles, C -Covers, and C -Cuts
By definition, the vertices c_i are pairwise distinct and indexed consecutively along C. Importantly, cycle intervals contain only the interior of the unique path in C connecting the defining endpoints c_i and c_j. Thus, C(c_1,
The C-distance of two vertices c_i and c_j along a cycle C is the length of the directed path, i.e., the number of edges, from c_i to c_j. More explicitly, since the number of inner vertices is one less than the number of edges. In particular, Another useful consequence of the definition of d_C is: The following implication will be useful later on:
Corollary 7. Let G be a digraph; let C be a cycle in G; and let c_1, c_2,
Proof. If c_1, c_2, and c_3 are pairwise distinct, the l.h.s. is true if the path from c_1 to c_2 is a subpath of the path from c_1 to c_3, i.e., c_1 ∉ C(c_2, c_3) and, thus, For c_1 = c_2, both the l.h.s. and the r.h.s. are satisfied only if c_2 = c_3.
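Cycle intervals and the C-distance can be made concrete with a small sketch. It assumes that a cycle is stored as a Python list of its vertices in cyclic order, so that consecutive entries (and the last and first entry) are joined by an edge; the function names `cycle_interval` and `cycle_distance` are hypothetical. The convention for identical endpoints follows the statement, made later in the text, that C(v, v) = C \ {v}; the resulting value of the distance for identical endpoints is therefore an assumption.

```python
def cycle_interval(cycle, u, v):
    """Interior vertices of the path along `cycle` from u to v.

    The endpoints themselves are excluded. For u == v, the path runs once
    around the whole cycle, so the interval is the cycle without u.
    """
    n = len(cycle)
    i, j = cycle.index(u), cycle.index(v)
    steps = (j - i) % n or n               # number of edges from u to v
    return [cycle[(i + k) % n] for k in range(1, steps)]


def cycle_distance(cycle, u, v):
    """Number of edges from u to v along the cycle (interior vertices + 1)."""
    return len(cycle_interval(cycle, u, v)) + 1
```

With this convention, adjacent vertices on the cycle have distance 1 and an empty interval.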
In the following, it will be useful to know whether two vertices on a cycle are also reachable via a directed path that is disjoint from C. We formalize this idea as a binary relation on C.
Definition 5. Let G be a digraph and C a cycle in G and s, t ∈ V. Then, t is C-reachable from s, in symbols
We have used the letters s and t here since C-reachability will be used to identify potential candidates for entrance and exit of superbubbles. C-reachability is defined not only for vertices in the "reference cycle" C. It satisfies a restricted transitivity property: If v ∈ V(G) \ C, v is C-reachable from s, and t is C-reachable from v, then t is C-reachable from s. Another interesting observation is that if s is C-reachable from itself, then there is a directed cycle C′ such that C ∩ C′ = {s}. As an immediate consequence of Definition 5, we obtain:
Then, c_1 and c_2 are connected by (at least) two edge-disjoint directed paths. In particular,
Definition 6. Let C be a cycle in the digraph G and u, v ∈ C. Then,
As Consider two C-intervals C(u, v) and C(x, y) on the same cycle C of the digraph G. We say that , since the interval boundaries themselves are not considered part of the C-intervals. For each pair of distinct C-intervals, thus exactly one of the following four statements is true: (a) the C-intervals are disjoint; (b) one C-interval is contained in (i.e., is a proper subset of) the other one; (c) one C-interval, say C(x, y), extends the other one, but not vice versa, i.e., x ∈ C(u, v) and y ∉ C(u, v); (d) both C-intervals extend each other, i.e., x, y ∈ C(u, v). Figure 2 illustrates the four cases. Note that in Case (c), the interval boundaries are arranged in the order
In the following, we will use the notation: for the set of all C -covered intervals and the set of all C -covered vertices of C, respectively. Note that ∅ ∈ Q(C) since v is C-reachable from u for every (u, v) ∈ E(C). By the same argument, there is at least one interval C(u, v) ∈ Q(C) for each u ∈ C, albeit some or even all of these may be empty.

Definition 7.
A subset B ⊆ Q(C) is a C -cover of C if the union of all intervals in B equals Q(C), and B is a total C -cover of C if the union of all intervals in B equals C. We say that C is totally C -covered if C has a total C -cover.
Note that C is totally C -covered if and only if Q(C) = C.
Obviously, C is either totally C -covered or it has a non-empty set K(C) of C -cut vertices.
In other words, in a clean C -cover, no C -covered interval is contained within another one.

Corollary 8.
Let C be a cycle in the digraph G, and let B be a clean C -cover. Then, either B = {∅} or, for every
Since the empty set is a subset of every other set, d_C(u, v) > 1 for every
Lemma 7. Let C be a cycle in the digraph G. Then, Q(C) contains a clean C -cover B.
Proof. Let B ⊆ Q(C) be a set of C -covered intervals that together C -cover Q(C). Suppose B is not clean. Then, there are two intervals C(p, q) ∈ B and C(u, v) ∈ B such that C(p, q) ⊂ C(u, v). Then, B′ = B \ {C(p, q)} still C -covers Q(C). The removal of such redundant intervals can be repeated until no further removable interval can be found. By Definition 9, the remaining C -cover is clean.
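The pruning step used in this proof can be sketched in a few lines. The sketch assumes that a family of intervals is given as a dict mapping an endpoint pair (u, v) to the collection of interior vertices of C(u, v); the names `clean_cover` and `cut_vertices` are hypothetical. Instead of removing one redundant interval at a time, all intervals that are properly contained in another one are dropped at once; since proper containment is transitive on a finite family, the surviving intervals still contain every removed one. Treating every uncovered cycle vertex as a cut vertex follows the example discussed with Figure 3 and is otherwise an assumption, since the formal definition is not reproduced here.

```python
def clean_cover(intervals):
    """Drop every interval whose vertex set is properly contained in another."""
    sets = {pair: frozenset(members) for pair, members in intervals.items()}
    return {
        pair: members
        for pair, members in sets.items()
        if not any(members < other for other in sets.values())
    }


def cut_vertices(cycle, intervals):
    """Cycle vertices that are not contained in any of the given intervals."""
    covered = set().union(*intervals.values())
    return [c for c in cycle if c not in covered]
```

A cover is total exactly when `cut_vertices` returns an empty list for it.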

Definition 10.
Let C be a cycle in a digraph G. Then:
By definition, L(C) consists of all C -covered intervals for which there is no larger C -covered interval with the same starting point. Since every C(p, q) ∈ Q(C) \ L(C) is contained in an interval with the same starting point, L(C) is a C -cover of C. Thus, Lemma 7 implies:
Corollary 9. Let C be a cycle in a digraph G. Then, there is a clean cover B ⊆ L(C).

Lemma 8.
Let C be a cycle in the digraph G. A clean C -cover B of C is total if and only if B ≠ ∅ and every B ∈ B is extended by at least one B′ ∈ B.
Proof. If B = ∅, then Q(C) = ∅, and thus, B is not total. In the following, we assume that B ≠ ∅ is a clean C -cover. Suppose, for contradiction, that C(u, v) ∈ B is not extended by any B′ ∈ B. Then, any interval B′ ∈ B C -covering v would have to contain C(u, v), contradicting the assumption that B is clean. Thus, v is a C -cut vertex, and hence, B is not total. If B is total, therefore, it is non-empty, and every B ∈ B is extended by some B′ ∈ B.
Conversely, suppose c is a C -cut vertex of C. If C(u, c) ∈ B for some u, then the first part of the proof implies that C(u, c) is not extended by any B′ ∈ B. If B contains no interval C(u, c), then consider the vertex v such that C(u, v) ∈ B for which d_C(v, c) is minimal. Since c is a C -cut vertex, there is no extension of C(u, v), since any such extension B′ would either contradict the minimality of d_C(v, c) or C -cover c, thereby contradicting the assumption that c is a C -cut vertex. Thus, v is again a C -cut vertex. As shown in the first part of the proof, C(u, v), therefore, is not extended by any B′ ∈ B. We conclude that unless B is a total C -cover or B = ∅, there is an interval B ∈ B without an extension. Figure 3 shows an example of a cycle with a total C -cover and a cycle with a C -cut, respectively.
Since the largest C -interval in C is C(v, v) = C \ {v} for some v, every total C -cover comprises at least two C -covered intervals.
Figure 3. (a) The green cycle C in the top panel has five C -paths indicated in red. In the middle panel, C is laid out linearly to emphasize the C -covered intervals. Below is the clean C -cover obtained by removing all C -intervals that are contained in longer ones. Note that every C -interval is extended by another one; hence, the C -cover is total. (b) Again, the top panel highlights C in green and the C -paths in red. The linear layout below highlights that Vertex 1 is not C -covered. Thus, it is a C -cut vertex.
The following lemma provides us with a convenient way to obtain a total C -cover.

Lemma 9.
Let C be a cycle in G, v ∈ C, and c_1, and v C c_4 imply that B := {C(c_1, c_4), C(c_2, c_3)} is a total clean C -cover of C.
Proof. By construction, C(c_1, c_4) and C(c_2, c_3) are C -covered intervals. By definition, we have c_2 ∈ C(c_1, c_4), and C(c_2, c_3) extends C(c_1, c_4). Since c_3 ∈ C(c_1, c_4), the two intervals cover all of C. Furthermore, the cover B consists of only two intervals that are not subsets of each other; thus, it is clean.
We will refer to this type of total clean C -cover as a single-vertex cover of C. An example is shown in Figure 4.
Figure 4. As in Figure 3, the cycle C and the C -paths are highlighted in green and red, respectively. The paths (0, 5, 4) and (3, 5, 1) imply that C(0, 4) and C(3, 1) are C -covered. It is a single-vertex cover conforming to Lemma 9.

Cycles, C -Cover, C -Cuts, and Superbubbles
A key result of [5] states that every superbubble is either contained in or disjoint from any strongly-connected component. The following results on the interaction of cycles and superbubbles are a generalization of this observation. The acyclicity condition (S.v) can be restated in the following way:
Lemma 10. Let ⟨s, t⟩ be a weak superbubbloid in the digraph G and u ∈ ⟨s, t⟩. Then, every cycle containing u also contains s and t.
Proof. If u ≠ s, then all in-neighbors of u are contained in ⟨s, t⟩. Similarly, if u ≠ t, then all out-neighbors of u are contained in ⟨s, t⟩. Since every cycle through u contains both in- and out-neighbors of u, it, in particular, contains an edge e in ⟨s, t⟩. (S.v) now implies that any cycle through e contains both s and t.
Lemma 11. Let C be a cycle in the digraph G, and let B be a total clean C -cover of C. If C(u, v) ∈ B, then v is neither an interior, nor an exit of a weak superbubble, i.e., v is a legitimate root.
Proof. Assume, for contradiction, that v is an interior vertex or the exit of the weak superbubble ⟨s, t⟩. Since C is totally C -covered by assumption, Corollary 8 implies d_C(u, v) > 1. Thus, by Lemma 6, there are (at least) two edge-disjoint paths from u to v. Since neither path can leave ⟨s, t⟩ before passing through s, neither of them contains the entrance s, and hence, both are contained in the weak superbubble. Thus, ⟨s, t⟩ contains C(u, v).
Since B is a total clean C -cover of C, there is an interval C(p, q) ∈ B that extends C(u, v), i.e., p ∈ C(u, v), and hence, p is an inner vertex of ⟨s, t⟩. Therefore, ⟨s, t⟩ contains C(p, q), and it again has an extending C -interval. Repeating the argument, we conclude that every vertex of Q(C) is an inner vertex of ⟨s, t⟩. Since the cover B is total, Q(C) = C, i.e., the cycle C consists entirely of interior vertices of ⟨s, t⟩, i.e., C is properly contained in ⟨s, t⟩. This contradicts the acyclicity condition (S.v).

Corollary 10.
Let C be a cycle in the digraph G. Suppose C is totally C -covered, and let C(u, v) ∈ L(C) such that Then, v is a legitimate root.
Proof. The longest C -interval C(u, v) ∈ L(C) by construction cannot be contained within another C -interval. Therefore, C(u, v) is contained in the clean cover B ⊆ L(C) of Corollary 9. By Lemma 11, its endpoint v is a legitimate root.
Let us now turn to cycles with C -cut vertices: Lemma 12. Let C be a cycle of the digraph G, and let c be a C -cut point of C, i.e., c ∈ K(C). Then, c is not an interior vertex of any weak superbubble.
Proof. Assume, for contradiction, that c is an interior vertex of a weak superbubble ⟨s, t⟩. Then, there is a path p from s to t not passing through c. Otherwise, ⟨s, c⟩ is a superbubbloid, contradicting the assumption that ⟨s, t⟩ is a weak superbubble; see Corollary 5 in [7]. Along p, let u be the last vertex on C before c, and let v be the first vertex on C after c. Thus, v is C-reachable from u. Therefore, c is C -covered in C, a contradiction.
The example in Figure 5 shows that it is possible that every entrance of a superbubble is at the same time the exit of another superbubble. Such graphs do not have any legitimate root. Nevertheless, it is easily possible to obtain all the superbubbles. To this end, fix a C -cut vertex c for some cycle C in G, and consider the auxiliary digraph G# obtained from G by splitting c into two vertices c′ and c″ so that c′ retains only the in-edges and c″ retains only the out-edges.
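The vertex split that produces the auxiliary digraph can be written down directly, although the text notes further on that an explicit construction is not needed in practice. In the sketch below, the function name `split_vertex` and the labels `c_in` and `c_out` for the two copies are hypothetical; `c_in` keeps the in-edges of c and `c_out` keeps its out-edges. The graph is assumed to be a dict of out-neighbor lists with every vertex as a key.

```python
def split_vertex(graph, c, c_in, c_out):
    """Return a copy of `graph` in which c is split into c_in and c_out.

    c_in receives all in-edges of c and has no out-edges;
    c_out receives all out-edges of c and has no in-edges.
    """
    def redirect(outs):
        # Every edge that pointed to c now points to the in-copy.
        return [c_in if v == c else v for v in outs]

    split = {u: redirect(outs) for u, outs in graph.items() if u != c}
    split[c_in] = []
    split[c_out] = redirect(graph[c])
    return split
```

After the split, the out-copy is a source of the auxiliary digraph, so a DFS can be rooted there.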
Lemma 13. Let C be a cycle in the digraph G, c ∈ C a C -cut vertex, and G# the digraph obtained from G by splitting c. If ⟨s, t⟩ is a weak superbubble in G, then it is also a weak superbubble in G#, where c as an entrance in G corresponds to c″ in G# and c as an exit in G corresponds to c′ in G#. Conversely, every weak superbubble ⟨s, t⟩ with {s, t} ≠ {c′, c″} in G# is also a weak superbubble in G.

Proof.
For the proof, we construct the auxiliary graphG # by inserting the edge (c , c ) into G # . Then, there is a 1-1 relationship between the set of paths in G and the set of paths that do not start or end with the edge (c , c ) inG # , which is constructed as follows: If p starts at c in G, it starts in c inG # ; if p ends at c in G, it ends at c inG # ; and if p runs through c in G, then it runs through the edge (c , c ) inG # . The 1-1 correspondence of weak superbubbles now follows immediately from the equivalence of the path systems in G andG # since reachability is the same for every pair u, v, with c as the starting point corresponding to c and c as the endpoint corresponding to c . Thus, G andG # have the same superbubbles, except possibly for the ones with {s, t} = {c , c } inG # . Now, consider a DFS-tree on G # rooted in c . The edge (c , c ) is not a tree edge and necessarily appears as a back edge. Since c is a C -cut vertex, c and c are not interior vertices of any weak superbubble inG # . Thus, the edge (c , c ) does not affect any weak superbubble ofG # , and thus, G # andG # have the same weak superbubbles, except possibly the ones with {s, t} = {c , c }. The only potential differences between the weak superbubbles of G and G # is, therefore, the possibility that G # contains c , c or c , c as an additional weak superbubble. Of course, it is easy to detect and remove the additional weak superbubble. Since c is a source in G # , we can apply Superbubble to G # and remove the possible spurious weak superbubble c , c in order to obtain the correct set of weak superbubbles of G. In contrast to the auxiliary digraph constructions suggested in [5], G # contains only a single extra vertex instead of doubling the size. More importantly, however, it not necessary to construct G # explicitly. 
Instead, one can modify the DFS starting at c in G in the following manner: when c is encountered for the first time as an out-neighbor of a tree vertex u, then c″ is inserted with parent u and no further out-neighbors, with only a constant overhead. The algorithm Superbubble applied to G# extracts the minimal intervals satisfying (F1) and (F2) with respect to the reverse postorder π of the DFS-tree rooted at c′, and thus correctly identifies the weak superbubbles of G#. The modified DFS on G rooted at c by construction yields the same DFS-tree as on G#, and thus the same reverse postorder. Together with setting OutChild(c′) = OutChild(c), OutParent(c″) = OutParent(c), OutChild(c″) = ∞, and OutParent(c′) = −1, Superbubble operating on the modified DFS-tree thus correctly identifies the weak superbubbles in G#. We refer to this algorithm, which is equivalent to applying Superbubble to G#, as Superbubble#.
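Although G# never has to be built explicitly, constructing it once can help when checking small examples by hand. The following Python sketch assumes that the digraph is given as a dictionary of out-neighbor lists. The function name split_vertex and the tuple labels (c, "out") and (c, "in") for the two copies of c (the copy keeping the out-edges, which becomes a source, and the copy keeping the in-edges) are illustrative choices and not part of the reference implementation.

```python
def split_vertex(adj, c):
    """Return a new adjacency dict in which c is replaced by two copies:
    a source copy that keeps the out-edges of c and a sink copy that
    keeps the in-edges of c. The input dict is left unchanged."""
    src, snk = (c, "out"), (c, "in")      # hypothetical labels for the copies
    split = {src: [], snk: []}
    for u, outs in adj.items():
        tail = src if u == c else u       # edges leaving c now leave the source copy
        split.setdefault(tail, [])
        for v in outs:
            head = snk if v == c else v   # edges entering c now enter the sink copy
            split[tail].append(head)
            split.setdefault(head, [])
    return split
```

On the three-vertex cycle c → a → b → c, for example, the result is the path (c, "out") → a → b → (c, "in"), in which the source copy has no in-edges and the sink copy has no out-edges.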
Definition 11. Let G be a digraph. Then, r ∈ V is a quasi-legitimate root if either: (i) r is a source in G, (ii) r is the end point of an interval C(u, r) ∈ B of a total clean C-cover B of some cycle C in G, or (iii) r is a C-cut vertex of some cycle C in G.
Our discussion so far can be summarized as:

Corollary 11. Algorithm Superbubble# correctly identifies the superbubbles in G[V[r]] if and only if r is a quasi-legitimate root.
As an immediate consequence of Lemmas 11 and 12, every cycle contains a quasi-legitimate root. Recalling that every vertex in the digraph G can be reached either from a source vertex or from a cycle, we finally obtain:
Theorem 2. Every digraph G contains a set of quasi-legitimate roots {r_1, r_2, ..., r_k}. Given these roots, the algorithm Superbubble# correctly identifies all superbubbles of G in linear time.
It remains to show, therefore, that a suitable set of roots can be identified in linear time. Clearly, this is possible for the sources. For superbubbles that cannot be reached from a source vertex, a suitable set of cycles needs to be identified.

Proof.
Let v_i be the first root of F that can reach any vertex of C. Then, by definition of a cycle,
The same is true for strongly-connected components:
Lemma 15 ([8], Corollary 11). Let S be a strongly-connected component in G, and let T be a DFS-tree with S ⊆ V(T). Then, there is a vertex v ∈ S such that S ⊆ V(T[v]). We call v the root of the strongly-connected component S in T.
Our aim is now to find a set of "start cycles" such that every cycle C is reachable from at least one of these start cycles.
Lemma 16. Let T be a DFS-tree on the digraph G rooted in v, and let W be the set of ≺-maximal vertices w that have an incoming back edge (u, w). Then, (i) every w ∈ W is contained in a cycle, and (ii) every cycle C ⊆ V(T) satisfies C ⊆ V(T[w]) for some w ∈ W.
Proof. Property (i) is an immediate consequence of the definition of DFS. Now, suppose u ∉ V(T[w]) for every w ∈ W. Then, by construction, none of the vertices along the path from the root v to u has an incoming back edge, and thus, neither u nor any of its ancestors is contained in a cycle. Thus, if x ∈ C for some cycle C ⊆ V(T), then a vertex w ∈ W exists such that x ∈ V(T[w]), and thus, C ⊆ V(T[w]).
does not contain a cycle. Since the vertex set of every cycle in the digraph G is necessarily contained in one of the constituent trees of a DFS-forest, we immediately obtain:
Corollary 12. Let F be a DFS-forest on the digraph G, and let W be the set of ≺-maximal vertices w that have an incoming back edge (u, w). Then, (i) every w ∈ W is contained in a cycle, and (ii) every cycle C in G satisfies C ⊆ V(T[w]) for some w ∈ W and some T ∈ F.
Lemma 17. A set of cycles {C_1, C_2, ..., C_n} from which all cycles in G are reachable can be constructed in O(|E| + |V|) time.
Proof. The DFS-forest F on the digraph G is obtained in O(|E| + |V|) time. The set W is easily identified by a preorder traversal of F, omitting a subtree as soon as a vertex w has an incoming back edge. The worst-case effort is O(|V|) since we only traverse the forest, not the entire digraph G. Given W and the associated back edges (u_k, w_k) identified in the previous step for each w_k ∈ W, the cycle C_k is explicitly retrieved by following the parent links of F from u_k back to w_k in O(|V|) time.
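The construction used in this proof can be sketched in a few lines of Python. The sketch assumes that the digraph is given as a dictionary mapping every vertex to the list of its out-neighbors. The function name start_cycles and all internal variable names are hypothetical and are not taken from the LSD code base.

```python
def start_cycles(adj):
    """Return one cycle (vertex list in cycle order) for each topmost
    DFS-forest vertex that has an incoming back edge."""
    parent, state, order = {}, {}, []   # state: 1 = on DFS stack, 2 = finished
    back_into = {}                      # w -> tail u of one back edge (u, w)
    for root in adj:
        if root in state:
            continue
        parent[root], state[root] = None, 1
        order.append(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if w not in state:      # tree edge: descend into w
                    parent[w], state[w] = u, 1
                    order.append(w)
                    stack.append((w, iter(adj[w])))
                    break
                if state[w] == 1:       # back edge: w is u or an ancestor of u
                    back_into.setdefault(w, u)
            else:                       # all out-edges of u handled
                state[u] = 2
                stack.pop()
    cycles, covered = [], set()
    for w in order:                     # preorder: ancestors are seen first
        p = parent[w]
        if p is not None and p in covered:
            covered.add(w)              # below an already selected vertex
        elif w in back_into:
            covered.add(w)
            path, x = [], back_into[w]
            while x != w:               # follow parent links from u back to w
                path.append(x)
                x = parent[x]
            path.append(w)
            cycles.append(path[::-1])
    return cycles
```

For the digraph with edges 1→2, 2→3, 3→1, 3→4, 4→5, 5→4, and 6→1, the sketch returns only the cycle [1, 2, 3]: the cycle on {4, 5} lies below vertex 1 in the DFS-tree and is therefore reachable from the selected cycle.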
Lemma 17 ensures that a sufficient set of cycles can be found in linear time. More precisely, using the sources of G and a quasi-legitimate root r_i in each cycle C_i as roots, the algorithm Superbubble# correctly identifies all superbubbles in G in linear time. It remains to show that we can identify a quasi-legitimate root in a cycle C_i.

Identification of Quasi-Legitimate Roots
The obvious approach to identify quasi-legitimate roots is to construct a clean C-cover. The natural starting point is L(C), since it requires the construction of no more than |C| C-paths. This can be achieved in polynomial time, e.g., using an independent DFS-tree rooted at each c ∈ C that ignores the edges of C. This naive approach, however, exceeds linear time even for a single cycle.
For c ∈ C, we construct a modified DFS-tree T_c by excluding all other vertices of C from G. By construction, u ∈ C is C-reachable from c if and only if T_c contains an in-neighbor u′ of u, i.e., there is an edge (u′, u) ∈ E(G) with u′ ∈ V(T_c).
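A minimal Python sketch of this test is given below. It assumes a digraph stored as a dictionary of out-neighbor lists and a cycle given as a list of vertices in cycle order. It further assumes that the edges of C themselves do not count as detours, so that the successor of c is reported only if it is also reached by an edge outside the cycle. The function name c_reachable is hypothetical.

```python
def c_reachable(adj, cycle, c):
    """Cycle vertices that have an in-neighbor in the modified DFS-tree
    grown from c while all other cycle vertices are excluded.
    Assumption: the cycle's own edges are not counted."""
    on_cycle = set(cycle)
    n = len(cycle)
    cycle_edges = {(cycle[i], cycle[(i + 1) % n]) for i in range(n)}
    tree, stack, hits = {c}, [c], set()
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y in on_cycle:
                if (x, y) not in cycle_edges:
                    hits.add(y)        # x is a tree vertex and an in-neighbor of y
            elif y not in tree:
                tree.add(y)            # grow the tree through off-cycle vertices only
                stack.append(y)
    return hits
```

With the cycle [1, 2, 3, 4] and the additional edges 1→5, 5→3, 5→6, and 6→1, the vertices reported for c = 1 are 3 and 1 itself, while nothing is reported for c = 2.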
For each v ∈ V(T_c), we are interested in the vertices c_min ∈ C and c_max ∈ C that are C-reachable from v and minimize and maximize d_C(c, c_min) and d_C(c, c_max), respectively. Repeating this for each c ∈ C, however, will in general exceed linear time since the length |C| is not bounded.
We can mostly reuse the information stored in T_c, however. A crucial observation is the following:
Lemma 18. Let C be a cycle of the digraph G; consider two distinct cycle vertices c_1, c_2 ∈ C; and let v ∉ C with c_1 C v and c_2 (2), this implication can be used in particular for every c ∈ C for which v C c might hold. Therefore, the same two vertices minimize and maximize d_C(c_1, ·) and d_C(c_2, ·), and thus, we arrive at c_min [c_2 c_3). Then, c_3 ≠ c_4 (otherwise, the distances would be equal), and Corollary 7 implies c_1, c_4). By Lemma 9, B := {C(c_1, c_4), C(c_2, c_3)} is a one-vertex cover of C.
Lemma 18 is useful because it allows us either to reuse the c_min[c_1, v] and c_max[c_1, v] values for c_2, or to obtain a one-vertex C-cover, which immediately provides us with a legitimate root according to Lemma , v] for all v ∈ V(T_c) correctly. We have already seen above how to handle tree edges. Forward edges in T_c do not effectively contribute, because the same information (minimization or maximization over values of d_C(c, ·)) is also propagated stepwise along the tree edges. Cross edges, on the other hand, could add information. Postorder traversal ensures, however, that the pertinent information at their starting points is already computed in time to be included in the correct value, i.e., we simply have to include the cross edges in the minimization/maximization step.
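The propagation over tree, forward, and cross edges can be illustrated with a short Python sketch. It is only a simplified sketch under several assumptions: the off-cycle part visited from c contains no back edges (any such edge is silently skipped here and needs the separate treatment discussed in the text), d_C(c, x) is taken to be the number of cycle steps from c to x in the range 0 to |C| − 1, and the digraph is a dictionary of out-neighbor lists. The function name min_max_targets is hypothetical.

```python
def min_max_targets(adj, cycle, c):
    """For each off-cycle vertex v visited from c, return (c_min, c_max):
    the cycle vertices reachable from v through off-cycle vertices that
    minimize and maximize d_C(c, .), or None if there is no such vertex."""
    pos = {x: i for i, x in enumerate(cycle)}
    n = len(cycle)

    def dist(x):                        # d_C(c, x), assumed to lie in 0..n-1
        return (pos[x] - pos[c]) % n

    best, seen = {}, set()
    stack = [(v, False) for v in adj.get(c, ()) if v not in pos]
    while stack:
        v, done = stack.pop()
        if not done:
            if v in seen:
                continue
            seen.add(v)
            stack.append((v, True))     # come back to v after its descendants
            stack.extend((y, False) for y in adj.get(v, ())
                         if y not in pos and y not in seen)
        else:                           # postorder step for v
            cands = []
            for y in adj.get(v, ()):
                if y in pos:
                    cands.append(y)     # edge directly into the cycle
                elif best.get(y):       # tree, forward, or cross edge: y is finished
                    cands.extend(best[y])
            best[v] = (min(cands, key=dist), max(cands, key=dist)) if cands else None
    return best
```

In this sketch, forward and cross edges need no special case: when the postorder step for v is executed, the value of every finished out-neighbor is already stored in best and simply enters the minimization and maximization.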
Back edges are problematic when they belong to the same strongly-connected component S as C ⊆ S. In this case, they can be reached from a cycle vertex c ∈ C and themselves reach a cycle vertex u ∈ C. Such back edges therefore influence which cycle vertices are reachable. To handle this information, S is split into parts that are strongly-connected components under the use of C-reachability. More precisely,
Proof. The only missing information could be a back edge (u, w) with u ∈ V(T[v]) and v ≺ w. Such a back edge cannot exist because v is by assumption the root of a C-SCC, and thus, there is no cycle including u, v, and w in G[V(G) \ C].
After processing all vertices of C, we have either found a one-vertex C-cover of C, or we know, for every c_j ∈ C, the largest C-covered interval C(c_j, c_max(c_j)). Thus, we directly conclude:
L(C) := {C(c_j, c_max(c_j)) | c_j ∈ C and c_max(c_j)} (9)
In particular, we have shown that for each C, L(C) or a one-vertex cover can be constructed in linear time.
To detect a quasi-legitimate root, it is necessary to first decide whether C has a total C-cover or whether a non-empty set K(C) of C-cut vertices exists. To this end, a clean C-cover B can be used efficiently.
Recall that by Lemma 8, every interval in a clean C-cover is extended by at least one other interval from the C-cover. Since a clean C-cover contains at most |C| intervals, it is easy to check in linear time whether a C-cut vertex exists: starting from an arbitrary C(u, v) ∈ B, we initialize the upper bound of the C-covered part of C that starts at the successor of u by , in which case a total cover is found, and otherwise, we update x with max(x, d_C(u, v)). If no total cover is found when the intervals are exhausted, then x is a C-cut vertex (see the proof of Lemma 8). With the C(u, v) stored, e.g., as an array a[u], a total cover or the C-cut vertex x is found in O(|C|) operations.
In practice, we do not have access to a clean C-cover. However, L(C) can be computed in linear time, and by Corollary 9, there is a clean C-cover B ⊂ L(C). We can thus use the same procedure. The redundant intervals in L(C) are, by definition, contained within intervals belonging to B, and thus, they do not change the results, provided the initial interval C(u, v) is contained in the clean cover B. By Corollary 10, this is true for the longest interval C(u, v) ∈ L(C). Since L(C) contains at most |C| intervals, the longest interval and a cut point or the validation of a total cover can be computed in O(|C|).
When L(C) is a total C-cover, the longest interval C(u, v) is contained in a total clean cover, and thus, v ] is considered once. The recursive computation along each T_j is also linear. Since the T_j are disjoint, the total effort is still linear.
Finally, we note that by construction, no vertex in ]. Hence, when processing the next cycle C′, the vertices (and edges) already visited in the context of processing C are irrelevant, and thus, G[V[C]] can be disregarded. In other words, the DFS for the next cycle can be performed in the same digraph G, with all previously processed induced subgraphs marked as finished. This ensures an overall linear running time for the identification of starting points for all cycles C_i as in Lemma 17.
Algorithm 1 get_root(C, G) computes a C-cover and determines Q(C), as well as a quasi-legitimate root in C.
Require: digraph G = (V, E) and cycle C
for c ∈ C do
    create DFS-tree T_c with root c by ignoring finished and cycle vertices with preorder ρ.
    while v traverses T_c in postorder do
for c ∈ C in cycle order starting from the successor of u do

Putting It All Together
Theorem 3. Algorithm 2 correctly identifies the superbubbles of a digraph G in linear time.
Proof. Theorem 2 ensures that for every digraph G, there is a set R of quasi-legitimate roots such that, given R, the algorithm Superbubble# identifies all superbubbles of G in linear time. Every vertex in V(G) is reachable from a source or a cycle in G. By Lemma 5, all sources are legitimate roots. Lemma 17 shows that a set of cycles from which all vertices of G that are not reachable from a source can be reached by DFS can be constructed in linear time. Algorithm 1 identifies a quasi-legitimate root in a cycle (Lemma 20). As discussed in the text following Lemma 20, the effort for this step is again linear in the size of G. Algorithm 2 therefore correctly identifies the superbubbles of a digraph G and does so in O(|E| + |V|) time.

Algorithm 2 Identification of all superbubbles in an arbitrary digraph G.
Require: Digraph G
R ← all sources in G
generate a random DFS-forest F̃
find set W of ≺-maximal vertices with a back edge in F̃
generate set C of cycles from W with F̃
for all cycles C_k ∈ C do
    run get_root(C_k, G) to identify quasi-legitimate root r_k
    add r_k to R
generate DFS-forest F with root set R
run Superbubble# on F
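The first and the last two steps of Algorithm 2 are easy to sketch in Python: collecting the sources and growing a DFS-forest from a prescribed, ordered root set. The sketch assumes a digraph stored as a dictionary of out-neighbor lists. The function names sources and dfs_forest are hypothetical, the root-selection step for cycles (get_root) is not implemented here, and returning one reverse-postorder list per tree is only one possible convention for handing the forest to a superbubble detector.

```python
def sources(adj):
    """Vertices without incoming edges."""
    has_in = {v for outs in adj.values() for v in outs}
    return [v for v in adj if v not in has_in]


def dfs_forest(adj, roots):
    """Grow DFS-trees from the roots in the given order and return, for
    each tree, its vertices in reverse postorder."""
    seen, trees = set(), []
    for r in roots:
        if r in seen:                   # already part of an earlier tree
            continue
        seen.add(r)
        post = []
        stack = [(r, iter(adj.get(r, ())))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if w not in seen:       # tree edge: descend into w
                    seen.add(w)
                    stack.append((w, iter(adj.get(w, ()))))
                    break
            else:                       # u is finished
                post.append(u)
                stack.pop()
        trees.append(post[::-1])
    return trees
```

Listing the sources first in the root set ensures that every vertex reachable from a source ends up in a tree rooted at that source, so that only the remaining vertices require roots selected from cycles.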

Results
We extended the "Linear Superbubble Detection" (https://github.com/Fabianexe/Superbubble) software LSD [7] with the new algorithm presented in the previous section. LSD is written in Python and uses the NetworkX package [10] to handle graph data structures. Since the same data structures are used, benchmarking the different algorithms provided in LSD allows a fair comparison of running times.
In the implementation, we deviated from the presentation above in two minor details. First, instead of using the reverse postorder of the DFS-tree, we directly used postorder and the corresponding (trivial) redefinitions of the helper functions OutChild() and OutParent(). Second, we did not completely separate the determination of the cycles, the identification of the roots, and the identification of the superbubbles. Instead, we performed cycle search, root detection, and superbubble identification immediately for each DFS-tree. Since cycles and superbubbles are necessarily completely contained within the DFS-trees, this does not affect the correctness of the algorithm. As a by-product, we obtained a speedup by a constant factor because cycles reachable within a given DFS-tree were marked as "already processed" in the superbubble detection step and hence were not (superfluously) considered as candidate additional roots.
In order to benchmark the direct detection algorithm in comparison to other linear-time superbubble detection algorithms, we used the same datasets as in our previous work [7]. In order to guarantee comparability, performance data for all algorithms were computed with the same version of LSD on the same hardware. The results are summarized in Table 1.
For most datasets, we observed an approximately three-fold speedup of Directbubble compared to LSD. The exception is the Slashdot dataset for which no performance gain was observed.
To understand this outlier, it is necessary to understand the source of the speedup in the other test cases. In a typical case, both Directbubble and LSD performed three depth-first searches: in LSD, they are used to determine SCCs, create auxiliary graphs, and detect superbubbles. Directbubble uses them to identify the cycles, quasi-legitimate roots, and finally the superbubbles. Both need to handle exceptional cases. LSD requires the construction of the Sung graph if an SCC coincides with a connected component of the input graph (rather than being just part of it). Since the Sung graph is twice the size of the SCC, this roughly doubles the running time. Directbubble behaves exceptionally for vertices that are reachable from a source. In this case, the detection of cycles and quasi-legitimate roots in cycles is skipped, yielding a substantial speedup. When a graph had neither an SCC that was also a connected component, nor large subgraphs reachable from a source, then LSD and Directbubble essentially performed the same computations and thus performed very similarly. The Slashdot dataset is such a case. Typically, however, directed graphs have some sources, so that Directbubble outperforms its competitors on most real-life graphs.
Table 1. Comparison of running times. The five combinations of algorithms compared here are: Db (Directbubble) refers to the new approach described in this contribution. LSD (using the auxiliary graphs Ĝ(C) and the stack-based superbubble detector) refers to the algorithm proposed in [7]. S + LSD combines the Sung graphs as auxiliary graphs [5] with the LSD stack-based detector plus a post-filter for the false positives. LSD + B uses the LSD graph construction with the range-query-based detector of [6], and S + B uses Sung graphs together with the range-query-based detector, as well as the necessary post-filters; see [7] for full details.
All computations were performed on a 2.5-GHz quad-core Intel Core i7 processor (Turbo Boost up to 3.7 GHz) with 6-MB shared L3 cache and 16 GB of 1600-MHz DDR3L onboard memory. Test datasets were taken from [11] and from the Stanford Large Network Dataset Collection [12]. For each test graph, we list the number of vertices N, the numbers of edges M, and the number S of superbubbles.

Conclusions
In this contribution, we extended the body of results describing properties of superbubbles, a particular class of induced subgraphs of a digraph. The analysis presented here was motivated by the observation that in principle, all superbubbles in G can be identified in linear time in a single depth-first search, provided the roots of the individual DFS-trees are known beforehand. Our main result is the observation that a suitable set of starting points, which we call quasi-legitimate roots, (1) always exists in every given digraph and (2) can be identified in linear time, using two additional DFSs. In the first pass, a suitable set of cycles is constructed such that every node in G is reachable from a source vertex or one of these cycles. In the second pass, a peculiar structure of "detours" in a cycle C is used to identify quasi-legitimate roots in a given cycle. To this end, we defined a notion of C-reachability that may also be interesting in its own right to characterize (short) cycles.
A comparison of running times of Directbubble and previous approaches shows that practically useful performance gains are obtained essentially from two sources: (1) we dispense with the construction of auxiliary graphs and (2) we can avoid most of the processing for all vertices reachable from a source in G. In practice, we observed a speedup of about a factor of three on most, but not all, benchmark cases. In all cases, Directbubble performed at least as well as all competing algorithms for superbubble detection.
Author Contributions: F.G. and P.F.S. designed the study, developed the theoretical results, and wrote the manuscript. F.G. implemented the algorithm and evaluated its performance.

Conflicts of Interest: The authors declare no conflict of interest.