Abstract
Superbubbles are a class of induced subgraphs in digraphs that play an essential role in assembly algorithms for high-throughput sequencing data. They are connected with the remainder of the host digraph by a single entrance and a single exit vertex. Linear-time algorithms for the enumeration of superbubbles have recently become available. Current approaches require the decomposition of the input digraph into strongly-connected components, which are then analyzed separately. In principle, a single depth-first search (DFS) could be used, provided one can guarantee that the root of the DFS-tree is not itself located in the interior or at the exit point of a superbubble. Here, we describe a linear-time algorithm to determine suitable roots for a DFS-forest that is guaranteed to identify the superbubbles in a digraph correctly. In addition to the advantages of a more straightforward implementation, we observe a nearly three-fold gain in performance on real-world datasets. We present a reference implementation of the new algorithm that accepts many commonly-used input formats for digraphs. It is available as open source from GitHub.
1. Introduction
Bubble structures in a digraph have become the focus of an increasing body of research because of their role in genome assembly and related topics; see, e.g., [] and the references therein. Onodera et al. [] proposed superbubbles as an important class of subgraphs in the de Bruijn and overlap digraphs arising in the context of the assembly of high-throughput sequencing data [,]. The algorithm identifying all superbubbles in a digraph G with vertex set V and edge set had a running time []. An improvement to was described in []. A linear time algorithm for an acyclic subgraph together with the construction of auxiliary digraphs along the lines of [] provided a solution in , i.e., linear, overall time []. An alternative linear-time algorithm [] achieves a substantial speedup and does not require sophisticated data structures. All these approaches rely on the decomposition of G into its strongly-connected components and require the construction of intermediate auxiliary digraphs. Here, we show that the subdivision of the problem, as well as the construction of auxiliary digraphs, can be avoided. This additional simplification yields a further performance gain.
2. Theory
2.1. Oriented Trees and DFS-trees
A directed graph (digraph) G consists of a vertex set and a set of directed edges such that implies . H is a sub-digraph of G if and . is the subgraph induced by W if and if and only if and .
An oriented tree T is a connected digraph in which there is a single vertex, called the root, with in-degree zero, and every other vertex has in-degree one. The vertices with out-degree zero are the leaves. Given an edge , we say u is the parent of v, in symbols , while v is a child of u. By definition, there is a unique directed path from r to . The ancestor partial order ⪯ on is defined by if and only if v is on the path . The least common ancestor (lca) of two vertices is the ≺-minimal vertex in . The subtree rooted in v is the subgraph of T induced by the vertex set , i.e., those that are reachable along the directed paths that contain v.
We assume that G is endowed with an arbitrary order of out-neighbors for each . We say that is a prior subtree of when u and v are both children of a common parent and u comes before v in the local ordering of the out-neighborhood of that parent. Now, consider two vertices u and v such that u and v are incomparable w.r.t. the ancestor order and set . Note that u, v, and w are pairwise distinct. Let x and y be the children of w such that and . Then, we say that u is prior to v, in symbols , if is a prior subtree of , i.e., x comes before y in the local ordering of the out-neighborhood of w. The relation ◃ is a partial order known as the sibling partial order of T. The ancestor and the sibling orders are orthogonal, i.e., for any pair of vertices, exactly one of the relations , , , , or is true.
It is well known that the two fundamental traversal orders of trees are obtained as the two natural compositions of the ancestor and the sibling partial orders. Denote by and the order in which vertices are reported in preorder and postorder traversal, respectively. We have:
It follows immediately that preorder and postorder together determine the ancestor and sibling order:
Let G be a digraph. For every vertex , denote by the subset of vertices that are reachable from r, i.e., for which there is a directed path from r to . These paths can be chosen such that every is reachable from r along a unique path, and hence, there is an oriented tree T with that is a subgraph of G. An oriented tree T with root r is a search tree on G if there is no directed edge with and . An ordered tree T is a search tree if and only because a vertex by definition cannot be reached from anywhere in , and thus also not from the root, while every is by definition reachable from the root r.
Depth-first search (DFS) traverses a digraph G in the following manner: (i) pick a root ; (ii) recursively, at , proceed to the ◃-smallest previously-unvisited out-neighbor of v; (iii) if v has no more unvisited out-neighbors, return to its “parent”, i.e., the vertex from which v was initially reached []. Clearly, DFS generates a rooted tree T with directed edges , which is known as the DFS-tree.
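Steps (i)–(iii) can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation: the adjacency-dictionary representation and the name `dfs_orders` are assumptions made here, and the local order of out-neighbors is taken to be the order of each adjacency list.

```python
def dfs_orders(adj, root):
    """Iterative DFS from `root`.

    adj[v] lists the out-neighbors of v in their local order (assumed
    representation). Returns (parent, preorder, postorder), restricted to
    the vertices reachable from `root`.
    """
    parent = {root: None}
    preorder, postorder = [root], []
    stack = [(root, iter(adj.get(root, ())))]
    while stack:
        v, it = stack[-1]
        for w in it:
            if w not in parent:
                # step (ii): descend to the first unvisited out-neighbor
                parent[w] = v
                preorder.append(w)
                stack.append((w, iter(adj.get(w, ()))))
                break
        else:
            # step (iii): no unvisited out-neighbor left, return to the parent
            postorder.append(v)
            stack.pop()
    return parent, preorder, postorder
```

The `parent` map encodes the tree edges of the DFS-tree; `preorder` and `postorder` list the vertices in the order in which they are discovered and completed, respectively.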
Lemma 1.
Let T be the ordered subtree generated by DFS on a digraph G, and let with . Then, and either (including ), , or . In particular, T is a search tree on G.
Proof.
Consider a DFS reaching u. The search steps up to only after exhausting all out-neighbors of u; hence, any edge either has been visited before by the DFS process or, otherwise, it is included as an edge as DFS steps down to the subtree of u rooted in v. If v has been accessed before, then v is either an ancestor or descendant of u or u and v are incomparable w.r.t. ≺. In the latter case, there are distinct children x and y of such that and . In a DFS, is traversed before if y comes before x in the out-neighbor order of , and thus, .
By the construction of DFS, is reachable from the root r along a path in G; hence, . Suppose there is . Along a path p from r to x, let be the first vertex not reachable from , i.e., there is an edge with and , contradicting the first assertion of the lemma. □
The DFS process proceeds on in such a way that the preorder of the DFS-tree T rooted at r records the order in which the vertices are discovered, while the postorder describes the order in which vertices are completed, i.e., “left”, by ascending back to their parent. To see this, denote by and the order in which vertices are discovered and completed by DFS started at r. By construction, DFS accesses the out-neighbors of v in ◃ order of the children of v and completes the traversal of a subtree rooted at a child of v before proceeding to the subtree of another child. Thus, if u and v are incomparable w.r.t. ≺ in T, then and if and only if in the sibling order. It also follows directly from the definition of DFS that we have if and if . Hence, and indeed coincide with the preorder and the postorder for the traversal of DFS tree T. DFS on a graph G is therefore completely described by the oriented DFS tree T, i.e., the sibling and ancestor order on , and coincides with DFS on T itself.
Hence, the condition that v has been accessed before u can be expressed simply as . If u and v are comparable on T, their relative order is determined by Equation (2). We therefore obtain the following simple characterization of DFS-trees:
Corollary 1.
A search tree T with postorder ρ on G is a DFS-tree if and only if an edge is either (i) a tree edge, (ii) an edge connecting two non-adjacent comparable vertices in T, or (iii) whenever u and v are incomparable w.r.t. ≺ in T.
As a consequence, we have the following classification of edges w.r.t. a DFS-tree. is a:
- (i)
- tree edge iff ;
- (ii)
- forward edge iff and , i.e., and ;
- (iii)
- back edge iff , i.e., and ;
- (iv)
- cross edge iff , i.e., and .
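The four edge classes can be computed directly from discovery and completion indices. The sketch below is an illustration under assumptions: the function name `classify_edges` and the adjacency-dictionary input are hypothetical, the recursion is only suitable for small graphs, and self-loops (excluded in the definition of digraphs used here) would be reported as back edges.

```python
def classify_edges(adj, root):
    """Label every edge leaving a vertex reachable from `root`."""
    pre, post, parent = {}, {}, {root: None}

    def visit(v):
        pre[v] = len(pre)              # discovery index (preorder)
        for w in adj.get(v, ()):
            if w not in pre:
                parent[w] = v
                visit(w)
        post[v] = len(post)            # completion index (postorder)

    visit(root)
    kinds = {}
    for u in pre:
        for v in adj.get(u, ()):
            if parent[v] == u:
                kinds[(u, v)] = "tree"
            elif pre[u] < pre[v]:
                kinds[(u, v)] = "forward"   # v is a non-child descendant of u
            elif post[u] <= post[v]:
                kinds[(u, v)] = "back"      # v is an ancestor of u
            else:
                kinds[(u, v)] = "cross"     # v was completed before u was discovered
    return kinds
```

An edge whose head is discovered later than its tail always leads into the subtree of the tail, so only the completion index is needed to separate back edges from cross edges.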
2.2. Weak Superbubbloids
Superbubbles [] are a complex generalization of “bubbles”. Comprising two or more isolated paths connecting a source s to a target t, bubbles are the simplest obstacle in sequence assembly problems []. We use here the terminology of []:
Definition 1.
Let G be a digraph, and let be an ordered pair of distinct vertices. Denote by the set of vertices reachable from s without passing through t, and write for the set of vertices from which t is reachable without passing through s. Then, the subgraph induced by is a superbubbloid in G if the following three conditions are satisfied:
- (S1)
- , i.e., t is reachable from s (reachability condition).
- (S2)
- (matching condition).
- (S3)
- is acyclic (acyclicity condition).
We call s, t, and the entrance, exit, and interior of the superbubbloid. We denote the induced subgraph by if it is a superbubbloid with entrance s and exit t.
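Definition 1 can be checked by brute force for a given pair (s, t). The following sketch is not the algorithm of this paper; it only restates the three conditions under explicit assumptions: the graph is an adjacency dictionary, "without passing through t" is read as "t may be reached but is not expanded", and `weak=True` ignores a possible edge (t, s) in the acyclicity test. All function names are hypothetical.

```python
from collections import deque

def _reach(adj, start, barrier):
    """Vertices reachable from `start`; `barrier` may be reached but is never expanded."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        if v == barrier:
            continue
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def is_superbubbloid(adj, s, t, weak=False):
    if s == t:
        return False
    radj = {}                                   # reversed digraph
    for u, outs in adj.items():
        for v in outs:
            radj.setdefault(v, []).append(u)
    forward = _reach(adj, s, t)                 # reachable from s, not through t
    backward = _reach(radj, t, s)               # reaching t, not through s
    if t not in forward or forward != backward: # (S1) and (S2)
        return False
    # (S3): Kahn-style acyclicity test on the induced subgraph
    edges = [(u, v) for u in forward for v in adj.get(u, ())
             if v in forward and not (weak and (u, v) == (t, s))]
    indeg = {v: 0 for v in forward}
    for _, v in edges:
        indeg[v] += 1
    ready = [v for v in forward if indeg[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for x, v in edges:
            if x == u:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return removed == len(forward)
```

The check costs time proportional to the graph size for every candidate pair, which is why the interval-based characterization developed below is needed for an efficient enumeration.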
The reachability and matching conditions can equivalently be expressed in the following form, which usually is more convenient to use:
Lemma 2
([]). Let G be a digraph, , and . Then, U equals the set of Definition 1 and satisfies (S1) and (S2) if and only if the following conditions (S.i)–(S.iv) are satisfied. Moreover, U forms a superbubbloid with entrance s and exit t if and only if (S.i)–(S.vi) are satisfied:
- (S.i)
- Every is reachable from s.
- (S.ii)
- t is reachable from every .
- (S.iii)
- If and , then every path contains s.
- (S.iv)
- If and , then every path contains t.
- (S.v)
- If is an edge in , then every path in G contains both t and s.
- (S.vi)
- G does not contain the edge .
If only (S.i)–(S.v) hold, then is a weak superbubbloid.
A (weak) superbubble is a (weak) superbubbloid that is minimal in the following sense:
Definition 2.
A (weak) superbubbloid is a (weak) superbubble if there is no such that is a (weak) superbubbloid.
Weak superbubbles differ from superbubbles only by (S.vi), which can be checked in constant time for each candidate (weak) superbubble. The effort to recognize superbubbles and weak superbubbles is therefore essentially the same.
The following observation, which summarizes and slightly generalizes our previous analysis [], forms the basis of the present contribution. As in previous work on the topic [,,], DFS-trees are a key ingredient.
Lemma 3.
Let G be a digraph and the vertex set of a weak superbubbloid in G, and suppose r is not an interior vertex or the exit of . Then, either or .
Proof.
Every digraph can be decomposed into strongly-connected components and acyclic components. If , then every vertex reachable from x is also contained in . Thus, in particular, every strongly-connected component of G is either contained in or disjoint from . Sung’s theorem ([] and [] (Thm.1)) ensures that every superbubbloid is either contained in a strongly-connected component C or an acyclic component A of G. Now, suppose , and let be the first vertex of the DFS in . By definition (of weak superbubbloids) , since no other vertex in is reachable from outside , and the DFS by assumption does not start at an interior vertex or the exit of . The reachability condition (S.i) ensures that every is reached by the DFS whenever , i.e., . □
Lemma 3 is a variant of the key theorem of [].
Corollary 2.
Let be a digraph and the vertex set of a weak superbubbloid in G. Let be such that none of the are an interior or an exit vertex of . Set and . Then, either or .
Proof.
By Lemma 3, is either contained in the intersection of two or more reachable sets or is disjoint from it. As an immediate consequence, it is also either contained in the difference of two reachable sets or disjoint from it. □
Lemma 4.
Let G be a digraph; let be the vertex set of a weak superbubbloid in G; let T be a DFS-tree on G with root ; and let π be the postorder w.r.t. T. Then:
- (i)
- The induced subgraph contains no back edges w.r.t. T, except possibly .
- (ii)
- If , then is an interval w.r.t. π.
Proof.
(i) The statement is trivial if is not contained in T. If resides in an acyclic part A of G, there are no back edges because A cannot contain back edges by acyclicity. If is contained in a strongly-connected component C, the proof of Lemma 9 of [] also implies Assertion (i) because the DFS-tree T, in particular, contains a DFS-tree of C as a subtree and back edges in G can only be located within a strongly-connected component.
(ii) Since the DFS generating T enters through s and leaves it through t, the preorder satisfies . Since t is reachable from every , we conclude that any DFS reaches t before completing any ; hence, t precedes any other in postorder, i.e., . Since is not reachable without passing through s, every other vertex in precedes s in postorder, i.e., . Now, suppose there is some with . Then, w must be reachable from s along a directed path that does not pass through t, a contradiction to the definition of weak superbubbloids. Hence, the vertices of a superbubbloid form an interval in postorder of the DFS-tree T. □
Statement (ii) rephrases the key result of [], although we do not need to assume that G is an acyclic digraph. Conceptually, Lemma 4 suggests that it might not be necessary to first identify the strongly-connected components of G [] or to construct acyclic auxiliary digraphs [] in order to find all weak superbubbles. Lemma 2 then ensures that a single DFS-forest is sufficient.
2.3. Superbubble Detection
We next show how to retrieve all weak superbubbles of a digraph G that are located within the induced subgraph of G. To this end, we use a slightly modified version of the algorithm DAGsuperbubble described in []. It was originally designed to operate on acyclic auxiliary graphs with a single source. Thus, it could be assumed that a DFS-tree rooted on this source reached all vertices. Here, we intend to apply it to the unmodified input graph, which is neither acyclic, nor guaranteed to have a single source. It, therefore, needs to be modified to deal appropriately with back edges within the DFS-tree and the existence of vertices outside the DFS-tree. To this end, vertices in that cannot be contained in a superbubble have to be identified. By Lemma 4, there are two possible obstructions for a vertex u: (i) u has an edge that is a back edge in the DFS-tree; (ii) u is incident to an edge or where .
The basic idea of DAGsuperbubble is to identify minimal intervals in reverse postorder of the DFS-tree T that satisfy conditions equivalent to membership in a superbubbloid. These conditions are expressed in terms of a pair of helper functions with the help of reverse postorder on T. As in [], denotes the first vertex (w.r.t. reverse postorder ) in T from which v can be reached. Similarly, is the last child vertex reachable from v.
These functions are extended to intervals on as follows:
In [], we derived a characterization of weak superbubbloids in terms of and for the case of acyclic digraphs. Here, we generalize this condition to general digraphs using the modified definition of and . The difference is that back edges and edges connecting to the outside of the DFS-tree are now taken into account. In either case, the corresponding vertices are marked by or ∞, respectively, to indicate that they cannot be part of superbubbloids.
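To make the role of the two helper functions concrete, the sketch below computes them on the reverse postorder of a DFS-tree and then tests all intervals naively. It is an illustration under several assumptions and not the linear-time procedure: the names `out_parent` and `out_child`, the sentinels `-inf` and `+inf` (also used for vertices without in- or out-edges), and the exact form of the interval test (minimum of `out_parent` over positions s+1..t equals s, maximum of `out_child` over positions s..t-1 equals t) are choices made for this example.

```python
import math

def reverse_postorder(adj, root):
    seen, post = {root}, []
    stack = [(root, iter(adj.get(root, ())))]
    while stack:
        v, it = stack[-1]
        for w in it:
            if w not in seen:
                seen.add(w)
                stack.append((w, iter(adj.get(w, ()))))
                break
        else:
            post.append(v)
            stack.pop()
    return post[::-1]

def helper_values(adj, root):
    order = reverse_postorder(adj, root)
    pos = {v: i for i, v in enumerate(order)}
    ins = {v: [] for v in order}
    outs = {v: [] for v in order}
    bad_in, bad_out = set(), set()
    for u, nbrs in adj.items():
        for v in nbrs:
            if u not in pos:
                if v in pos:
                    bad_in.add(v)            # edge entering from outside the tree
            elif v not in pos or pos[u] >= pos[v]:
                bad_out.add(u)               # back edge (or edge leaving the tree)
                if v in pos:
                    bad_in.add(v)
            else:
                outs[u].append(pos[v])
                ins[v].append(pos[u])
    out_parent = [-math.inf if v in bad_in or not ins[v] else min(ins[v])
                  for v in order]
    out_child = [math.inf if v in bad_out or not outs[v] else max(outs[v])
                 for v in order]
    return order, out_parent, out_child

def candidate_intervals(adj, root):
    """Quadratic scan over all intervals; for illustration only."""
    order, out_parent, out_child = helper_values(adj, root)
    found = []
    for i in range(len(order)):
        for j in range(i + 1, len(order)):
            if min(out_parent[i + 1:j + 1]) == i and max(out_child[i:j]) == j:
                found.append((order[i], order[j]))
    return found
```

For the digraph `{0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [1]}` with root 0, the scan reports the pairs (1, 4), (1, 5), and (4, 5). The pair (1, 5) is a weak superbubbloid that is not minimal; the sketch does not filter for minimality.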
Theorem 1.
Let G be a digraph; let T be a DFS-tree on G with a root r that is not an interior vertex or exit of a weak superbubbloid; and denote by the reverse postorder on T. Then, is a weak superbubbloid in G whose vertex set satisfies if and only if the following conditions are satisfied:
- (F1)
- (predecessor property)
- (F2)
- (successor property)
Proof.
It was shown in [] (theorem 2) that the statement is true for acyclic digraphs. We first note that by Lemma 3, every weak superbubbloid intersecting is contained in , i.e., in . For the purpose of the proof, consider the auxiliary graph with edge set . By construction, is acyclic, and every vertex is in T. Thus, every superbubbloid (with vertex set ) in is characterized by Conditions (F1) and (F2). It is, furthermore, a weak superbubbloid in G if and only if the following conditions hold:
- (i)
- For every , there is no edge such that ;
- (ii)
- For every , there is no edge such that ; and
- (iii)
- without the edge is acyclic.
Only edges not contained in need to be considered for Conditions (i) and (ii), because no such edges exist within due to the assumption that is a weak superbubbloid in . For (iii), only the back edges are of interest. By definition, a back edge creates a cycle in . A back edge with would violate (iii) when or (i) if . Analogously, if and , then (ii) is violated. Thus, a weak superbubbloid cannot contain the head or tail of a back edge. Only for Condition (i), we also need to consider the case that .
(F1) can be satisfied only if for every . Analogously, (F2) can only be true if for all . Hence, it suffices to rule out false positive weak superbubbloids in G by ensuring that every vertex u that violates one of the three conditions also violates (F1) or (F2). This is achieved by setting for a vertex u if there is an edge such that or is a back edge; analogously, we set for all v with an incident edge such that or is a back edge. Equation (3) implements exactly these conditions. Thus, only weak superbubbloids fulfill (F1) and (F2).
Conversely, it suffices to note that by Lemma 4(ii), every weak superbubbloid forms a contiguous interval w.r.t. the postorder of T and, thus, also w.r.t. the reverse postorder of T. □
We denote by Superbubble the algorithm DAGsuperbubble with the modified functions and as described above. By construction, Superbubble identifies minimal intervals of that satisfy (F1) and (F2); see Figure 1 for an illustration and [] for full details. Since the modification of and only amounts to setting additional entries to or ∞, respectively, the performance remains unaffected. According to Theorem 1, the minimal intervals satisfying (F1) and (F2) are exactly the minimal weak superbubbloids and, thus, by definition, the weak superbubbles. Therefore, we have:
Figure 1.
Illustration of the algorithm Superbubble on a digraph G with cycles. The top panel shows the input digraph. The DFS-tree T is rooted at Vertex 1 and covers . The table below gives the values of and as a function of the reverse postorder of T. In the final line, matching pairs of parentheses indicate entrances and exits of the weak superbubbles in . This corresponds to the intervals that fulfill (F1) and (F2).
Corollary 3.
Let G be a digraph, and let T be a DFS-tree on G with a root r that is not an interior vertex or exit of a weak superbubble. Then, Superbubble correctly identifies exactly the weak superbubbles in G whose vertex set satisfies .
It is straightforward to extend this result to a DFS-forest that covers entirely: This forest is constructed by first constructing with root covering . Then, is constructed from a root searching on , and so on; see Lemma 2. This amounts to constructing an auxiliary graph from G by adding an artificial root and out-edges , ,..., , and defining the sibling order of the roots as . The DFS-forest with given roots , ,..., on G is then equivalent to the DFS-tree rooted at on if we define the reverse postorder of such that . We note, furthermore, that the sibling order, i.e., the order in which the roots are used to seed DFS, is arbitrary.
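The artificial-root construction can be sketched as follows. The function name `forest_reverse_postorder` and the use of a fresh Python object as the artificial root are assumptions of this example; the list `roots` is taken as given and is not checked for legitimacy.

```python
def forest_reverse_postorder(adj, roots):
    """Reverse postorder of the DFS-forest seeded with `roots`, in that order."""
    art = object()                  # artificial root; cannot clash with a real vertex
    aug = dict(adj)
    aug[art] = list(roots)          # sibling order of the roots = order of `roots`
    seen, post = {art}, []
    stack = [(art, iter(aug[art]))]
    while stack:
        v, it = stack[-1]
        for w in it:
            if w not in seen:
                seen.add(w)
                stack.append((w, iter(aug.get(w, ()))))
                break
        else:
            post.append(v)
            stack.pop()
    post.pop()                      # the artificial root is completed last; drop it
    return post[::-1]
```

In this sketch, the tree seeded by a later root precedes the trees of earlier roots in the returned order, so that every edge between two trees points forward in the reverse postorder.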
Corollary 4.
Let G be a digraph, and let F be a DFS-forest on G comprising DFS-trees with roots , , none of which is an interior vertex or exit of a weak superbubble. Let be the reverse postorder on F, obtained by concatenating the reverse postorders on the constituent DFS-trees. Then, Superbubble correctly identifies exactly the weak superbubbles in G. Furthermore, given the roots , Superbubble has a running time of .
Proof.
Correctness follows immediately from Corollary 3, the construction of on the auxiliary digraph , and Lemma 2. During the DFS, each out-edge is considered exactly once, and each vertex is traversed twice. The number k of required roots is limited by . For each vertex v, checking whether or requires checking all neighbors only; hence, the total effort is no more than . The linear time complexity of DAGsuperbubble, finally, is proven in []. □
It is important to note that the correctness of Superbubble (Corollary 3) crucially depends on the correct choice of the root r of the DFS-tree. The remaining problem thus is to find a suitable sequence of roots , ,..., .
Definition 3.
A vertex is a legitimate root if for every weak superbubble in G with vertex set , we have either and (in the ancestor order of a search tree with root r), or .
We can summarize the discussion in the following form:
Corollary 5.
The algorithm Superbubble detects all weak superbubbles in G if and only if there is a set of legitimate roots such that the DFS-forest covers .
Corollary 6.
A vertex is a legitimate root if and only if r is neither an interior nor an exit of a weak superbubble.
Proof.
By Corollary 4, a root is legitimate if it is not the exit or an interior vertex of a weak superbubble. Conversely, if r is an interior vertex or the exit of , then a DFS-tree rooted in r either does not reach the entrance s at all, or there is no search tree with root r such that , since by definition of a weak superbubble, the exit t is found before s along every path from r to s. □
Lemma 5.
Let G be a digraph and a source, i.e., a vertex with in-degree zero. Then, v is a legitimate root.
Proof.
Since v is not reachable from any other vertex, it is only reachable by DFS if the traversal starts in v. By the same argument, v is neither an interior vertex, nor the exit of a weak superbubble and, thus, is a legitimate root. □
Unfortunately, there is no guarantee that a digraph G has source vertices, and even if they exist, not every vertex of G is necessarily reachable from them. The task is, therefore, to identify legitimate roots located within strongly-connected components.
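The situation described above is easy to inspect on a concrete graph. The sketch below (hypothetical function name, adjacency-dictionary input assumed) collects the sources, which are legitimate roots by Lemma 5, and lists the vertices that no source can reach and that therefore still need roots from within strongly-connected components.

```python
def sources_and_uncovered(adj):
    vertices = dict.fromkeys(adj)          # insertion-ordered vertex collection
    has_in = set()
    for outs in adj.values():
        for v in outs:
            vertices.setdefault(v)
            has_in.add(v)
    sources = [v for v in vertices if v not in has_in]   # in-degree zero
    seen, stack = set(sources), list(sources)
    while stack:                            # everything reachable from some source
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    uncovered = [v for v in vertices if v not in seen]
    return sources, uncovered
```

If `uncovered` is empty, seeding the DFS-forest with the sources alone suffices.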
2.4. Cycles, -Covers, and -Cuts
Definition 4.
Let G be a digraph. A set is a cycle in G if and . A pair of vertices determines a cycle interval:
By definition, the vertices are pairwise distinct and indexed consecutively along C. Importantly, cycle intervals contain only the interior of the unique path in C connecting the defining endpoints and . Thus, if and for all . The C-distance of two vertices and along a cycle C is the length of the directed path, i.e., the number of edges, from to . More explicitly,
since the number of inner vertices is one less than the number of edges. In particular, for all . The C-distance is not symmetric. Instead, we have for any two vertices . Another useful consequence of the definition of is:
The following implication will be useful later on:
Corollary 7.
Let G be a digraph; let C be a cycle in G; and let . Then, if and only if .
Proof.
If , , and are pairwise distinct, the l.h.s. is true if the path from to is a subpath of the path from to , i.e., and, thus, . The converse is obvious. The statement is trivial for . If , the l.h.s. is always true, while on the r.h.s., we have . For , both the l.h.s. and the r.h.s. are satisfied only if . □
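Cycle intervals and the C-distance can be sketched with index arithmetic on a list of cycle vertices. The representation is an assumption of this example: the cycle is a list in cyclic order, vertices are addressed by their index, and for identical endpoints the interval is taken to be the whole cycle without that vertex.

```python
def cycle_interval(cycle, i, j):
    """Inner vertices of the path along `cycle` from cycle[i] to cycle[j]."""
    n = len(cycle)
    steps = (j - i) % n or n        # i == j: walk once around the cycle
    return [cycle[(i + k) % n] for k in range(1, steps)]

def c_distance(cycle, i, j):
    """Number of edges on that path: one more than the number of inner vertices."""
    return len(cycle_interval(cycle, i, j)) + 1
```

With `cycle = list("abcde")`, the interval from index 0 to index 3 is `["b", "c"]` and the one from 3 to 0 is `["e"]`; the two distances, 3 and 2, add up to the cycle length.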
In the following, it will be useful to know whether two vertices on a cycle are also reachable via a directed path that is disjoint from C. We formalize this idea as a binary relation on C.
Definition 5.
Let G be a digraph and C a cycle in G and . Then, t is C-reachable from s, in symbols , if there is a path such that and for .
We have used the letters s and t here since C-reachability will be used to identify potential candidates for entrance and exit of superbubbles. C-reachability is defined not only for vertices in the “reference cycle” C. It satisfies a restricted transitivity property: If , , and , then . Another interesting observation is that implies that there is a directed cycle such that . As an immediate consequence of Definition 5, we obtain:
Lemma 6.
Let G be a digraph, C a cycle in G, such that , and . Then, and are connected by (at least) two edge-disjoint directed paths. In particular, .
Definition 6.
Let C be a cycle in the digraph G and . Then, is -covered if .
As an immediate consequence of the definition, implies that is covered, while nothing is covered if .
Consider two C-intervals and on the same cycle C of the digraph G. We say that is included in if , and are disjoint if , and extends if and . In particular, if extends , then , since the interval boundaries themselves are not considered part of the C-intervals. For each pair of distinct C-intervals, exactly one of the following four statements is thus true: (a) the C-intervals are disjoint; (b) one C-interval is contained in (i.e., a proper subset of) the other one; (c) one C-interval, say , extends the other one, but not vice versa, i.e., and ; (d) both C-intervals extend each other, i.e., . Figure 2 illustrates the four cases. Note that in Case (c), the interval boundaries are arranged in the order along the cycle, while in Case (d), the arrangement is along C.
Figure 2.
Relationships of distinct C-intervals. The four possibilities for the relative location of two distinct C-intervals are shown on a linear layout of a cycle C with five vertices (). Left top: the C-intervals and are disjoint. Right top: includes . Left bottom: extends , but not vice versa. Right bottom: and extend each other. Together, the two C-intervals cover C.
In the following, we will use the notation:
for the set of all -covered intervals and the set of all -covered vertices of C, respectively. Note that since holds for . By the same argument, there is at least one interval for each , albeit some or even all of these may be empty.
Definition 7.
A subset is a -cover of C if , and is a total -cover of C if . We say that C is totally -covered if C has a total -cover.
Note that C is totally -covered if and only if .
Definition 8.
A vertex in is a -cut vertex.
Obviously, C is either totally -covered or it has a non-empty set of -cut vertices.
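Given the covered intervals of a cycle, the cut vertices are simply the vertices that lie in none of them. In the sketch below, the covered intervals are assumed to be supplied as pairs of endpoint indices into the cycle list; how these pairs are obtained from C-reachability is not shown, and the function name is hypothetical.

```python
def cut_vertices(cycle, covered_pairs):
    """Vertices of `cycle` not contained in any covered interval."""
    n = len(cycle)
    covered = set()
    for i, j in covered_pairs:
        steps = (j - i) % n or n                   # i == j: once around the cycle
        covered.update((i + k) % n for k in range(1, steps))  # inner indices only
    return [v for idx, v in enumerate(cycle) if idx not in covered]
```

An empty result means that the cycle is totally covered.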
Definition 9.
Let C be a cycle in the digraph G. A -cover of C is clean if and implies .
In other words, in a clean -cover, no -covered interval is contained within another one.
Corollary 8.
Let C be a cycle in the digraph G, and let be a clean -cover. Then, either or, for every , .
Proof.
Recall that if and only if . Thus, if and only if there is no with . Since the empty set is a subset of every other set, for every unless . □
Lemma 7.
Let C be a cycle in the digraph G. Then, contains a clean -cover .
Proof.
Let be a set of -covered intervals that together -cover . Suppose is not clean. Then, there are two intervals and such that . Then, still -cover . The removal of such redundant intervals can be repeated until no further removable interval can be found. By Definition 9, the remaining -cover is clean. □
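The removal step used in this proof can be sketched directly. As an assumption of this example, intervals are again given as pairs of endpoint indices on a cycle of length `n`, and an interval is dropped whenever its vertex set is a proper subset of another one; the names are hypothetical.

```python
def interval_set(n, i, j):
    """Inner indices of the cycle interval from index i to index j."""
    steps = (j - i) % n or n
    return frozenset((i + k) % n for k in range(1, steps))

def clean_cover(n, pairs):
    """Keep only intervals that are not properly contained in another one."""
    pairs = list(dict.fromkeys(pairs))             # drop duplicate pairs
    sets = {p: interval_set(n, *p) for p in pairs}
    return [p for p in pairs if not any(sets[p] < sets[q] for q in pairs)]
```

Because proper containment is transitive, filtering all contained intervals at once yields the same result as removing them one at a time, and the set of covered vertices is unchanged.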
Definition 10.
Let C be a cycle in a digraph G. Then:
By definition, consists of all -covered intervals for which there is no larger -covered interval with the same starting point. Since every is contained in an interval with the same starting point, is a -cover of C. Thus, Lemma 7 implies:
Corollary 9.
Let C be a cycle in a digraph G. Then, there is a clean cover .
Lemma 8.
Let C be a cycle in the digraph G. A clean -cover of C is total if and only if and every is extended by at least one .
Proof.
If , then , and thus, is not total. In the following, we assume is a clean -cover. Suppose, for contradiction, that is not extended by any . Then, any interval -covering v would have to contain , contradicting the assumption that is clean. Thus, v is a -cut vertex, and hence, is not total. If , therefore it is non-empty, and every is extended by some .
Conversely, suppose c is a -cut vertex of C. If for some u, then the first part of the proof implies that is not extended by any . If contains no interval , then consider the vertex v such that for which is minimal. Since c is a -cut vertex, there is no extension of , since any such extension would either contradict the minimality of or -cover c, thereby contradicting the assumption that c is a -cut vertex. Thus, v is again a -cut vertex. As shown in the first part of the proof, ; therefore, it is not extended by any . We conclude that unless is a total -cover or , there is an interval without an extension. □
Figure 3 shows an example of a cycle with a total -cover and a cycle with a -cut, respectively. Since the largest -interval in C is for some v, every total -cover comprises at least two -covered intervals.

Figure 3.
-covers. (a) The green cycle C in the top panel has five -paths indicated in red. In the middle panel, C is laid out linearly to emphasize the -covered intervals. Below, the clean -cover obtained by removing all -intervals that are contained in longer ones. Note that every -interval is extended by another one; hence, the -cover is total. (b) Again, the top panel highlights C in green and the -paths in red. The linear layout below highlights that Vertex 1 is not -covered. Thus, it is a -cut vertex.
The following lemma provides us with a convenient way to obtain a total -cover.
Lemma 9.
Let C be a cycle in G, , and with . Then, , , , and imply that is a total clean -cover of C.
Proof.
By construction, and are -covered intervals. By definition, we have , and extends . Since , the two intervals cover all of C. Furthermore, the cover consists of only two intervals that are not subsets of each other; thus, it is clean. □
We will refer to this type of total clean -cover as a single-vertex cover of C. An example is shown in Figure 4.
Figure 4.
One-vertex cover. As in Figure 3, the cycle C and the -paths are highlighted in green and red, respectively. The paths and imply that and are -covered. It is a one-vertex cover conforming to Lemma 9.
2.5. Cycles, -Cover, -Cuts, and Superbubbles
A key result of [] states that every superbubble is either contained in or disjoint from any strongly-connected component. The following results on the interaction of cycles and superbubbles are a generalization of this observation. The acyclicity condition (S.v) can be restated in the following way:
Lemma 10.
Let be a weak superbubbloid in the digraph G and . Then, every cycle containing u also contains s and t.
Proof.
If , then all in-neighbors of u are contained in . Similarly, if , then all out-neighbors of u are contained in . Since every cycle through u contains both in- and out-neighbors of u, it, in particular, contains an edge e in . (S.v) now implies any cycle through e contains both s and t. □
Lemma 11.
Let C be a cycle in the digraph G, and let be a total clean -cover of C. If , then v is neither an interior, nor an exit of a weak superbubble, i.e., v is a legitimate root.
Proof.
Assume, for contradiction, that v is an interior or the exit of the superbubble . Since C is totally -covered by assumption, Corollary 8 implies . Thus, by Lemma 6, there are (at least) two edge-disjoint paths from u to v. Since neither path can leave before passing through s, neither of them contains the entrance s, and hence, both are contained in the weak superbubble. Thus, contains .
Since is a total clean -cover of C, there is an interval that extends , i.e., , and hence, p is an inner vertex of . Therefore, contains , and it again has an extending -interval. Repeating the argument, we conclude that every vertex of is an inner vertex of . Since the cover is total, , i.e., the cycle C consists entirely of interior vertices of , i.e., C is a proper subset of . This contradicts the acyclicity condition (S.v). □
Corollary 10.
Let C be a cycle in the digraph G. Suppose C is totally -covered, and let such that for all . Then, v is a legitimate root.
Proof.
The longest -interval by construction cannot be contained within another -interval. Therefore, is contained in the clean cover of Corollary 9. By Lemma 11, its endpoint v is a legitimate root. □
Let us now turn to cycles with -cut vertices:
Lemma 12.
Let C be a cycle of the digraph G, and let c be a -cut point of C, i.e., . Then, c is not an interior vertex of any weak superbubble.
Proof.
Assume, for contradiction, that c is an interior vertex of a weak superbubble . Then, there is a path p from s to t not passing through c. Otherwise, is a superbubbloid, contradicting the assumption that is a weak superbubble; see corollary 5 in []. Along p, let u be the last vertex on C before c, and let v be the first vertex on C after c. Thus, . Therefore, c is -covered in C, a contradiction. □
The example in Figure 5 shows that it is possible that every entrance of a superbubble is at the same time the exit of another superbubble. Such graphs do not have any legitimate root. Nevertheless, it is easily possible to obtain all the superbubbles. To this end, fix a -cut vertex c for some cycle C in G, and consider the auxiliary digraph obtained from G by splitting c into two vertices and so that retains only the in-edges and retains only the out-edges.
Figure 5.
A digraph G without any legitimate root. In G are 16 isomorphic cycles containing eight of the 12 vertices, all of which contain . The superbubbles , , , and cover G entirely, i.e., every entrance of a superbubble is also the exit of another one, and all other vertices are interior vertices of a superbubble.
Lemma 13.
Let C be a cycle in the digraph G, a -cut vertex, and the digraph obtained from G by splitting c. If is a weak superbubble in G, then it is also a weak superbubble in , where c as an entrance in G corresponds to in and c as an exit in G corresponds to in . Conversely, every weak superbubble with in is also a weak superbubble in G.
Proof.
For the proof, we construct the auxiliary graph by inserting the edge into . Then, there is a 1-1 relationship between the set of paths in G and the set of paths that do not start or end with the edge in , which is constructed as follows: If p starts at c in G, it starts in in ; if p ends at c in G, it ends at in ; and if p runs through c in G, then it runs through the edge in . The 1-1 correspondence of weak superbubbles now follows immediately from the equivalence of the path systems in G and since reachability is the same for every pair , with c as the starting point corresponding to and c as the endpoint corresponding to . Thus, G and have the same superbubbles, except possibly for the ones with in . Now, consider a DFS-tree on rooted in . The edge is not a tree edge and necessarily appears as a back edge. Since c is a -cut vertex, and are not interior vertices of any weak superbubble in . Thus, the edge does not affect any weak superbubble of , and thus, and have the same weak superbubbles, except possibly the ones with . □
The only potential difference between the weak superbubbles of G and is, therefore, the possibility that contains or as an additional weak superbubble. Of course, it is easy to detect and remove the additional weak superbubble. Since is a source in , we can apply Superbubble to and remove the possible spurious weak superbubble in order to obtain the correct set of weak superbubbles of G. In contrast to the auxiliary digraph constructions suggested in [], contains only a single extra vertex instead of doubling the size. More importantly, however, it is not necessary to construct explicitly. Instead, one can modify the DFS starting at c in G in the following manner: when c is encountered for the first time as an out-neighbor of a tree vertex u, then is inserted as with parent u and no further out-neighbors, with only a constant overhead. The algorithm Superbubble applied to extracts the minimal intervals satisfying (F1) and (F2) w.r.t. the reverse postorder of the DFS-tree rooted at , and thus correctly identifies the weak superbubbles of . The modified DFS on G rooted at c by construction yields the same DFS-tree on , and thus the same reverse postorder. Together with setting , , , and , Superbubble operating on the modified DFS-tree thus correctly identifies the weak superbubbles in . We refer to this algorithm, which is equivalent to applying Superbubble to , as Superbubble#.
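The splitting of c can be made concrete with a small sketch. It builds the auxiliary digraph explicitly, which the text notes is unnecessary in practice, so it serves only to illustrate the construction. The adjacency-dictionary representation, the function name `split_vertex`, and the labels `c_src` (the copy that keeps the out-edges and is therefore a source) and `c_snk` (the copy that keeps the in-edges) are assumptions made for this illustration and are not taken from the reference implementation.

```python
def split_vertex(adj, c, c_src, c_snk):
    """Return a copy of the digraph `adj` in which vertex `c` is split.

    adj   : dict mapping every vertex to a list of its out-neighbors
    c     : the vertex to split
    c_src : hypothetical label for the copy that keeps the out-edges of c
    c_snk : hypothetical label for the copy that keeps the in-edges of c
    """
    split = {}
    for v, outs in adj.items():
        # Out-edges of c now leave from c_src; all other tails are unchanged.
        tail = c_src if v == c else v
        # Every edge that pointed to c now points to c_snk.
        split[tail] = [c_snk if w == c else w for w in outs]
    # c_snk has in-edges only, so its out-neighbor list is empty.
    split[c_snk] = []
    return split


# Usage: split vertex "b" of a small cyclic digraph.
g = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
h = split_vertex(g, "b", "b_src", "b_snk")
```

In the resulting digraph, `b_src` has no in-edges and can serve as the root of a DFS, while `b_snk` appears only as a leaf.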
Definition 11.
Let G be a digraph. Then, is a quasi-legitimate root if either:
- (i) r is a source in G,
- (ii) r is the end point of an interval of a total clean -cover of some cycle C in G, or
- (iii) r is a -cut vertex of some cycle C in G.
Our discussion so far can be summarized as:
Corollary 11.
Algorithm Superbubble# correctly identifies the superbubbles in if and only if r is a quasi-legitimate root.
As an immediate consequence of Lemmas 11 and 12, every cycle contains a quasi-legitimate root. Recalling that every vertex in the digraph G can be reached either from a source vertex or from a cycle, we finally obtain:
Theorem 2.
Every digraph G contains a set of quasi-legitimate roots . Given these roots, the algorithm Superbubble# correctly identifies all superbubbles of G in linear time.
It remains to show, therefore, that a suitable set of roots can be identified in linear time. Clearly, this is possible for the sources. For superbubbles that cannot be reached from a source vertex, a suitable set of cycles needs to be identified.
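For the sources, the linear-time claim is easy to illustrate. The following is a minimal sketch, assuming the digraph is given as a dictionary that maps every vertex to a list of its out-neighbors; the function name `sources` is hypothetical.

```python
def sources(adj):
    """Return all vertices without in-edges, in O(|V| + |E|) time."""
    has_in_edge = set()
    for outs in adj.values():
        # Every out-neighbor has at least one in-edge.
        has_in_edge.update(outs)
    return [v for v in adj if v not in has_in_edge]


# Usage: "s" is the only vertex without an in-edge.
example = {"s": ["a"], "a": ["b"], "b": ["a"]}
roots = sources(example)
```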
Lemma 14.
Let be an arbitrary DFS-forest of G with constituent ordered trees rooted at , and let C be a cycle in G. Then, implies , and there is a such that .
Proof.
Let be the first root of F that can reach any vertex of C. Then, by definition of a cycle, . Thus, . Further, let u be the first vertex that is reached from in the DFS. Then, every other vertex of C is reached from u in the DFS. Thus, . □
The same is true for strongly-connected components:
Lemma 15.
[] (corollary 11) Let S be a strongly-connected component in G, and let T be a DFS-tree with . Then, there is a vertex such that . We call v the root of the strongly-connected component S in T.
Our aim is now to find a set of “start cycles” such that every cycle C is reachable from at least one of these start cycles.
Lemma 16.
Let T be a DFS-tree on the digraph G rooted in v, and let W be the set of ≺-maximal vertices w that have an incoming back edge . Then, (i) is contained in a cycle, and (ii) every cycle is satisfied for some .
Proof.
Property (i) is an immediate consequence of the definition of DFS. Now, suppose for some . Then, by construction, none of the vertices along the path from the root v to u have an incoming back edge, and thus, neither u, nor one of its ancestors are contained in a cycle. Thus, if for some cycle , then a vertex exists such that , and thus, . □
Note that if does not contain a cycle. Since the vertex set of every cycle in the digraph G is necessarily contained in one of the constituent trees of a DFS-forest, we immediately obtain:
Corollary 12.
Let F be a DFS-forest on the digraph G, and let W be the set of ≺-maximal vertices w that have an incoming back edge . Then, (i) is contained in a cycle, and (ii) every cycle C in G is satisfied for some and some .
Lemma 17.
A set of cycles from which all cycles in G are reachable can be constructed in time.
Proof.
The DFS-forest F on the digraph G is obtained in time. The set W is easily identified by a preorder traversal of F omitting a subtree as soon as a vertex w has an incoming back edge. The worst-case effort is since we only traverse the forest, not the entire digraph G. Given W and the associated back edges identified in the previous step for each , the cycle is explicitly retrieved by following the parent links of F from back to in time. □
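The construction in the proof can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the digraph is an adjacency dictionary containing every vertex as a key, the function names `dfs_forest` and `start_cycles` are hypothetical, and only one back edge is remembered per target vertex.

```python
def dfs_forest(adj):
    """Iterative DFS over all vertices.

    Returns the tree roots, parent links, child lists, and, for each vertex w
    with an incoming back edge, the tail u of one such back edge (u, w).
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}
    parent, children, back_in, roots = {}, {v: [] for v in adj}, {}, []
    for r in adj:
        if color[r] != WHITE:
            continue
        roots.append(r)
        parent[r] = None
        color[r] = GRAY
        stack = [(r, iter(adj[r]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if color[w] == WHITE:        # tree edge: descend into w
                    color[w] = GRAY
                    parent[w] = v
                    children[v].append(w)
                    stack.append((w, iter(adj[w])))
                    break
                if color[w] == GRAY:         # back edge: w is an ancestor of v
                    back_in.setdefault(w, v)
            else:                            # all out-edges of v examined
                color[v] = BLACK
                stack.pop()
    return roots, parent, children, back_in


def start_cycles(adj):
    """Collect one cycle for each topmost vertex with an incoming back edge."""
    roots, parent, children, back_in = dfs_forest(adj)
    cycles = []
    todo = list(reversed(roots))
    while todo:                              # preorder traversal of the forest
        w = todo.pop()
        if w in back_in:
            # Follow parent links from the tail u of the back edge up to w.
            path = [back_in[w]]
            while path[-1] != w:
                path.append(parent[path[-1]])
            path.reverse()
            cycles.append(path)
            continue                         # omit the subtree below w
        todo.extend(reversed(children[w]))
    return cycles
```

Each returned list starts at w and follows tree edges down to u; the back edge (u, w) closes the cycle.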
Lemma 17 ensures that a sufficient set of cycles can be found in linear time. More precisely, using the sources of G and a quasi-legitimate root in each cycle as roots, the algorithm Superbubble# correctly identifies all superbubbles in G in linear time. It remains to show that we can identify a quasi-legitimate root in a cycle .
2.6. Identification of Quasi-Legitimate Roots
The obvious approach to identify quasi-legitimate roots is to construct a clean -cover. The obvious starting point is since it requires the construction of no more than the -path. This can be achieved in polynomial time, e.g., using an independent DFS-tree rooted at that ignores the edges of C. This naive approach, however, exceeds linear time even for a single cycle.
For , we construct a modified DFS-tree by excluding all other vertices of C from G. By construction, is -reachable from c if and only if contains an in-neighbor of u, i.e., there is an edge with .
For each , we are interested in the vertices and that are -reachable from v and minimize and maximize and . These can be recursively computed on by traversing in postorder. For each , and are obtained by comparing the and values for the out-neighbors of v along T, and the vertices reachable directly from v. More precisely, at each leaf v of , is initialized by the vertex such that and maximizes . At each inner vertex v of , is computed as the vertex that maximizes from the following set of candidates: . The vertex -reachable from c with the maximal value of is thus . The same computations are used for , except that is minimized instead of maximized. The computations of and values of and clearly can be performed in linear time. Repeating this for each , however, will generally exceed linear time since the length is not bounded.
We can mostly reuse the information stored in , however. A crucial observation is the following:
Lemma 18.
Let C be a cycle of the digraph G; consider two distinct cycle vertices ; and let with and . If , then and . Otherwise, forms a one-vertex -cover.
Proof.
For simplicity, we write and . By definition of and , we have (1) , and (2) for every satisfying , we have . Starting from Property (1), Corollary 7 implies . As a consequence, for every , we have . Since is just a constant, implies for all .
First, assume . Then, Corollary 7 implies . The same arguments as for show that implies , which in turn implies for all . Because of Property (2), this implication can be used in particular for every for which might hold. Therefore, the same two vertices minimize and maximize and , and thus, we arrive at and .
Now, suppose . Then, (otherwise, the distances would be equal), and Corollary 7 implies . Since , we obtain . By Lemma 9, is a one-vertex cover of C. □
The use of Lemma 18 is that it allows either to use the and values also for , or we obtain a one-vertex -cover, which immediately provides us with a legitimate root according to Lemma 11. Thus, we need to continue the computation of and only until we encounter a one-vertex cover. Up to this point, the values of and are independent of by Lemma 18.
The difficulty is to compute the and for all correctly. We have already seen above how to handle tree edges. Forward-edges in do not effectively contribute, because the same information (minimization or maximization over values of ) is also propagated stepwise along the tree-edges. Cross edges, on the other hand, could add information. Postorder traversal ensures, however, that the pertinent information at their starting points is already computed in time to include them to compute the correct value, i.e., we simply have to include the cross-edges in the minimization/maximization step.
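The propagation over tree, forward, and cross edges can be illustrated with a simplified sketch. It assumes that the postorder of the modified DFS-tree is already available, that `adj` lists the out-neighbors of the non-cycle vertices, and that `pos` assigns each cycle vertex a numeric position along C; these names and the use of plain positions are assumptions for illustration. Back edges are deliberately skipped here, since their treatment requires the component-based argument that follows in the text.

```python
def extremal_cycle_targets(postorder, adj, pos):
    """For every tree vertex, find the smallest and largest cycle position
    reachable without using back edges.

    postorder : tree vertices in DFS postorder
    adj       : dict mapping each tree vertex to its out-neighbors
    pos       : dict mapping each cycle vertex to its position on C
    Returns a dict v -> (min_pos, max_pos), or None if no cycle vertex is reached.
    """
    best = {}
    for v in postorder:
        candidates = []
        for w in adj[v]:
            if w in pos:
                # Direct edge from v to a cycle vertex.
                candidates.append(pos[w])
            elif best.get(w) is not None:
                # Tree, forward, or cross edge: w was finished earlier in
                # postorder, so its values are already final.
                candidates.extend(best[w])
            # Otherwise w is a back-edge target (not yet processed) or reaches
            # no cycle vertex; it contributes nothing in this sketch.
        best[v] = (min(candidates), max(candidates)) if candidates else None
    return best


# Usage: tree a -> b -> d and a -> e, with a cross edge e -> d.
pos = {"c1": 1, "c3": 3, "c5": 5}
adj = {"a": ["b", "e"], "b": ["d", "c1"], "d": ["c3"], "e": ["d", "c5"]}
values = extremal_cycle_targets(["d", "b", "e", "a"], adj, pos)
```

Because each edge is inspected once, the effort is linear in the number of edges leaving the tree vertices.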
Back edges are problematic when belonging to the same strongly-connected component S as . In this case, they can be reached from a cycle vertex and themselves reach a cycle vertex . Such back edges, therefore, influence which cycle vertices are reachable. To handle this information, S is split into parts that are strongly connected components under the use of -reachability. More precisely, we define a -SCC as a strongly-connected component on the induced subgraph .
Consider the auxiliary graph with vertex set and all edges of , as well as all edges with . Then, c is not contained in a cycle of , and thus, the SCC of are exactly the -SCC and the single vertex c. By construction, is also a DFS-tree for . Thus, Tarjan’s DFS-based SCC-detection algorithm (see Lemma 15) on identifies the -SCC as the SCC of . To mimic the traversal on instead of on , the graph on which was originally defined, it suffices to ignore the back edges leading to the root, i.e., edges of the form for . It is thus not necessary to construct the graph explicitly.
The definitions of and imply:
Corollary 13.
Let C be a cycle in the digraph G; let be a modified DFS-tree rooted at ; and let S be a -SCC with . Then, and are independent of v for every .
This begs the question of whether the v-independent values of and can be obtained while traversing G. A partial answer is provided by:
Corollary 14.
Let C be a cycle in the digraph G; let be a modified DFS-tree rooted at ; and let v be the root of a -SCC. Suppose the values of and are known for . Then, and are obtained correctly by postorder traversal of considering all tree and cross edges.
Proof.
The only missing information could be a back edge with and . Such a back edge cannot exist because v is by assumption the root of a -SCC, and thus, there is no cycle including u, v, and . □
This observation yields a simple solution to obtain the correct entries for and for every : determine the -SCC and its root v, and set and .
Tarjan [] showed that SCC can be found efficiently by DFS. Below, we will modify the approach slightly to operate on a given DFS-tree. We therefore briefly outline Tarjan’s SCC algorithm; for full details, we refer to []: First, the vertices are enumerated in preorder. Then, a postorder traversal is used to compute, for each v, the lowlink , which is recursively defined as:
A cross edge is only included if it is “unfinished”, i.e., if its endpoint w has not been reported as part of a previously-completed SCC. A vertex v is the root of an SCC if . Tarjan’s SCC algorithm now uses a stack to iterate over every vertex of the SCC S to mark them as finished. This cannot be done in the same way in a predefined DFS-tree.
The stack can be replaced, however, by an equally-efficient iterative method: Starting from v with , simply traverse starting at v; report all “unfinished” vertices as members of the SCC; and omit every subtree rooted in a “finished” vertex. To see that this is correct, note that for all , and hence, w is “unfinished” when the postorder traversal encounters v. Lemma 15 implies that there is a path from v to w in T, with and thus also “unfinished”. Thus, if u is “finished”, so are all its descendants, and the subtree does not need to be considered. The only difference from Tarjan’s SCC algorithm is the tree traversal used to retrieve S, which considers every edge of T once and thus runs in a total time of . We summarize the discussion above as:
Lemma 19.
The modified version of Tarjan’s SCC algorithm correctly identifies all strongly-connected components in T in time.
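A minimal sketch of this stack-free variant is given below. It assumes an adjacency dictionary with every vertex as a key; the function name `scc_on_dfs_tree` and the choice to compute the DFS-tree inside the same function are assumptions made to keep the example self-contained. The lowlink update follows the standard formulation of Tarjan's algorithm: children contribute their lowlink, and non-tree edges to unfinished vertices contribute the preorder number of their endpoint.

```python
def scc_on_dfs_tree(adj):
    """Strongly-connected components via lowlinks on a fixed DFS-tree,
    using a subtree traversal instead of Tarjan's vertex stack."""
    # Step 1: ordinary DFS yielding preorder numbers, tree structure, postorder.
    pre, parent, children, postorder = {}, {}, {v: [] for v in adj}, []
    for r in adj:
        if r in pre:
            continue
        pre[r], parent[r] = len(pre), None
        stack = [(r, iter(adj[r]))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in pre:
                    pre[w], parent[w] = len(pre), v
                    children[v].append(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                postorder.append(v)
                stack.pop()

    # Step 2: lowlinks in postorder; report an SCC whenever low[v] == pre[v].
    low, finished, sccs = {}, set(), []
    for v in postorder:
        low[v] = pre[v]
        for w in adj[v]:
            if parent[w] == v:               # tree edge: child is already done
                low[v] = min(low[v], low[w])
            elif w not in finished:          # back, forward, or unfinished cross edge
                low[v] = min(low[v], pre[w])
        if low[v] == pre[v]:                 # v is the root of an SCC
            component, todo = [], [v]
            while todo:                      # traverse the subtree below v
                u = todo.pop()
                if u in finished:
                    continue                 # omit subtrees of finished vertices
                finished.add(u)
                component.append(u)
                todo.extend(children[u])
            sccs.append(component)
    return sccs
```

Every tree edge is visited at most once during the retrieval steps, which is what keeps the additional effort linear.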
Since the correct values of and are computed by postorder traversal of , they are already available when the root v of a -SCC is encountered. Thus, identification of the -SCC and the computation of and can be combined in the same tree traversal. The same tree traversal also guarantees that for every cross edge , we have either (i) u and w in the same -SCC or (ii) the values of and are computed correctly.
Now, consider the vertex along C, and suppose we have not encountered a one-vertex -cover so far. Let be the DFS-tree rooted in that ignores all vertices already included in a previous DFS-tree. As for , we can compute and with along this tree. Then, either equals computed on or for some u such that , depending on which has the smaller value of , and either equals computed on or , depending on which has the larger value of . Note that and do not actually depend on i. In a practical implementation, it is simply stored in dependence of v. The index is only used to keep track of the individual, disjoint DFS-trees rooted in in our arguments.
After processing all vertices of C, we have either found a one-vertex -cover of C, or we know, for every , the largest -covered interval . Thus, we directly conclude:
In particular, we have shown that for each C, or a one-vertex cover can be constructed in linear time.
To detect a quasi-legitimate root, it is necessary to first decide whether C has a total -cover or a non-empty set of -cut vertices exists. To this end, a clean -cover can be used efficiently. Recall that by Lemma 8, every interval in a clean -cover is extended by at least one other interval from the -cover. Since a clean -cover contains at most intervals, it is easy to check in linear time whether a -cut vertex exists: starting from an arbitrary , we initialize the upper bound of the -covered part of C that starts at the successor of u by . For every with , we check whether , in which case a total cover is found, and otherwise, we update x with . If no total cover is found when the intervals are exhausted, then x is a -cut vertex (see the proof of Lemma 8). With the stored, e.g., as array , a total cover or the -cut vertex x is found in operations.
In practice, we do not have access to a clean -cover. However, can be computed in linear time. By Corollary 9, there is a clean -cover . We can thus use the same procedure. The redundant intervals in are, by definition, contained within intervals belonging to , and thus, they do not change the results provided the initial interval is contained in the clean cover . By Corollary 10, this is true for the longest interval . Since contains at most intervals, the longest interval and a cut point or the validation of a total cover can be computed in . When it is a total -cover, the longest interval is contained in a total clean cover, and thus, v is a legitimate root by Lemma 11. Thus, a quasi-legitimate root v can be retrieved in time. The entire procedure is summarized in Algorithm 1.
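The sweep over the intervals can be sketched as follows, under simplifying assumptions that are not taken from the source: the cycle vertices are numbered 0 to n-1 in cycle order, `reach[i]` is the length (in cycle steps, between 1 and n-1) of the longest interval starting at position i, an interval covers only the vertices strictly between its end points, and one-vertex covers have already been handled separately. The function name `root_or_cut` is hypothetical.

```python
def root_or_cut(reach):
    """Decide whether the intervals cover the whole cycle.

    reach[i] : length of the longest interval starting at cycle position i;
               it covers positions i+1, ..., i+reach[i]-1 (taken modulo n).
    Returns ("total", v), where v is the end point of the longest interval,
    or ("cut", x), where x is a position covered by no interval.
    """
    n = len(reach)
    if any(r < 1 or r > n - 1 for r in reach) and n > 1:
        raise ValueError("interval lengths must lie between 1 and n-1")
    # Start from the longest interval, which cannot be contained in another one.
    u = max(range(n), key=lambda i: reach[i])
    x = u + reach[u]                  # first position not yet known to be covered
    for k in range(u + 1, u + n):     # positions in unrolled coordinates
        if k >= x:
            return ("cut", k % n)     # no interval seen so far extends past k
        x = max(x, k + reach[k % n])  # extend the covered part
        if x > u + n:
            return ("total", (u + reach[u]) % n)
    return ("cut", u)                 # the sweep ends exactly at u, so u is uncovered


# Usage: two intervals of length 3 starting at positions 0 and 2 cover a 4-cycle.
result = root_or_cut([3, 1, 3, 1])
```

Each position is inspected once, so the sweep needs a number of steps proportional to the length of the cycle.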
Lemma 20.
Given a cycle C in the digraph G, Algorithm 1 identifies a quasi-legitimate root in C in linear time w.r.t. the size of , the induced subgraph of G reachable from C.
Proof.
The correctness of the algorithm follows from the discussion in the previous paragraphs. The construction of DFS-trees together is linear in the size of since each edge in is considered once. The recursive computation along each is also linear. Since the are disjoint, the total effort is still linear. □
Finally, we note that by construction, no vertex in reaches any cycle disjoint from . Hence, when processing the next cycle , the vertices (and edges) already visited in the context of processing C are irrelevant, and thus, can be disregarded. In other words, the DFS for the next cycle can be performed in the same digraph G, with all previously processed induced subgraphs marked as finished. This ensures an overall linear running time for the identification of starting points for all cycles as in Lemma 17.
Algorithm 1 computes a -cover and determines , as well as a quasi-legitimate root in C.
2.7. Putting It All together
Theorem 3.
Algorithm 2 correctly identifies the superbubbles of a digraph G in linear time.
Proof.
Theorem 2 ensures that for every digraph G, there is a set R of quasi-legitimate roots such that, given R, the algorithm Superbubble# identifies all superbubbles of G in linear time. Every vertex in is reachable from a source or a cycle in G. By Lemma 5, all sources are legitimate roots. Lemma 17 shows that a set of cycles can be constructed in linear time from which all vertices of G can be reached by DFS. Algorithm 1 identifies a quasi-legitimate root in a cycle (Lemma 20). As discussed in the text following Lemma 20, the effort for this step is again linear in the size of G. Algorithm 2 therefore correctly identifies the superbubbles of a digraph G and does so in time. □
Algorithm 2 Identification of all superbubbles in an arbitrary digraph G.
3. Results
We extended the “Linear Superbubble Detection” (https://github.com/Fabianexe/Superbubble) software LSD [] with the new algorithm presented in the previous section. LSD is written in Python and uses the NetworkX package [] to handle graph data structures. Since the same data structures are used, benchmarking the different algorithms provided in LSD allows a fair comparison of running times.
In the implementation, we deviated from the presentation above in two minor details. First, instead of using the reverse postorder of the DFS-tree, we directly used postorder and the corresponding (trivial) redefinitions of the helper functions and . Second, we did not completely separate the determination of the cycles, the identification of the roots, and the identification of the superbubbles. Instead, we performed cycle search, root detection, and superbubble identification immediately for each DFS-tree. Since cycles and superbubbles are necessarily completely contained within the DFS-trees, this does not affect the correctness of the algorithm. As a by-product, we obtained a speedup by a constant factor because cycles reachable within a given DFS-tree were marked as “already processed” in the superbubble detection step and hence were not (superfluously) considered as candidate additional roots.
In order to benchmark the direct detection algorithm in comparison to other linear-time superbubble detection algorithms, we used the same datasets as in our previous work []. In order to guarantee comparability, performance data for all algorithms were computed with the same version of LSD on the same hardware. The results are summarized in Table 1.
Table 1.
Comparison of running times. The five combinations of algorithms compared here are: Db (Directbubble) refers to the new approach described in this contribution. LSD (using the auxiliary graphs and the stack-based superbubble detector) refers to the algorithm proposed in []. S + LSD combines the Sung graphs as auxiliary graphs [] with the LSD stack-based detector plus a post-filter for the false positives. LSD + B uses the LSD graph construction with the range-query-based detector of [], and S + B uses Sung graphs together with the range-query-based detector, as well as the necessary post-filters; see [] for full details. All computations were performed on a 2.5-GHz quad-core Intel Core i7 processor (Turbo Boost up to 3.7 GHz) with 6-MB shared L3 cache and 16 GB of 1600-MHz DDR3L onboard memory. Test datasets were taken from [] and from the Stanford Large Network Dataset Collection []. For each test graph, we list the number of vertices N, the number of edges M, and the number S of superbubbles.
For most datasets, we observed an approximately three-fold speedup of Directbubble compared to LSD. The exception is the Slashdot dataset for which no performance gain was observed.
To understand this outlier, it is necessary to understand the source of the speedup in the other test cases. In a typical case, both Directbubble and LSD performed three depth-first searches: in LSD, they are used to determine SCCs, create auxiliary graphs, and detect superbubbles. Directbubble uses them to identify the cycles, quasi-legitimate roots, and finally the superbubbles. Both need to handle exceptional cases. LSD requires the construction of the Sung graph if an SCC coincides with a connected component of the input graph (rather than being just part of it). Since the Sung graph is twice the size of the SCC, this roughly doubles the running time. Directbubble behaves exceptionally for vertices that are reachable from a source. In this case, the detection of cycles and quasi-legitimate roots in cycles was skipped, yielding a substantial speedup. When a graph had neither an SCC that was also a connected component, nor large subgraphs reachable from a source, then LSD and Directbubble essentially performed the same computations and thus performed very similarly. The Slashdot dataset is such a case. Typically, however, directed graphs have some sources so that Directbubble outperforms its competitors on most real-life graphs.
4. Conclusions
In this contribution, we extended the body of results describing properties of superbubbles, a particular class of induced subgraphs of a digraph. The analysis presented here was motivated by the observation that in principle, all superbubbles in G can be identified in linear time in a single depth-first search, provided the roots of the individual DFS-trees are known beforehand. Our main result is the observation that a suitable set of starting points, which we call quasi-legitimate roots, (1) always exists in every given digraph and (2) can be identified in linear time, using two additional DFSs. In the first pass, a suitable set of cycles is constructed such that every node in G is reachable from a source vertex or one of these cycles. In the second pass, a peculiar structure of “detours” in a cycle C is used to identify quasi-legitimate roots in a given cycle. To this end, we defined a notion of -reachability that may also be interesting in its own right to characterize (short) cycles.
A comparison of running times of Directbubble and previous approaches shows that practically useful performance gains are obtained essentially from two sources: (1) we dispense with the construction of auxiliary graphs and (2) we can avoid most of the processing for all vertices reachable from a source in G. In practice, we observed a speedup of about a factor of three on most, but not all, benchmark cases. In all cases, Directbubble performed at least as well as all competing algorithms for superbubble detection.
Author Contributions
F.G. and P.F.S. designed the study, developed the theoretical results, and wrote the manuscript. F.G. implemented the algorithm and evaluated its performance.
Funding
This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). The authors acknowledge support from the German Research Foundation (DFG) and Universität Leipzig within the program of Open Access Publishing.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Paten, B.; Eizenga, J.M.; Rosen, Y.M.; Novak, A.M.; Garrison, E.; Hickey, G. Superbubbles, Ultrabubbles, and Cacti. J. Comput. Biol. 2018, 25, 649–663.
- Onodera, T.; Sadakane, K.; Shibuya, T. Detecting superbubbles in assembly graphs. In Proceedings of the International Workshop on Algorithms in Bioinformatics, Sophia Antipolis, France, 2–4 September 2013; Darling, A., Stoye, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8126, pp. 338–348.
- Simpson, J.T.; Pop, M. The Theory and Practice of Genome Sequence Assembly. Annu. Rev. Genomics Hum. Genet. 2015, 16, 153–172.
- Baichoo, S.; Ouzounis, C.A. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment. Biosystems 2017, 156–157, 72–85.
- Sung, W.K.; Sadakane, K.; Shibuya, T.; Belorkar, A.; Pyrogova, I. An O(m log m)-time algorithm for detecting superbubbles. IEEE/ACM Trans. Comput. Biol. Bioinf. 2015, 12, 770–777.
- Brankovic, L.; Iliopoulos, C.S.; Kundu, R.; Mohamed, M.; Pissis, S.P.; Vayani, F. Linear-time superbubble identification algorithm for genome assembly. Theor. Comput. Sci. 2016, 609, 374–383.
- Gärtner, F.; Müller, L.; Stadler, P.F. Superbubbles revisited. Algorithms Mol. Biol. 2018, 13, 16.
- Tarjan, R. Depth-First Search and Linear Graph Algorithms. SIAM J. Comput. 1972, 1, 146–160.
- Acuña, V.; Grossi, R.; Italiano, G.F.; Lima, L.; Rizzi, R.; Sacomoto, G.; Sagot, M.F.; Sinaimeri, B. On Bubble Generators in Directed Graphs. In Graph-Theoretic Concepts in Computer Science, 43rd ed.; Bodlaender, H.L., Woeginger, G.J., Eds.; Lecture Notes in Computer Science; Springer: Heidelberg, Germany, 2017; Volume 10520, pp. 18–31.
- Hagberg, A.; Schult, D.A.; Swart, P. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy 2008), Pasadena, CA, USA, 19–24 August 2008; pp. 11–16.
- Gärtner, F.; Höner zu Siederdissen, C.; Müller, L.; Stadler, P.F. Coordinate Systems for Supergenomes. Algorithms Mol. Biol. 2018, 13, 15.
- Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data (accessed on 26 November 2018).
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).