Article

Direct Superbubble Detection

by Fabian Gärtner 1,2,* and Peter F. Stadler 1,2,3,4,5,6,7

1 Competence Center for Scalable Data Services and Solutions Dresden/Leipzig, Universität Leipzig, Augustusplatz 12, D-04107 Leipzig, Germany
2 Bioinformatics Group, Department of Computer Science, Universität Leipzig, Härtelstraße 16–18, D-04107 Leipzig, Germany
3 Interdisciplinary Center for Bioinformatics, German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, and Leipzig Research Center for Civilization Diseases, University Leipzig, D-04107 Leipzig, Germany
4 Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany
5 Institute for Theoretical Chemistry, University of Vienna, Währingerstraße 17, A-1090 Wien, Austria
6 Facultad de Ciencias, Universidad National de Colombia, Sede Bogotá, Colombia
7 Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
* Author to whom correspondence should be addressed.
Algorithms 2019, 12(4), 81; https://doi.org/10.3390/a12040081
Submission received: 7 March 2019 / Revised: 10 April 2019 / Accepted: 12 April 2019 / Published: 17 April 2019

Abstract

Superbubbles are a class of induced subgraphs in digraphs that play an essential role in assembly algorithms for high-throughput sequencing data. They are connected with the remainder of the host digraph by a single entrance and a single exit vertex. Linear-time algorithms for the enumeration of superbubbles have recently become available. Current approaches require the decomposition of the input digraph into strongly-connected components, which are then analyzed separately. In principle, a single depth-first search (DFS) could be used, provided one can guarantee that the root of the DFS-tree is neither an interior vertex nor the exit of a superbubble. Here, we describe a linear-time algorithm to determine suitable roots for a DFS-forest that is guaranteed to identify the superbubbles in a digraph correctly. In addition to the advantages of a more straightforward implementation, we observe a nearly three-fold gain in performance on real-world datasets. We present a reference implementation of the new algorithm that accepts many commonly-used input formats for digraphs. It is available as open source from GitHub.

1. Introduction

Bubble structures in a digraph have become the focus of an increasing body of research because of their role in genome assembly and related topics; see, e.g., [1] and the references therein. Onodera et al. [2] proposed superbubbles as an important class of subgraphs in the de Bruijn and overlap digraphs arising in the context of the assembly of high-throughput sequencing data [3,4]. Their algorithm for identifying all superbubbles in a digraph $G$ with vertex set $V$ and edge set $E$ had a running time of $O(|V|(|V|+|E|))$ [2]. An improvement to $O(|E| \log |E|)$ was described in [5]. A linear-time algorithm for acyclic digraphs, together with the construction of auxiliary digraphs along the lines of [5], provided a solution in $O(|E|+|V|)$, i.e., linear, overall time [6]. An alternative linear-time algorithm [7] achieves a substantial speedup and does not require sophisticated data structures. All these approaches rely on the decomposition of $G$ into its strongly-connected components and require the construction of intermediate auxiliary digraphs. Here, we show that the subdivision of the problem, as well as the construction of auxiliary digraphs, can be avoided. This additional simplification yields a further performance gain.

2. Theory

2.1. Oriented Trees and DFS-trees

A directed graph (digraph) $G$ consists of a vertex set $V = V(G)$ and a set $E = E(G)$ of directed edges such that $(u,v) \in E(G)$ implies $u, v \in V(G)$. $H$ is a sub-digraph of $G$ if $V(H) \subseteq V(G)$ and $E(H) \subseteq E(G)$. $G[W]$ is the subgraph induced by $W$ if $V(G[W]) = W \subseteq V(G)$ and $(u,v) \in E(G[W])$ if and only if $(u,v) \in E(G)$ and $u, v \in W$.
An oriented tree $T$ is a connected digraph in which there is a single vertex $r$, called the root, with in-degree zero, and every other vertex has in-degree one. The vertices with out-degree zero are the leaves. Given an edge $(u,v) \in E(T)$, we say $u$ is the parent of $v$, in symbols $u = par(v)$, while $v$ is a child of $u$. By definition, there is a unique directed path $p_T(v)$ from $r$ to $v \in V(T)$. The ancestor partial order $\preceq$ on $V(T)$ is defined by $u \preceq v$ if and only if $v$ is on the path $p_T(u)$. The least common ancestor (lca) of two vertices $x, y \in V(T)$ is the $\preceq$-minimal vertex in $p_T(x) \cap p_T(y)$. The subtree $T(v)$ rooted in $v$ is the subgraph of $T$ induced by the vertex set $\{v' \in V(T) \mid v' \preceq v\}$, i.e., those vertices that are reachable along directed paths that contain $v$.
We assume that $G$ is endowed with an arbitrary order of out-neighbors for each $v \in V(T)$. We say that $T(u)$ is a prior subtree of $T(v)$ when $u$ and $v$ are both children of a common parent $w = par(u) = par(v)$ and $u$ comes before $v$ in the local ordering of the out-neighborhood of $w$. Now, consider two vertices $u$ and $v$ that are incomparable w.r.t. the ancestor order, and set $w = \mathrm{lca}(u,v)$. Note that $u$, $v$, and $w$ are pairwise distinct. Let $x$ and $y$ be the children of $w$ such that $u \in T(x)$ and $v \in T(y)$. Then, we say that $u$ is prior to $v$, in symbols $u \triangleleft v$, if $T(x)$ is a prior subtree of $T(y)$, i.e., $x$ comes before $y$ in the local ordering of the out-neighborhood of $w$. The relation $\triangleleft$ is a partial order known as the sibling partial order of $T$. The ancestor and the sibling orders are orthogonal, i.e., for any pair of vertices $x, y$, exactly one of the relations $x = y$, $x \prec y$, $y \prec x$, $x \triangleleft y$, or $y \triangleleft x$ is true.
It is well known that the two fundamental traversal orders of trees are obtained as the two natural compositions of the ancestor and the sibling partial orders. Denote by ρ and π the order in which vertices are reported in preorder and postorder traversal, respectively. We have:
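$$\begin{aligned} \rho(x) < \rho(y) \quad &\text{iff} \quad x \triangleleft y \ \text{ or } \ y \prec x \\ \pi(x) < \pi(y) \quad &\text{iff} \quad x \prec y \ \text{ or } \ x \triangleleft y \end{aligned} \tag{1}$$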
It follows immediately that preorder and postorder together determine the ancestor and sibling order:
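$$\begin{aligned} x \triangleleft y \quad &\text{iff} \quad \rho(x) < \rho(y) \ \text{ and } \ \pi(x) < \pi(y) \\ x \prec y \quad &\text{iff} \quad \rho(x) > \rho(y) \ \text{ and } \ \pi(x) < \pi(y) \end{aligned} \tag{2}$$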
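The relationship between the two traversal orders and the two partial orders can be illustrated with a minimal Python sketch. It assumes the oriented tree is given as a dictionary `children` mapping each vertex to its ordered list of children; the function names `traversal_orders` and `relation` are hypothetical and chosen only for this illustration.

```python
def traversal_orders(children, root):
    """Return (rho, pi): 0-based preorder and postorder index of each vertex."""
    rho, pi = {root: 0}, {}
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        v, it = stack[-1]
        child = next(it, None)
        if child is None:
            # All children of v are completed, so v is reported in postorder.
            pi[v] = len(pi)
            stack.pop()
        else:
            # The child is discovered now, so it is reported in preorder.
            rho[child] = len(rho)
            stack.append((child, iter(children.get(child, []))))
    return rho, pi


def relation(x, y, rho, pi):
    """Recover the ancestor/sibling relation of x and y from rho and pi."""
    if x == y:
        return "equal"
    if rho[x] < rho[y] and pi[x] < pi[y]:
        return "x prior to y"   # sibling order: x comes before y
    if rho[x] > rho[y] and pi[x] < pi[y]:
        return "x below y"      # ancestor order: y is an ancestor of x
    if rho[x] < rho[y] and pi[x] > pi[y]:
        return "y below x"      # ancestor order: x is an ancestor of y
    return "y prior to x"       # sibling order: y comes before x


# Small example tree: r has children a and b (in this order), a has child c.
children = {"r": ["a", "b"], "a": ["c"]}
rho, pi = traversal_orders(children, "r")
```

For this example tree, `c` lies below `a`, and `a` is prior to `b`, which is what `relation` reports from the two index maps alone.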
Let $G$ be a digraph. For every vertex $r \in V(G)$, denote by $V[r] \subseteq V(G)$ the subset of vertices that are reachable from $r$, i.e., for which there is a directed path from $r$ to $x \in V[r]$. These paths can be chosen such that every $x \in V[r]$ is reachable from $r$ along a unique path, and hence, there is an oriented tree $T$ with $V(T) = V[r]$ that is a subgraph of $G$. An oriented tree $T$ with root $r$ is a search tree on $G$ if there is no directed edge $(x,y) \in E(G)$ with $x \in V(T)$ and $y \notin V(T)$. An ordered tree $T$ is a search tree if and only if $V(T) = V[r]$, because a vertex $y \in V(G) \setminus V(T)$ by definition cannot be reached from anywhere in $V(T)$, and thus also not from the root, while every $y \in V(T)$ is by definition reachable from the root $r$.
Depth-first search (DFS) traverses a digraph $G$ in the following manner: (i) pick a root $r \in V(G)$; (ii) recursively, at $v \in V(G)$, proceed to the $\triangleleft$-smallest, previously-unvisited out-neighbor of $v$; (iii) if $v$ has no more unvisited out-neighbors, return to its “parent”, i.e., the vertex $par(v)$ from which $v$ was initially reached [8]. Clearly, DFS generates a rooted tree $T$ with directed edges $(par(v), v)$, which is known as the DFS-tree.
Lemma 1.
Let $T$ be the ordered subtree generated by DFS on a digraph $G$, and let $(u,v) \in E(G)$ with $u \in V(T)$. Then, $v \in V(T)$ and either $v \prec u$ (including $(u,v) \in E(T)$), $u \preceq v$, or $v \triangleleft u$. In particular, $T$ is a search tree on $G$.
Proof. 
Consider a DFS reaching $u$. The search steps up to $par(u)$ only after exhausting all out-neighbors of $u$; hence, for any edge $(u,v)$, either $v$ has been visited before by the DFS process or, otherwise, $(u,v)$ is included as a tree edge as DFS steps down to the subtree of $u$ rooted in $v$. If $v$ has been accessed before, then $v$ is either an ancestor or descendant of $u$, or $u$ and $v$ are incomparable w.r.t. $\prec$. In the latter case, there are distinct children $x$ and $y$ of $\mathrm{lca}(u,v)$ such that $u \in T(x)$ and $v \in T(y)$. In a DFS, $T(y)$ is traversed before $T(x)$ if $y$ comes before $x$ in the out-neighbor order of $\mathrm{lca}(u,v)$, and thus, $v \triangleleft u$.
By the construction of DFS, every $x \in V(T)$ is reachable from the root $r$ along a path in $G$; hence, $V(T) \subseteq V[r]$. Suppose there is $x \in V[r] \setminus V(T)$. Along a path $p$ from $r$ to $x$, let $x'$ be the first vertex not contained in $V(T)$, i.e., there is an edge $(u,x') \in E(G)$ with $u \in V(T)$ and $x' \notin V(T)$, contradicting the first assertion of the lemma. □
The DFS process proceeds on $V[r]$ in such a way that the preorder $\rho$ of the DFS-tree $T$ rooted at $r$ records the order in which the vertices are discovered, while the postorder $\pi$ describes the order in which vertices are completed, i.e., “left”, by ascending back to their parent. To see this, denote by $\rho'$ and $\pi'$ the order in which vertices are discovered and completed by DFS started at $r$. By construction, DFS accesses the out-neighbors of $v$ in $\triangleleft$ order of the children of $v$ and completes the traversal of a subtree rooted at a child $v'$ of $v$ before proceeding to the subtree of another child. Thus, if $u$ and $v$ are incomparable w.r.t. $\prec$ in $T$, then $\rho'(u) < \rho'(v)$ and $\pi'(u) < \pi'(v)$ if and only if $u \triangleleft v$ in the sibling order. It also follows directly from the definition of DFS that we have $\rho'(u) < \rho'(v)$ if $v \prec u$ and $\pi'(u) < \pi'(v)$ if $u \prec v$. Hence, $\rho'$ and $\pi'$ indeed coincide with the preorder $\rho$ and the postorder $\pi$ for the traversal of the DFS-tree $T$. DFS on a graph $G$ is therefore completely described by the oriented DFS-tree $T$, i.e., the sibling and ancestor order on $V[r]$, and coincides with DFS on $T$ itself.
Hence, the condition that $v$ has been accessed before $u$ can be expressed simply as $\rho(v) < \rho(u)$. If $u$ and $v$ are comparable on $T$, their relative order is determined by Equation (2). We therefore obtain the following simple characterization of DFS-trees:
Corollary 1.
A search tree $T$ with preorder $\rho$ on $G$ is a DFS-tree if and only if every edge $(u,v) \in E(G[V(T)])$ is either (i) a tree edge, (ii) an edge connecting two non-adjacent comparable vertices in $T$, or (iii) satisfies $\rho(v) < \rho(u)$ whenever $u$ and $v$ are incomparable w.r.t. $\prec$ in $T$.
As a consequence, we have the following classification of edges w.r.t. a DFS-tree. An edge $(u,v) \in E(G[V(T)])$ is a:
(i) tree edge iff $(u,v) \in E(T)$;
(ii) forward edge iff $(u,v) \notin E(T)$ and $v \prec u$, i.e., $\pi(v) < \pi(u)$ and $\rho(v) > \rho(u)$;
(iii) back edge iff $u \preceq v$, i.e., $\pi(v) \ge \pi(u)$ and $\rho(v) \le \rho(u)$;
(iv) cross edge iff $v \triangleleft u$, i.e., $\pi(v) < \pi(u)$ and $\rho(v) < \rho(u)$.
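The edge classification above can be turned into a few comparisons of preorder and postorder indices. The following Python sketch is an illustration only: it assumes the digraph is given as an adjacency dictionary `adj` whose list order encodes the out-neighbor order, and the helper names `dfs_orders` and `classify` are hypothetical.

```python
def dfs_orders(adj, root):
    """Iterative DFS from root; returns preorder rho, postorder pi, parent map."""
    rho, pi, par = {root: 0}, {}, {root: None}
    stack = [(root, iter(adj.get(root, [])))]
    while stack:
        v, it = stack[-1]
        for w in it:
            if w not in rho:          # first unvisited out-neighbor of v
                rho[w] = len(rho)
                par[w] = v
                stack.append((w, iter(adj.get(w, []))))
                break
        else:                          # no unvisited out-neighbor left
            pi[v] = len(pi)
            stack.pop()
    return rho, pi, par


def classify(u, v, rho, pi, par):
    """Classify an edge (u, v) with both endpoints in the DFS-tree."""
    if par[v] == u:
        return "tree"
    if pi[v] < pi[u] and rho[v] > rho[u]:
        return "forward"
    if pi[v] >= pi[u] and rho[v] <= rho[u]:
        return "back"                  # includes self-loops (u == v)
    return "cross"                     # pi[v] < pi[u] and rho[v] < rho[u]


# Example digraph with one edge of every type when searched from "a".
adj = {"a": ["b", "c", "d"], "b": ["d"], "c": ["d"], "d": ["a"]}
rho, pi, par = dfs_orders(adj, "a")
kinds = {(u, v): classify(u, v, rho, pi, par) for u in adj for v in adj[u]}
```

In this example, `(a, d)` is a forward edge, `(c, d)` a cross edge, and `(d, a)` a back edge; the remaining edges are tree edges.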

2.2. Weak Superbubbloids

Superbubbles [2] are a complex generalization of “bubbles”. Comprising two or more isolated paths connecting a source s to a target t, bubbles are the simplest obstacle in sequence assembly problems [9]. We use here the terminology of [7]:
Definition 1.
Let $G$ be a digraph, and let $(s,t)$ be an ordered pair of distinct vertices. Denote by $U_{st}$ the set of vertices reachable from $s$ without passing through $t$, and write $U^{+}_{ts}$ for the set of vertices from which $t$ is reachable without passing through $s$. Then, the subgraph $G[U_{st}]$ induced by $U_{st}$ is a superbubbloid in $G$ if the following three conditions are satisfied:
(S1) $t \in U_{st}$, i.e., $t$ is reachable from $s$ (reachability condition).
(S2) $U_{st} = U^{+}_{ts}$ (matching condition).
(S3) $G[U_{st}]$ is acyclic (acyclicity condition).
We call $s$, $t$, and $U_{st} \setminus \{s,t\}$ the entrance, exit, and interior of the superbubbloid. We denote the induced subgraph $G[U_{st}]$ by $\langle s,t \rangle$ if it is a superbubbloid with entrance $s$ and exit $t$.
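The three conditions of Definition 1 can be checked directly, if inefficiently, for a single candidate pair. The sketch below is a brute-force illustration and not the linear-time method discussed in this article; it assumes an adjacency dictionary `adj`, and all function names are hypothetical.

```python
def reach(adj, start, stop):
    """Vertices reachable from start when paths may end in, but not pass, stop."""
    seen, todo = {start}, [start]
    while todo:
        v = todo.pop()
        if v == stop:
            continue                    # do not continue a path through stop
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                todo.append(w)
    return seen


def reverse(adj):
    """Digraph with all edges reversed."""
    rev = {}
    for u, ws in adj.items():
        for w in ws:
            rev.setdefault(w, []).append(u)
    return rev


def is_acyclic(adj, U):
    """Kahn-style test for acyclicity of the subgraph induced by U."""
    indeg = {v: 0 for v in U}
    for u in U:
        for w in adj.get(u, ()):
            if w in U:
                indeg[w] += 1
    todo = [v for v in U if indeg[v] == 0]
    removed = 0
    while todo:
        u = todo.pop()
        removed += 1
        for w in adj.get(u, ()):
            if w in U:
                indeg[w] -= 1
                if indeg[w] == 0:
                    todo.append(w)
    return removed == len(U)


def is_superbubbloid(adj, s, t):
    """Check (S1) reachability, (S2) matching, and (S3) acyclicity."""
    if s == t:
        return False
    U = reach(adj, s, t)                # U_st
    U_plus = reach(reverse(adj), t, s)  # U+_ts
    return t in U and U == U_plus and is_acyclic(adj, U)


# A simple bubble: two paths from s to t, followed by an edge leaving t.
bubble = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["x"]}
```

Adding an edge from `a` to a vertex outside the bubble breaks the matching condition (S2), and adding the edge `(t, s)` breaks acyclicity (S3).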
The reachability and matching conditions can equivalently be expressed in the following form, which usually is more convenient to use:
Lemma 2 ([7]).
Let $G$ be a digraph, $U \subset V(G)$, and $s, t \in U$. Then, $U$ equals the set $U_{st}$ of Definition 1 and satisfies (S1) and (S2) if and only if the following conditions (S.i)–(S.iv) are satisfied. Moreover, $U$ forms a superbubbloid with entrance $s$ and exit $t$ if and only if (S.i)–(S.vi) are satisfied:
(S.i) Every $u \in U$ is reachable from $s$.
(S.ii) $t$ is reachable from every $u \in U$.
(S.iii) If $u \in U$ and $w \notin U$, then every $w \to u$ path contains $s$.
(S.iv) If $u \in U$ and $w \notin U$, then every $u \to w$ path contains $t$.
(S.v) If $(u,v)$ is an edge in $G[U]$, then every $v \to u$ path in $G$ contains both $t$ and $s$.
(S.vi) $G$ does not contain the edge $(t,s)$.
If only (S.i)–(S.v) hold, then $\langle s,t \rangle$ is a weak superbubbloid.
A (weak) superbubble is a (weak) superbubbloid that is minimal in the following sense:
Definition 2.
A (weak) superbubbloid $\langle s,t \rangle$ is a (weak) superbubble if there is no $s' \in U_{st} \setminus \{s\}$ such that $\langle s',t \rangle$ is a (weak) superbubbloid.
Weak superbubbles differ from superbubbles only by (S.vi), which can be checked in constant time for each candidate (weak) superbubble. The effort to recognize superbubbles and weak superbubbles is therefore essentially the same.
The following observation, which summarizes and slightly generalizes our previous analysis [7], forms the basis of the present contribution. As in previous work on the topic [5,6,7], DFS-trees are a key ingredient.
Lemma 3.
Let $G$ be a digraph and $U_{st}$ the vertex set of a weak superbubbloid $\langle s,t \rangle$ in $G$, and suppose $r$ is not an interior vertex or the exit of $\langle s,t \rangle$. Then, either $V[r] \cap U_{st} = \emptyset$ or $U_{st} \subseteq V[r]$.
Proof. 
Every digraph can be decomposed into strongly-connected components and acyclic components. If $x \in V[r]$, then every vertex reachable from $x$ is also contained in $V[r]$. Thus, in particular, every strongly-connected component of $G$ is either contained in $V[r]$ or disjoint from $V[r]$. Sung’s theorem ([5] and [7] (Thm. 1)) ensures that every superbubbloid is either contained in a strongly-connected component $C$ or an acyclic component $A$ of $G$. Now, suppose $V[r] \cap U_{st} \neq \emptyset$, and let $x \in U_{st}$ be the first vertex of the DFS in $\langle s,t \rangle$. By definition (of weak superbubbloids), $x = s$, since no other vertex in $U_{st}$ is reachable from outside $U_{st}$, and the DFS by assumption does not start at an interior vertex or the exit of $\langle s,t \rangle$. The reachability axiom (S.ii) ensures that every $u \in U_{st}$ is reached by the DFS whenever $s \in V(G)$, i.e., $U_{st} \subseteq V[r]$. □
Lemma 3 is a variant of the key theorem of [5].
Corollary 2.
Let $G = (V,E)$ be a digraph and $U_{st}$ the vertex set of a weak superbubbloid $\langle s,t \rangle$ in $G$. Let $r_1, r_2, \dots, r_k \in V$ be such that none of the $r_i$ are an interior or an exit vertex of $\langle s,t \rangle$. Set $W_j := \bigcup_{i=1}^{j} V[r_i]$ and $V'[r_j] = V[r_j] \setminus W_{j-1}$. Then, either $U_{st} \cap V'[r_j] = \emptyset$ or $U_{st} \subseteq V'[r_j]$.
Proof. 
By Lemma 3, U s t is either contained in the intersection of two or more reachable sets V [ r i ] or is disjoint from it. As an immediate consequence, it is also either contained in the difference of two reachable sets or disjoint from it. □
Lemma 4.
Let $G$ be a digraph; let $U_{st}$ be the vertex set of a weak superbubbloid $\langle s,t \rangle$ in $G$; let $T$ be a DFS-tree on $G$ with root $r \notin U_{st} \setminus \{s\}$; and let $\pi$ be the postorder w.r.t. $T$. Then:
(i) The induced subgraph $G[U_{st}]$ contains no back edges w.r.t. $T$, except possibly $(t,s)$.
(ii) If $U_{st} \subseteq V(T)$, then $\{\pi(u) \mid u \in U_{st}\} = [\pi(t), \pi(s)]$ is an interval w.r.t. $\pi$.
Proof. 
(i) The statement is trivial if $\langle s,t \rangle$ is not contained in $T$. If $\langle s,t \rangle$ resides in an acyclic part $A$ of $G$, there are no back edges because $A$ cannot contain back edges by acyclicity. If $\langle s,t \rangle$ is contained in a strongly-connected component $C$, the proof of Lemma 9 of [7] also implies Assertion (i) because the DFS-tree $T$, in particular, contains a DFS-tree of $C$ as a subtree and back edges in $G$ can only be located within a strongly-connected component.
(ii) Since the DFS generating $T$ enters $\langle s,t \rangle$ through $s$ and leaves it through $t$, the preorder $\rho$ satisfies $\rho(s) < \rho(t)$. Since $t$ is reachable from every $u \in U_{st}$, we conclude that any DFS reaches $t$ before completing any $u \in U_{st}$; hence, $t$ precedes any other $u \in U_{st}$ in postorder, i.e., $\pi(u) > \pi(t)$. Since $u \in U_{st}$ is not reachable without passing through $s$, every other vertex $u \in U_{st}$ precedes $s$ in postorder, i.e., $\pi(s) > \pi(u)$. Now, suppose there is some $w \notin U_{st}$ with $\pi(s) > \pi(w) > \pi(t)$. Then, $w$ must be reachable from $s$ along a directed path that does not pass through $t$, a contradiction to the definition of weak superbubbloids. Hence, the vertices of a superbubbloid form an interval in the postorder of the DFS-tree $T$. □
Statement (ii) rephrases the key result of [6], although we do not need to assume that $G$ is an acyclic digraph. Conceptually, Lemma 4 suggests that it might not be necessary to first identify the strongly-connected components of $G$ [5] or to construct acyclic auxiliary digraphs [6] in order to find all weak superbubbles. Lemma 2 then ensures that a single DFS-forest is sufficient.

2.3. Superbubble Detection

We next show how to retrieve all weak superbubbles of a digraph $G$ that are located within the induced subgraph $G[V[r]]$ of $G$. To this end, we use a slightly modified version of the algorithm DAGsuperbubble described in [7]. It was originally designed to operate on acyclic auxiliary graphs with a single source. Thus, it could be assumed that a DFS-tree rooted on this source reached all vertices. Here, we intend to apply it to the unmodified input graph, which is neither acyclic nor guaranteed to have a single source. It therefore needs to be modified to deal appropriately with back edges within the DFS-tree and the existence of vertices outside the DFS-tree. To this end, vertices in $V[r]$ that cannot be contained in a superbubble have to be identified. By Lemma 4, there are two possible obstructions for a vertex $u$: (i) $u$ has an edge that is a back edge in the DFS-tree; (ii) $u$ is incident to an edge $(x,u)$ or $(u,x)$ where $x \notin V[r]$.
The basic idea of DAGsuperbubble is to identify minimal intervals in the reverse postorder $\bar{\pi}(v) := |V[r]| - \pi(v) - 1$ of the DFS-tree $T$ that satisfy conditions equivalent to membership in a superbubbloid. These conditions are expressed in terms of a pair of helper functions with the help of the reverse postorder $\bar{\pi}(v)$ on $T$. As in [7], $\mathrm{OutParent}(v)$ denotes the first vertex (w.r.t. reverse postorder $\bar{\pi}$) in $T$ from which $v$ can be reached. Similarly, $\mathrm{OutChild}(v)$ is the last child vertex reachable from $v$.
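$$\begin{aligned} \mathrm{OutParent}(v) &:= \begin{cases} -1 & \text{if no } (u,v) \in E(G) \text{ exists} \\ -1 & \text{if there is } (u,v) \in E(G) \text{ with } u \notin V[r] \\ -1 & \text{if there is a back edge } (u,v) \in E(G) \\ \min\{\bar{\pi}(u) \mid (u,v) \in E(G)\} & \text{otherwise} \end{cases} \\ \mathrm{OutChild}(v) &:= \begin{cases} \infty & \text{if no } (v,u) \in E(G) \text{ exists} \\ \infty & \text{if there is } (v,u) \in E(G) \text{ with } u \notin V[r] \\ \infty & \text{if there is a back edge } (v,u) \in E(G) \\ \max\{\bar{\pi}(u) \mid (v,u) \in E(G)\} & \text{otherwise} \end{cases} \end{aligned} \tag{3}$$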
These functions are extended to intervals on π ¯ as follows:
$$\begin{aligned} \mathrm{OutParent}([i,j]) &:= \min\{\mathrm{OutParent}(v) \mid v \in V[r],\ i \le \bar{\pi}(v) \le j\} \\ \mathrm{OutChild}([i,j]) &:= \max\{\mathrm{OutChild}(v) \mid v \in V[r],\ i \le \bar{\pi}(v) \le j\} \end{aligned}$$
In [7], we derived a characterization of weak superbubbloids in terms of $\mathrm{OutParent}([i,j])$ and $\mathrm{OutChild}([i,j])$ for the case of acyclic digraphs. Here, we generalize this condition to general digraphs using the modified definitions of $\mathrm{OutParent}(v)$ and $\mathrm{OutChild}(v)$. The difference is that back edges and edges connecting to the outside of the DFS-tree are now taken into account. In either case, the corresponding vertices are marked by $-1$ or $\infty$, respectively, to indicate that they cannot be part of superbubbloids.
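The modified helper functions can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation: `adj` is an adjacency dictionary for $G$, `rbar` is a dictionary holding the reverse postorder index of every vertex of a single DFS-tree (vertices missing from `rbar` are treated as lying outside $V[r]$), and the function name `out_parent_child` is hypothetical. The sketch uses the fact that an edge $(u,v)$ inside the tree is a back edge exactly when the reverse postorder index of $v$ does not exceed that of $u$.

```python
import math


def out_parent_child(adj, rbar):
    """Compute OutParent and OutChild for every vertex listed in rbar."""
    preds = {v: [] for v in rbar}   # in-neighbors inside the DFS-tree
    outside_in = set()              # vertices with an in-edge from outside
    for u, ws in adj.items():
        for w in ws:
            if w in rbar:
                if u in rbar:
                    preds[w].append(u)
                else:
                    outside_in.add(w)

    out_parent, out_child = {}, {}
    for v in rbar:
        ins, outs = preds[v], adj.get(v, [])
        # -1: no in-edge, an in-edge from outside, or an incoming back edge
        if not ins or v in outside_in or any(rbar[u] >= rbar[v] for u in ins):
            out_parent[v] = -1
        else:
            out_parent[v] = min(rbar[u] for u in ins)
        # inf: no out-edge, an out-edge to the outside, or an outgoing back edge
        if not outs or any(w not in rbar or rbar[w] <= rbar[v] for w in outs):
            out_child[v] = math.inf
        else:
            out_child[v] = max(rbar[w] for w in outs)
    return out_parent, out_child


# Simple bubble with reverse postorder s=0, b=1, a=2, t=3.
adj = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
rbar = {"s": 0, "b": 1, "a": 2, "t": 3}
out_parent, out_child = out_parent_child(adj, rbar)
```

For the bubble in the example, the entrance `s` receives `OutParent = -1` because it has no in-edge, and the exit `t` receives `OutChild = inf` because it has no out-edge.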
Theorem 1.
Let $G$ be a digraph; let $T$ be a DFS-tree on $G$ with a root $r$ that is not an interior vertex or exit of a weak superbubbloid; and denote by $\bar{\pi}$ the reverse postorder on $T$. Then, $\langle s,t \rangle$ is a weak superbubbloid in $G$ whose vertex set $U_{st}$ satisfies $U_{st} \subseteq V[r]$ if and only if the following conditions are satisfied:
(F1) $\mathrm{OutParent}([\bar{\pi}(s)+1, \bar{\pi}(t)]) = \bar{\pi}(s)$ (predecessor property)
(F2) $\mathrm{OutChild}([\bar{\pi}(s), \bar{\pi}(t)-1]) = \bar{\pi}(t)$ (successor property)
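Conditions (F1) and (F2) amount to one range minimum and one range maximum. The following Python sketch checks them naively for a single candidate interval; it assumes that `out_parent` and `out_child` are lists indexed by reverse postorder, and the function name `satisfies_f1_f2` is hypothetical. A linear-time enumeration as in DAGsuperbubble [7] would not rescan each interval in this way.

```python
import math


def satisfies_f1_f2(out_parent, out_child, i, j):
    """Check (F1) and (F2) for the interval with i = pi_bar(s), j = pi_bar(t)."""
    if not 0 <= i < j < len(out_parent):
        return False
    f1 = min(out_parent[i + 1 : j + 1]) == i   # predecessor property
    f2 = max(out_child[i:j]) == j              # successor property
    return f1 and f2


# Values for a simple bubble s -> {a, b} -> t with reverse postorder
# s=0, b=1, a=2, t=3 (s has no in-edge, t has no out-edge).
out_parent = [-1, 0, 0, 1]
out_child = [2, 3, 3, math.inf]
```

Here only the interval `(0, 3)` satisfies both properties, which corresponds to the bubble with entrance `s` and exit `t`.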
Proof. 
It was shown in [7] (Theorem 2) that the statement is true for acyclic digraphs. We first note that by Lemma 3, every weak superbubbloid intersecting $V[r]$ is contained in $V[r]$, i.e., in $V(T)$. For the purpose of the proof, consider the auxiliary graph $\hat{G}[V(T)]$ with edge set $E(\hat{G}[V(T)]) = E(G[V(T)]) \setminus \{e \mid e \text{ is a back edge w.r.t. } T\}$. By construction, $\hat{G}[V(T)]$ is acyclic, and every vertex is in $T$. Thus, every superbubbloid $\langle s,t \rangle$ (with vertex set $U_{st}$) in $\hat{G}[V(T)]$ is characterized by Conditions (F1) and (F2). It is, furthermore, a weak superbubbloid in $G$ if and only if the following conditions hold:
(i) For every $u \in U_{st} \setminus \{s\}$, there is no edge $(x,u) \in E(G)$ such that $x \notin U_{st}$;
(ii) For every $v \in U_{st} \setminus \{t\}$, there is no edge $(v,x) \in E(G)$ such that $x \notin U_{st}$; and
(iii) $G[U_{st}]$ without the edge $(t,s) \in E(G)$ is acyclic.
Only edges not contained in $\hat{G}[V(T)]$ need to be considered for Conditions (i) and (ii), because no such edges exist within $\hat{G}[V(T)]$ due to the assumption that $\langle s,t \rangle$ is a weak superbubbloid in $\hat{G}[V(T)]$. For (iii), only the back edges are of interest. By definition, a back edge creates a cycle in $\hat{G}[V(T)]$. A back edge $(v,u)$ with $u \in U_{st}$ would violate (iii) when $v \in U_{st}$ or (i) if $v \notin U_{st}$. Analogously, if $v \in U_{st}$ and $u \notin U_{st}$, then (ii) is violated. Thus, a weak superbubbloid cannot contain the head or tail of a back edge. Only for Condition (i), we also need to consider the case that $x \notin V(T)$.
(F1) can be satisfied only if $\mathrm{OutParent}(u) > -1$ for every $u \in U_{st} \setminus \{s\}$. Analogously, (F2) can only be true if $\mathrm{OutChild}(u) < \infty$ for all $u \in U_{st} \setminus \{t\}$. Hence, it suffices to rule out false positive weak superbubbloids in $G$ by ensuring that every vertex $u$ that violates one of the three conditions also violates (F1) or (F2). This is achieved by setting $\mathrm{OutParent}(u) = -1$ for a vertex $u$ if there is an edge $(x,u)$ such that $x \notin V(T)$ or $(x,u)$ is a back edge; analogously, we set $\mathrm{OutChild}(v) = \infty$ for all $v$ with an incident edge $(v,x)$ such that $x \notin V(T)$ or $(v,x)$ is a back edge. Equation (3) implements exactly these conditions. Thus, only weak superbubbloids fulfill (F1) and (F2).
Conversely, it suffices to note that by Lemma 4(ii), every weak superbubbloid forms a contiguous interval w.r.t. the postorder $\pi$ of $T$ and, thus, also w.r.t. the reverse postorder $\bar{\pi}$ of $T$. □
We denote by Superbubble the algorithm DAGsuperbubble with the modified functions $\mathrm{OutParent}(\cdot)$ and $\mathrm{OutChild}(\cdot)$ as described above. By construction, Superbubble identifies minimal intervals of $\bar{\pi}$ that satisfy (F1) and (F2); see Figure 1 for an illustration and [7] for full details. Since the modification of $\mathrm{OutParent}(\cdot)$ and $\mathrm{OutChild}(\cdot)$ only amounts to setting additional entries to $-1$ or $\infty$, respectively, the performance remains unaffected. According to Theorem 1, the minimal intervals satisfying (F1) and (F2) are exactly the minimal weak superbubbloids and, thus, by definition, the weak superbubbles. Therefore, we have:
Corollary 3.
Let $G$ be a digraph, and let $T$ be a DFS-tree on $G$ with a root $r$ that is not an interior vertex or exit of a weak superbubble. Then, Superbubble correctly identifies exactly the weak superbubbles $\langle s,t \rangle$ in $G$ whose vertex set satisfies $U_{st} \subseteq V[r]$.
It is straightforward to extend this result to a DFS-forest that covers $V(G)$ entirely: This forest is constructed by first constructing $T_1$ with root $r_1$ covering $V[r_1]$. Then, $T_2$ is constructed from a root $r_2$ searching on $V(G) \setminus V[r_1]$, and so on; see Lemma 2. This amounts to constructing an auxiliary graph $G^*$ from $G$ by adding an artificial root $r_0$ and out-edges $(r_0,r_1), (r_0,r_2), \dots, (r_0,r_k)$, and defining the sibling order of the roots as $r_1 \triangleleft r_2 \triangleleft \dots \triangleleft r_k$. The DFS-forest $F(r_1, \dots, r_k)$ with given roots $r_1, r_2, \dots, r_k$ on $G$ is then equivalent to the DFS-tree $T^*$ rooted at $r_0$ on $G^*$ if we define the reverse postorder of $T^*$ such that $\bar{\pi}(r_0) = 0$. We note, furthermore, that the sibling order, i.e., the order in which the roots are used to seed DFS, is arbitrary.
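The construction of a DFS-forest from a given sequence of roots can be sketched as follows. The Python code is an illustration only: it assumes an adjacency dictionary `adj`, takes the roots as given (it does not check that they are legitimate), and returns 0-based positions in the concatenation of the reverse postorders of the individual trees, without materializing the artificial root $r_0$. The function name is hypothetical.

```python
def dfs_forest_reverse_postorder(adj, roots):
    """Concatenate the reverse postorders of DFS-trees grown from roots."""
    visited = set()
    order = []
    for r in roots:
        if r in visited:
            continue                    # r is already covered by an earlier tree
        post = []                       # postorder of the tree rooted at r
        visited.add(r)
        stack = [(r, iter(adj.get(r, ())))]
        while stack:
            v, it = stack[-1]
            for w in it:
                if w not in visited:    # restrict search to uncovered vertices
                    visited.add(w)
                    stack.append((w, iter(adj.get(w, ()))))
                    break
            else:
                post.append(v)
                stack.pop()
        order.extend(reversed(post))    # reverse postorder of this tree
    return {v: i for i, v in enumerate(order)}


# Two roots; the second tree only explores vertices not reached from r1.
adj = {"r1": ["a"], "a": [], "r2": ["b", "a"], "b": []}
rbar = dfs_forest_reverse_postorder(adj, ["r1", "r2"])
```

In the example, the vertex `a` belongs to the first tree, so the edge `(r2, a)` leads out of the second tree and is not followed.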
Corollary 4.
Let $G$ be a digraph, and let $F$ be a DFS-forest on $G$ comprising DFS-trees $T_i$ with roots $r_i$, $1 \le i \le k$, none of which is an interior vertex or exit of a weak superbubble. Let $\bar{\pi}$ be the reverse postorder on $F$, obtained by concatenating the reverse postorders on the constituent DFS-trees. Then, Superbubble correctly identifies exactly the weak superbubbles $\langle s,t \rangle$ in $G$. Furthermore, given the roots $r_i$, Superbubble has a running time of $O(|E|+|V|)$.
Proof. 
Correctness follows immediately from Corollary 3, the construction of $T^*$ on the auxiliary digraph $G^*$, and Lemma 2. During the DFS, each out-edge is considered exactly once, and each vertex is traversed twice. The number $k$ of required roots is limited by $|V|$. For each vertex $v$, checking whether $\mathrm{OutParent}(v) = -1$ or $\mathrm{OutChild}(v) = \infty$ requires checking all neighbors only; hence, the total effort is no more than $O(|E|+|V|)$. The linear time complexity of DAGsuperbubble, finally, is proven in [7]. □
It is important to note that the correctness of Superbubble (Corollary 3) crucially depends on the correct choice of the root $r$ of the DFS-tree. The remaining problem thus is to find a suitable sequence of roots $r_1, r_2, \dots, r_k$.
Definition 3.
A vertex $r \in V$ is a legitimate root if for every weak superbubble $\langle s,t \rangle$ in $G$ with vertex set $U_{st}$, we have either $U_{st} \subseteq V[r]$ and $t \prec s$ (in the ancestor order of a search tree with root $r$), or $U_{st} \cap V[r] = \emptyset$.
We can summarize the discussion in the following form:
Corollary 5.
The algorithm Superbubble detects all weak superbubbles in $G$ if and only if there is a set $\{r_1, r_2, \dots, r_k\}$ of legitimate roots such that the DFS-forest $F(r_1, r_2, \dots, r_k)$ covers $V(G)$.
Corollary 6.
A vertex $r \in V$ is a legitimate root if and only if $r$ is neither an interior vertex nor an exit of a weak superbubble.
Proof. 
By Corollary 4, a root is legitimate if it is not the exit or an interior vertex of a weak superbubble. Conversely, if $r$ is an interior vertex or the exit of $\langle s,t \rangle$, then either a DFS-tree rooted in $r$ does not reach the entrance $s$ at all, or there is no search tree with root $r$ such that $t \prec s$, since by definition of a weak superbubble, the exit $t$ is found before $s$ along every path from $r$ to $s$. □
Lemma 5.
Let $G$ be a digraph and $v \in V(G)$ a source, i.e., a vertex with in-degree zero. Then, $v$ is a legitimate root.
Proof. 
Since v is not reachable from any other vertex, it is only reachable by DFS if the traversal starts in v. By the same argument, v is neither an interior vertex, nor the exit of a weak superbubble and, thus, is a legitimate root. □
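Source vertices are the easiest legitimate roots to find, since they only require in-degrees. The following Python sketch collects them from an adjacency dictionary `adj`; the function name `source_vertices` is hypothetical, and the sketch does not address roots inside strongly-connected components.

```python
def source_vertices(adj):
    """Return all vertices with in-degree zero, in sorted order."""
    vertices = set(adj)
    has_in_edge = set()
    for u, ws in adj.items():
        vertices.update(ws)       # vertices may appear only as edge targets
        has_in_edge.update(ws)    # every target has in-degree at least one
    return sorted(vertices - has_in_edge)


# "s" is a source; the cycle c1 -> c2 -> c1 contains no source.
adj = {"s": ["a"], "a": ["b"], "c1": ["c2"], "c2": ["c1"]}
sources = source_vertices(adj)
```

In the example, the cycle on `c1` and `c2` is not reachable from the only source `s`, which illustrates why additional roots are needed in general.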
Unfortunately, there is no guarantee that a digraph G has source vertices, and even if they exist, not every vertex of G is necessarily reachable from them. The task is, therefore, to identify legitimate roots located within strongly-connected components.

2.4. Cycles, C-Covers, and C-Cuts

Definition 4.
Let $G$ be a digraph. A set $C = \{c_1, \dots, c_k\} \subseteq V(G)$ is a cycle in $G$ if $k = |C|$ and $E(C) := \{(c_1,c_2), \dots, (c_{k-1},c_k), (c_k,c_1)\} \subseteq E(G)$. A pair of vertices $c_i, c_j \in C$ determines a cycle interval:
$$C(c_i,c_j) := \begin{cases} \{c_{i+1}, \dots, c_{j-1}\} & \text{if } i < j \\ \{c_{i+1}, \dots, c_k\} \cup \{c_1, \dots, c_{j-1}\} & \text{otherwise} \end{cases}$$
By definition, the vertices $c_i$ are pairwise distinct and indexed consecutively along $C$. Importantly, cycle intervals contain only the interior of the unique path in $C$ connecting the defining endpoints $c_i$ and $c_j$. Thus, $C(c_1,c_2) = \emptyset$ if $(c_1,c_2) \in E(C)$ and $C(v,v) = C \setminus \{v\}$ for all $v \in C$. The C-distance of two vertices $c_i$ and $c_j$ along a cycle $C$ is the length of the directed path, i.e., the number of edges, from $c_i$ to $c_j$. More explicitly,
$$d_C(c_i,c_j) := \begin{cases} j - i & \text{if } i < j \\ j - i + k & \text{if } i \ge j \end{cases} \;=\; |C(c_i,c_j)| + 1$$
since the number of inner vertices is one less than the number of edges. In particular, $d_C(v,v) = |C|$ for all $v \in C$. The C-distance $d_C$ is not symmetric. Instead, we have $d_C(u,v) + d_C(v,u) = |C|$ for any two vertices $u \neq v \in C$. Another useful consequence of the definition of $d_C$ is:
$$d_C(v,w) < d_C(v,u) \implies d_C(w,u) = d_C(v,u) - d_C(v,w)$$
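The C-distance and the cycle intervals are simple index computations. A minimal Python sketch, assuming the cycle is given as a list of its vertices in cyclic order (the function names `c_distance` and `cycle_interval` are hypothetical):

```python
def c_distance(cycle, u, v):
    """Number of edges on the directed path from u to v along the cycle."""
    i, j, k = cycle.index(u), cycle.index(v), len(cycle)
    return j - i if i < j else j - i + k


def cycle_interval(cycle, u, v):
    """Interior vertices of the path from u to v along the cycle."""
    i, k = cycle.index(u), len(cycle)
    d = c_distance(cycle, u, v)
    # The interior has d - 1 vertices: the path vertices without u and v.
    return [cycle[(i + step) % k] for step in range(1, d)]


cycle = ["c1", "c2", "c3", "c4"]
```

With this cycle, the distance from `c1` to `c2` is 1 and the interval between them is empty, whereas the interval from `c1` back to itself contains the three other vertices.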
The following implication will be useful later on:
Corollary 7.
Let $G$ be a digraph; let $C$ be a cycle in $G$; and let $c_1, c_2, c_3 \in C$. Then, $d_C(c_1,c_2) \le d_C(c_1,c_3)$ if and only if $c_1 \in C(c_3,c_2) \cup \{c_3\}$.
Proof. 
If $c_1$, $c_2$, and $c_3$ are pairwise distinct, the l.h.s. is true if the path from $c_1$ to $c_2$ is a subpath of the path from $c_1$ to $c_3$, i.e., $c_1 \notin C(c_2,c_3)$ and, thus, $c_1 \in C(c_3,c_2) \cup \{c_3\}$. The converse is obvious. The statement is trivial for $c_1 = c_3$. If $c_2 = c_3$, the l.h.s. is always true, while on the r.h.s., we have $C(c_2,c_2) \cup \{c_2\} = C \ni c_1$. For $c_1 = c_2$, both the l.h.s. and the r.h.s. are satisfied only if $c_2 = c_3$. □
In the following, it will be useful to know whether two vertices on a cycle are also reachable via a directed path that is disjoint from C. We formalize this idea as a binary relation on C.
Definition 5.
Let $G$ be a digraph, $C$ a cycle in $G$, and $s, t \in V$. Then, $t$ is C-reachable from $s$, in symbols $s \leadsto_C t$, if there is a path $p = \{s = v_0, \dots, v_h = t\}$ such that $h \ge 1$ and $v_i \notin C$ for $0 < i < h$.
We have used the letters $s$ and $t$ here since C-reachability will be used to identify potential candidates for entrance and exit of superbubbles. C-reachability is defined not only for vertices in the “reference cycle” $C$. It satisfies a restricted transitivity property: If $v \in V(G) \setminus C$, $s \leadsto_C v$, and $v \leadsto_C t$, then $s \leadsto_C t$. Another interesting observation is that $s \leadsto_C s$ implies that there is a directed cycle $C'$ such that $C' \cap C = \{s\}$. As an immediate consequence of Definition 5, we obtain:
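C-reachability can be tested with a graph search that is only allowed to continue through vertices outside of the cycle. The following Python sketch illustrates this for a single pair of vertices; `adj` is assumed to be an adjacency dictionary, `cycle` a collection of the cycle's vertices, and the function name `c_reachable` is hypothetical.

```python
def c_reachable(adj, cycle, s, t):
    """True if there is a path s = v_0, ..., v_h = t with h >= 1 whose
    inner vertices v_1, ..., v_{h-1} all lie outside of the cycle."""
    on_cycle = set(cycle)
    seen = set()
    todo = [s]
    while todo:
        v = todo.pop()
        for w in adj.get(v, ()):
            if w == t:
                return True             # path of length >= 1 ends in t
            if w not in on_cycle and w not in seen:
                seen.add(w)             # w may serve as an inner vertex
                todo.append(w)
    return False


# Cycle c1 -> c2 -> c3 -> c1 with a detour c1 -> x -> c3 outside the cycle.
adj = {"c1": ["c2", "x"], "c2": ["c3"], "c3": ["c1"], "x": ["c3"]}
cycle = ["c1", "c2", "c3"]
```

In the example, `c3` is C-reachable from `c1` via the detour through `x`, and every cycle edge trivially gives C-reachability, while `c1` is not C-reachable from `c2` because the only path passes through the cycle vertex `c3`.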
Lemma 6.
Let $G$ be a digraph, $C$ a cycle in $G$, and $c_1, c_2 \in C$ such that $c_1 \leadsto_C c_2$ and $d_C(c_1,c_2) > 1$. Then, $c_1$ and $c_2$ are connected by (at least) two edge-disjoint directed paths. In particular, $\{(v_i, v_{i+1}) \mid 0 \le i < h\} \cap E(C) = \emptyset$.
Definition 6.
Let $C$ be a cycle in the digraph $G$ and $u, v \in C$. Then, $C(u,v)$ is C-covered if $u \leadsto_C v$.
As an immediate consequence of the definition, $v \leadsto_C v$ implies that $C \setminus \{v\}$ is covered, while nothing is covered if $(u,v) \in E(C)$.
Consider two C-intervals $C(u,v)$ and $C(x,y)$ on the same cycle $C$ of the digraph $G$. We say that $C(u,v)$ is included in $C(x,y)$ if $C(u,v) \subseteq C(x,y)$, that $C(u,v)$ and $C(x,y)$ are disjoint if $C(u,v) \cap C(x,y) = \emptyset$, and that $C(x,y)$ extends $C(u,v)$ if $d_C(x,v) < d_C(u,v)$ and $d_C(x,v) < d_C(x,y)$. In particular, if $C(x,y)$ extends $C(u,v)$, then $x \in C(u,v)$, since the interval boundaries themselves are not considered part of the C-intervals. For each pair of distinct C-intervals, thus exactly one of the following four statements is true: (a) the C-intervals are disjoint; (b) one C-interval is contained in (i.e., a proper subset of) the other one; (c) one C-interval, say $C(x,y)$, extends the other one, but not vice versa, i.e., $x \in C(u,v)$ and $y \notin C(u,v)$; (d) both C-intervals extend each other, i.e., $x, y \in C(u,v)$. Figure 2 illustrates the four cases. Note that in Case (c), the interval boundaries are arranged in the order $u \to x \to v \to y \to u$ along the cycle, while in Case (d), the arrangement is $u \to y \to x \to v \to u$ along $C$.
In the following, we will use the notation:
$$Q(C) := \{C(u,v) \mid u \leadsto_C v,\ u, v \in C\}, \qquad \bigcup Q(C) := \bigcup_{B \in Q(C)} B = \{w \mid \exists B \in Q(C) : w \in B\}$$
for the set of all C-covered intervals and the set of all C-covered vertices of $C$, respectively. Note that $Q(C) \neq \emptyset$ since $u \leadsto_C v$ holds for $(u,v) \in E(C)$. By the same argument, there is at least one interval $C(u,v) \in Q(C)$ for each $u \in C$, albeit some or even all of these may be empty.
Definition 7.
A subset B Q ( C ) is a C -cover of C if B B B = Q ( C ) , and B is a total C -cover of C if B B B = C . We say that C is totally C -covered if C has a total C -cover.
Note that C is totally C -covered if and only if Q ( C ) = C .
Definition 8.
A vertex $v \in K(C) := C \setminus Q(C)$ is a C-cut vertex.
Obviously, C is either totally C -covered or it has a non-empty set K ( C ) of C -cut vertices.
Definition 9.
Let C be a cycle in the digraph G. A C-cover $\mathcal{B}$ of C is clean if $B', B'' \in \mathcal{B}$ and $B' \subseteq B''$ implies $B' = B''$.
In other words, in a clean C -cover, no C -covered interval is contained within another one.
Corollary 8.
Let C be a cycle in the digraph G, and let $\mathcal{B}$ be a clean C-cover. Then, either $\mathcal{B} = \{\emptyset\}$ or, for every $C(u,v) \in \mathcal{B}$, $d_C(u,v) > 1$.
Proof. 
Recall that $C(u,v) = \emptyset$ if and only if $d_C(u,v) = 1$. Thus, $\mathcal{Q}(C) = \{\emptyset\}$ if and only if there is no $C(u,v) \in \mathcal{B}$ with $d_C(u,v) > 1$. Since the empty set is a subset of every other set, $d_C(u,v) > 1$ for every $C(u,v) \in \mathcal{B}$ unless $\mathcal{B} = \mathcal{Q}(C) = \{\emptyset\}$. □
Lemma 7.
Let C be a cycle in the digraph G. Then, Q ( C ) contains a clean C -cover B .
Proof. 
Let $\mathcal{B} \subseteq \mathcal{Q}(C)$ be a set of C-covered intervals that together C-cover $Q(C)$. Suppose $\mathcal{B}$ is not clean. Then, there are two intervals $C(p,q) \in \mathcal{B}$ and $C(u,v) \in \mathcal{B}$ such that $C(p,q) \subseteq C(u,v)$. Then, $\mathcal{B}' = \mathcal{B} \setminus \{C(p,q)\}$ still C-covers $Q(C)$. The removal of such redundant intervals can be repeated until no further removable interval can be found. By Definition 9, the remaining C-cover is clean. □
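The removal argument in this proof can be sketched directly in Python. The sketch assumes that intervals are given as pairs of boundary vertices on a cycle listed in cycle order; `interior` and `clean_cover` are hypothetical helper names, and the quadratic subset test is chosen for readability only, not for the linear running time targeted in the text.

```python
def interior(cycle, u, v):
    """Vertices strictly between u and v in cycle order (boundaries excluded)."""
    n = len(cycle)
    start = cycle.index(u)
    length = (cycle.index(v) - start) % n or n   # a full turn if u == v (assumed convention)
    return frozenset(cycle[(start + k) % n] for k in range(1, length))


def clean_cover(cycle, intervals):
    """Drop every interval that is properly contained in another one.

    Intervals with identical vertex sets (e.g., several empty ones) are kept once.
    """
    sets = {iv: interior(cycle, *iv) for iv in intervals}
    kept, seen = [], set()
    for iv, s in sets.items():
        if s in seen or any(s < t for t in sets.values()):
            continue                      # redundant: duplicate or proper subset
        seen.add(s)
        kept.append(iv)
    return kept


cycle = [0, 1, 2, 3, 4, 5]
cover = clean_cover(cycle, [(0, 3), (1, 3), (2, 5), (0, 1)])
```

Here `(1, 3)` and the empty interval `(0, 1)` are removed because they are contained in `(0, 3)`, while `(0, 3)` and `(2, 5)` remain.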
Definition 10.
Let C be a cycle in a digraph G. Then:
$$L(C) = \{\, C(u,v) \in \mathcal{Q}(C) \mid d_C(u,v') \le d_C(u,v) \text{ for all } v' \in C \text{ with } C(u,v') \in \mathcal{Q}(C) \,\}$$
By definition, $L(C)$ consists of all C-covered intervals for which there is no larger C-covered interval with the same starting point. Since every $C(p,q) \in \mathcal{Q}(C) \setminus L(C)$ is contained in an interval with the same starting point, $L(C)$ is a C-cover of C. Thus, Lemma 7 implies:
Corollary 9.
Let C be a cycle in a digraph G. Then, there is a clean C-cover $\mathcal{B} \subseteq L(C)$.
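As a small illustration of Definition 10, the following Python sketch selects, for each starting point, the longest covered interval from a given collection. It assumes the covered intervals are already known and given as pairs `(u, v)`; the function name `longest_intervals` and the returned dictionary layout are assumptions made for this example.

```python
def longest_intervals(cycle, covered):
    """Map each start vertex u to the end v of its longest covered interval."""
    n = len(cycle)
    pos = {v: i for i, v in enumerate(cycle)}

    def dist(u, v):
        # Number of cycle edges from u to v; a full turn if u == v (assumed convention).
        return (pos[v] - pos[u]) % n or n

    best = {}
    for u, v in covered:
        if u not in best or dist(u, v) > dist(u, best[u]):
            best[u] = v
    return best


cycle = [0, 1, 2, 3, 4]
L = longest_intervals(cycle, [(0, 1), (0, 3), (1, 2), (2, 4), (2, 3)])
```

Each start vertex contributes at most one interval, so the result has at most as many entries as the cycle has vertices.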
Lemma 8.
Let C be a cycle in the digraph G. A clean C-cover $\mathcal{B}$ of C is total if and only if $\mathcal{B} \neq \{\emptyset\}$ and every $B \in \mathcal{B}$ is extended by at least one $B' \in \mathcal{B}$.
Proof. 
If $\mathcal{B} = \{\emptyset\}$, then $Q(C) = \emptyset$, and thus, $\mathcal{B}$ is not total. In the following, we assume $\mathcal{B}$ is a clean C-cover with $\mathcal{B} \neq \{\emptyset\}$. Suppose, for contradiction, that $C(u,v) \in \mathcal{B}$ is not extended by any $B \in \mathcal{B}$. Then, any interval $B' \in \mathcal{B}$ C-covering v would have to contain $C(u,v)$, contradicting the assumption that $\mathcal{B}$ is clean. Thus, v is a C-cut vertex, and hence, $\mathcal{B}$ is not total. If $\mathcal{B}$ is total, therefore, it is non-empty, and every $B \in \mathcal{B}$ is extended by some $B' \in \mathcal{B}$.
Conversely, suppose c is a C-cut vertex of C. If $C(u,c) \in \mathcal{B}$ for some u, then the first part of the proof implies that $C(u,c)$ is not extended by any $B \in \mathcal{B}$. If $\mathcal{B}$ contains no interval $C(u,c)$, then consider the vertex v such that $C(u,v) \in \mathcal{B}$ for which $d_C(v,c)$ is minimal. Since c is a C-cut vertex, there is no extension of $C(u,v)$, since any such extension $B'$ would either contradict the minimality of $d_C(v,c)$ or C-cover c, thereby contradicting the assumption that c is a C-cut vertex. Thus, v is again a C-cut vertex. As shown in the first part of the proof, $C(u,v)$ is therefore not extended by any $B \in \mathcal{B}$. We conclude that unless $\mathcal{B}$ is a total C-cover or $\mathcal{B} = \{\emptyset\}$, there is an interval $B \in \mathcal{B}$ without an extension. □
Figure 3 shows an example of a cycle with a total C -cover and a cycle with a C -cut, respectively. Since the largest C -interval in C is v , v = C \ { v } for some v, every total C -cover comprises at least two C -covered intervals.
The following lemma provides us with a convenient way to obtain a total C -cover.
Lemma 9.
Let C be a cycle in G, v C , and c 1 , c 2 , c 3 , c 4 C with d C ( c 1 , c 3 ) d C ( c 1 , c 2 ) < d C ( c 1 , c 4 ) . Then, c 1 C v , c 2 C v , v C c 3 , and v C c 4 imply that B : = { c 1 , c 4 , c 2 , c 3 } is a total clean C -cover of C.
Proof. 
By construction, c 1 , c 4 and c 2 , c 3 are C -covered intervals. By definition, we have c 2 c 1 , c 4 , and c 2 , c 3 extends c 1 , c 4 . Since c 3 c 1 , c 4 , the two intervals cover all of C. Furthermore, the cover B consists of only two intervals that are not subsets of each other; thus, it is clean. □
We will refer to this type of total clean C -cover as a single-vertex cover of C. An example is shown in Figure 4.

2.5. Cycles, C -Cover, C -Cuts, and Superbubbles

A key result of [5] states that every superbubble is either contained in or disjoint from any strongly-connected component. The following results on the interaction of cycles and superbubbles are a generalization of this observation. The acyclicity condition (S.v) can be restated in the following way:
Lemma 10.
Let s , t be a weak superbubbloid in the digraph G and u s , t . Then, every cycle containing u also contains s and t.
Proof. 
If u s , then all in-neighbors of u are contained in s , t . Similarly, if u t , then all out-neighbors of u are contained in s , t . Since every cycle through u contains both in- and out-neighbors of u, it, in particular, contains an edge e in s , t . (S.v) now implies any cycle through e contains both s and t. □
Lemma 11.
Let C be a cycle in the digraph G, and let $\mathcal{B}$ be a total clean C-cover of C. If $C(u,v) \in \mathcal{B}$, then v is neither an interior vertex nor the exit of a weak superbubble, i.e., v is a legitimate root.
Proof. 
Assume, for contradiction, that v is an interior or the exit of the superbubble s , t . Since C is totally C -covered by assumption, Corollary 8 implies d C ( u , v ) > 1 . Thus, by Lemma 6, there are (at least) two edge-disjoint paths from u to v. Since neither path can leave s , t before passing through s, neither of them contains the entrance s, and hence, both are contained in the weak superbubble. Thus, s , t contains u , v .
Since B is a total clean C -cover of C, there is an interval p , q B that extends u , v , i.e., p u , v , and hence, p is an inner vertex of s , t . Therefore, s , t contains p , q , and it again has an extending C -interval. Repeating the argument, we conclude that every vertex of Q ( C ) is an inner vertex of s , t . Since the cover B is total, Q ( C ) = C , i.e., the cycle C consists entirely of interior vertices of s , t , i.e., C is a proper subset of s , t . This contradicts the acyclicity condition (S.v). □
Corollary 10.
Let C be a cycle in the digraph G. Suppose C is totally C-covered, and let $C(u,v) \in L(C)$ such that $d_C(u',v') \le d_C(u,v)$ for all $C(u',v') \in L(C)$. Then, v is a legitimate root.
Proof. 
The longest C-interval $C(u,v) \in L(C)$ by construction cannot be contained within another C-interval. Therefore, $C(u,v)$ is contained in the clean cover $\mathcal{B} \subseteq L(C)$ of Corollary 9. By Lemma 11, its endpoint v is a legitimate root. □
Let us now turn to cycles with C -cut vertices:
Lemma 12.
Let C be a cycle of the digraph G, and let c be a C -cut point of C, i.e., c K ( C ) . Then, c is not an interior vertex of any weak superbubble.
Proof. 
Assume, for contradiction, that c is an interior vertex of a weak superbubble s , t . Then, there is a path p from s to t not passing through c. Otherwise, s , c is a superbubbloid, contradicting the assumption that s , t is a weak superbubble; see corollary 5 in [7]. Along p, let u be the last vertex on C before c, and let v be the first vertex on C after c. Thus, u C v . Therefore, c is C -covered in C, a contradiction. □
The example in Figure 5 shows that it is possible that every entrance of a superbubble is at the same time the exit of another superbubble. Such graphs do not have any legitimate root. Nevertheless, it is easily possible to obtain all the superbubbles. To this end, fix a C-cut vertex c for some cycle C in G, and consider the auxiliary digraph $G^{\#}$ obtained from G by splitting c into two vertices $c'$ and $c''$ so that $c'$ retains only the in-edges and $c''$ retains only the out-edges.
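The splitting of a vertex can be sketched as follows. This is a minimal illustration that builds the split digraph explicitly (the text notes that an explicit construction can be avoided in practice); it assumes the digraph is stored as a dictionary of successor lists, and the function name `split_vertex` as well as the labels `(c, "in")` and `(c, "out")` for the two copies are hypothetical.

```python
def split_vertex(succ, c):
    """Return a copy of the digraph in which c is split into two vertices.

    (c, "in") keeps only the in-edges of c; (c, "out") keeps only its out-edges.
    """
    c_in, c_out = (c, "in"), (c, "out")
    split = {}
    for u, nbrs in succ.items():
        source = c_out if u == c else u                 # out-edges of c start at the "out" copy
        split[source] = [c_in if w == c else w for w in nbrs]  # in-edges of c end at the "in" copy
    split.setdefault(c_in, [])                          # the "in" copy has no out-edges
    split.setdefault(c_out, [])
    return split


G = {"a": ["b"], "b": ["c"], "c": ["a"]}
G_split = split_vertex(G, "b")
```

In the example, the cycle a, b, c is opened at b: the copy `("b", "out")` becomes a source and the copy `("b", "in")` becomes a sink.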
Lemma 13.
Let C be a cycle in the digraph G, $c \in C$ a C-cut vertex, and $G^{\#}$ the digraph obtained from G by splitting c into $c'$ (retaining the in-edges) and $c''$ (retaining the out-edges). If $\langle s,t \rangle$ is a weak superbubble in G, then it is also a weak superbubble in $G^{\#}$, where c as an entrance in G corresponds to $c''$ in $G^{\#}$ and c as an exit in G corresponds to $c'$ in $G^{\#}$. Conversely, every weak superbubble $\langle s,t \rangle$ with $\{s,t\} \neq \{c', c''\}$ in $G^{\#}$ is also a weak superbubble in G.
Proof. 
For the proof, we construct the auxiliary graph G ˜ # by inserting the edge ( c , c ) into G # . Then, there is a 1-1 relationship between the set of paths in G and the set of paths that do not start or end with the edge ( c , c ) in G ˜ # , which is constructed as follows: If p starts at c in G, it starts in c in G ˜ # ; if p ends at c in G, it ends at c in G ˜ # ; and if p runs through c in G, then it runs through the edge ( c , c ) in G ˜ # . The 1-1 correspondence of weak superbubbles now follows immediately from the equivalence of the path systems in G and G ˜ # since reachability is the same for every pair u , v , with c as the starting point corresponding to c and c as the endpoint corresponding to c . Thus, G and G ˜ # have the same superbubbles, except possibly for the ones with { s , t } = { c , c } in G ˜ # . Now, consider a DFS-tree on G # rooted in c . The edge ( c , c ) is not a tree edge and necessarily appears as a back edge. Since c is a C -cut vertex, c and c are not interior vertices of any weak superbubble in G ˜ # . Thus, the edge ( c , c ) does not affect any weak superbubble of G ˜ # , and thus, G # and G ˜ # have the same weak superbubbles, except possibly the ones with { s , t } = { c , c } . □
The only potential difference between the weak superbubbles of G and $G^{\#}$ is, therefore, the possibility that $G^{\#}$ contains $\langle c', c'' \rangle$ or $\langle c'', c' \rangle$ as an additional weak superbubble. Of course, it is easy to detect and remove the additional weak superbubble. Since $c''$, the copy of c that retains the out-edges, is a source in $G^{\#}$, we can apply Superbubble to $G^{\#}$ and remove the possible spurious weak superbubble $\langle c'', c' \rangle$ in order to obtain the correct set of weak superbubbles of G. In contrast to the auxiliary digraph constructions suggested in [5], $G^{\#}$ contains only a single extra vertex instead of doubling the size. More importantly, however, it is not necessary to construct $G^{\#}$ explicitly. Instead, one can modify the DFS starting at c in G in the following manner: when c is encountered for the first time as an out-neighbor of a tree vertex u, then $c'$ is inserted as a leaf with parent u and no further out-neighbors, with only a constant overhead. The algorithm Superbubble applied to $G^{\#}$ extracts the minimal intervals satisfying (F1) and (F2) w.r.t. the reverse postorder $\bar{\pi}$ of the DFS-tree rooted at $c''$, and thus correctly identifies the weak superbubbles of $G^{\#}$. The modified DFS on G rooted at c by construction yields the same DFS-tree on $G^{\#}$, and thus the same reverse postorder. Together with setting OutChild($c''$) = OutChild(c) and OutParent($c'$) = OutParent(c), and with OutChild($c'$) and OutParent($c''$) set to the values indicating that $c'$ has no out-neighbors and $c''$ has no in-neighbors, Superbubble operating on the modified DFS-tree thus correctly identifies the weak superbubbles in $G^{\#}$. We refer to this algorithm, which is equivalent to applying Superbubble to $G^{\#}$, as Superbubble#.
Definition 11.
Let G be a digraph. Then, $r \in V$ is a quasi-legitimate root if either:
(i) 
r is a source in G,
(ii) 
r is the end point of an interval $C(u,r) \in \mathcal{B}$ of a total clean C-cover $\mathcal{B}$ of some cycle C in G, or
(iii) 
r is a C-cut vertex of some cycle C in G.
Our discussion so far can be summarized as:
Corollary 11.
Algorithm Superbubble# correctly identifies the superbubbles in G [ V [ r ] ] if and only if r is a quasi-legitimate root.
As an immediate consequence of Lemmas 11 and 12, every cycle contains a quasi-legitimate root. Recalling that every vertex in the digraph G can be reached either from a source vertex or from a cycle, we finally obtain:
Theorem 2.
Every digraph G contains a set of quasi-legitimate roots $\{r_1, r_2, \dots, r_k\}$. Given these roots, the algorithm Superbubble# correctly identifies all superbubbles of G in linear time.
It remains to show, therefore, that a suitable set of roots can be identified in linear time. Clearly, this is possible for the sources. For superbubbles that cannot be reached from a source vertex, a suitable set of cycles needs to be identified.
Lemma 14.
Let $F = (T[v_1], T[v_2], \dots, T[v_k])$ be an arbitrary DFS-forest of G with constituent ordered trees $T[v_i]$ rooted at $v_i$, and let C be a cycle in G. Then, $C \cap V(T[v_i]) \neq \emptyset$ implies $C \subseteq V(T[v_i])$, and there is a $u \in C$ such that $C \subseteq V(T[u])$.
Proof. 
Let $v_i$ be the first root of F that can reach any vertex of C. Then, by definition of a cycle, $C \subseteq V[v_i]$. Thus, $C \subseteq V(T[v_i])$. Further, let u be the first vertex of C that is reached from $v_i$ in the DFS. Then, every other vertex of C is reached from u in the DFS. Thus, $C \subseteq V(T[u])$. □
The same is true for strongly-connected components:
Lemma 15.
[8] (corollary 11) Let S be a strongly-connected component in G, and let T be a DFS-tree with S V ( T ) . Then, there is a vertex v S such that S V ( T [ v ] ) . We call v the root of the strongly-connected component S in T.
Our aim is now to find a set of “start cycles” such that every cycle C is reachable from at least one of these start cycles.
Lemma 16.
Let T be a DFS-tree on the digraph G rooted in v, and let W be the set of ≺-maximal vertices w that have an incoming back edge $(u,w)$. Then, (i) every $w \in W$ is contained in a cycle, and (ii) every cycle $C \subseteq V(T)$ satisfies $C \subseteq V(T[w])$ for some $w \in W$.
Proof. 
Property (i) is an immediate consequence of the definition of DFS. Now, suppose $u \notin V(T[w])$ for all $w \in W$. Then, by construction, none of the vertices along the path from the root v to u have an incoming back edge, and thus, neither u nor any of its ancestors is contained in a cycle. Thus, if $x \in C$ for some cycle $C \subseteq V(T)$, then a vertex $w \in W$ exists such that $x \in V(T[w])$, and thus, $C \subseteq V(T[w])$. □
Note that $W = \emptyset$ if $T[v]$ does not contain a cycle. Since the vertex set of every cycle in the digraph G is necessarily contained in one of the constituent trees of a DFS-forest, we immediately obtain:
Corollary 12.
Let F be a DFS-forest on the digraph G, and let W be the set of ≺-maximal vertices w that have an incoming back edge $(u,w)$. Then, (i) every $w \in W$ is contained in a cycle, and (ii) every cycle C in G satisfies $C \subseteq V(T[w])$ for some $w \in W$ and some $T \in F$.
Lemma 17.
A set of cycles { C 1 , C 2 , . . . , C n } from which all cycles in G are reachable can be constructed in O ( | E | + | V | ) time.
Proof. 
The DFS-forest F on the digraph G is obtained in O ( | E | + | V | ) time. The set W is easily identified by a preorder traversal of F omitting a subtree as soon as a vertex w has an incoming back edge. The worst-case effort is O ( | V | ) since we only traverse the forest, not the entire digraph G. Given W and the associated back edges ( u k , w k ) identified in the previous step for each w k W , the cycle C k is explicitly retrieved by following the parent links of F from u k back to w k in O ( | V | ) time. □
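The construction in this proof can be sketched in Python as follows. The sketch assumes a digraph stored as a dictionary of successor lists and uses a recursive DFS for brevity (an iterative DFS would be preferable for large inputs); the function name `start_cycles` and all variable names are hypothetical.

```python
def start_cycles(succ):
    """Return one cycle (as a vertex list) per topmost back-edge target of a DFS-forest."""
    parent, children, state = {}, {}, {}   # state: 1 = on the DFS stack, 2 = finished
    back_source = {}                       # w -> some u such that (u, w) is a back edge
    roots = []

    def dfs(v):
        state[v] = 1
        for w in succ.get(v, ()):
            if w not in state:             # tree edge
                parent[w] = v
                children.setdefault(v, []).append(w)
                dfs(w)
            elif state[w] == 1:            # w is v itself or an ancestor of v: back edge
                back_source.setdefault(w, v)
        state[v] = 2

    for v in succ:
        if v not in state:
            roots.append(v)
            dfs(v)

    cycles = []
    todo = list(reversed(roots))
    while todo:                            # preorder traversal of the forest
        v = todo.pop()
        if v in back_source:               # topmost vertex with an incoming back edge
            path = [back_source[v]]
            while path[-1] != v:           # follow parent links from u back to w
                path.append(parent[path[-1]])
            cycles.append(path[::-1])
            continue                       # omit the subtree below v
        todo.extend(reversed(children.get(v, [])))
    return cycles


G = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
found = start_cycles(G)
```

In the example, only the cycle through 1, 2, 3 is reported: the cycle on 4 and 5 lies in the subtree below vertex 1 and is therefore reachable from the reported cycle.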
Lemma 17 ensures that a sufficient set of cycles can be found in linear time. More precisely, using the sources of G and a quasi-legitimate root r i in each cycle C i as roots, the algorithm Superbubble# correctly identifies all superbubbles in G in linear time. It remains to show that we can identify a quasi-legitimate root in a cycle C i .

2.6. Identification of Quasi-Legitimate Roots

The obvious approach to identify quasi-legitimate roots is to construct a clean C-cover. The obvious starting point is $L(C)$ since it requires the construction of no more than $|C|$ C-paths. This can be achieved in polynomial time, e.g., using an independent DFS-tree rooted at each $c \in C$ that ignores the edges of C. This naive approach, however, exceeds linear time even for a single cycle.
For $c \in C$, we construct a modified DFS-tree $T_c$ by excluding all other vertices of C from G. By construction, $u \in C$ is C-reachable from c if and only if $T_c$ contains an in-neighbor $u'$ of u, i.e., there is an edge $(u',u) \in E(G)$ with $u' \in V(T_c)$.
For each $v \in V(T_c)$, we are interested in the vertices $c_{\min} \in C$ and $c_{\max} \in C$ that are C-reachable from v and minimize and maximize $d_C(c, c_{\min})$ and $d_C(c, c_{\max})$, respectively. These can be recursively computed on $T_c$ by traversing $T_c$ in postorder. For each $v \in V(T_c)$, $c_{\min}$ and $c_{\max}$ are obtained by comparing the $c_{\min}$ and $c_{\max}$ values for the out-neighbors of v along $T_c$, and the vertices reachable directly from v. More precisely, at each leaf v of $T_c$, $c_{\max}[c,v]$ is initialized by the vertex $c' \in C$ such that $(v,c') \in E(G)$ and $c'$ maximizes $d_C(c,c')$. At each inner vertex v of $T_c$, $c_{\max}[c,v]$ is computed as the vertex $c'$ that maximizes $d_C(c,c')$ from the following set of candidates: $\{c_{\max}[c,u] \mid (v,u) \in E(T_c)\} \cup \{u \in C \mid (v,u) \in E(G)\}$. The vertex C-reachable from c with the maximal value of $d_C(c,\cdot)$ is thus $c_{\max}[c,c]$. The same computations are used for $c_{\min}[c,v]$, except that $d_C(c,\cdot)$ is minimized instead of maximized. The computations of $T_c$ and the values of $c_{\min}[c,v]$ and $c_{\max}[c,v]$ clearly can be performed in linear time. Repeating this for each $c \in C$, however, will, in general, exceed linear time since the length $|C|$ is not bounded in general.
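The postorder recursion along the tree edges can be sketched as follows. This simplified Python example handles tree edges and direct edges into the cycle only; cross edges and back edges, which are discussed separately in the text, are deliberately left out. The tree is assumed to be given as a dictionary of child lists, `cycle_nbrs` (a hypothetical name) maps a tree vertex to the cycle vertices it reaches by a single edge, and $d_C(c,c)$ is assumed to be the cycle length.

```python
def min_max_reach(tree_children, cycle_nbrs, cycle, c):
    """Compute, for every vertex of the tree rooted at c, the nearest and the
    farthest cycle vertex (measured from c along the cycle) reachable from it."""
    n = len(cycle)
    pos = {v: i for i, v in enumerate(cycle)}

    def dist(target):
        # Cycle edges from c to target; a full turn if target == c (assumed convention).
        return (pos[target] - pos[c]) % n or n

    c_min, c_max = {}, {}

    def visit(v):                                   # postorder: children first
        candidates = list(cycle_nbrs.get(v, ()))    # cycle vertices hit directly from v
        for w in tree_children.get(v, ()):
            visit(w)
            candidates += [x for x in (c_min[w], c_max[w]) if x is not None]
        c_min[v] = min(candidates, key=dist, default=None)
        c_max[v] = max(candidates, key=dist, default=None)

    visit(c)
    return c_min, c_max


cycle = [0, 1, 2, 3, 4]
tree = {0: ["a"], "a": ["b", "d"]}      # tree rooted at the cycle vertex 0
hits = {"b": [2], "d": [4]}             # edges from tree vertices back into the cycle
c_min, c_max = min_max_reach(tree, hits, cycle, 0)
```

In the example, the root entry `c_max[0]` is the cycle vertex 4, so the longest covered interval starting at 0 ends at 4.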
We can mostly reuse the information stored in T c , however. A crucial observation is the following:
Lemma 18.
Let C be a cycle of the digraph G; consider two distinct cycle vertices c 1 , c 2 C ; and let v C with c 1 C v and c 2 C v . If d C ( c 2 , c min [ c 1 , v ] ) d C ( c 2 , c max [ c 1 , v ] ) , then c min [ c 2 , v ] = c min [ c 1 , v ] and c max [ c 2 , v ] = c max [ c 1 , v ] . Otherwise, B = { c 1 , c max [ c 1 , v ] , c 2 , c min [ c 1 , v ] } forms a one-vertex C -cover.
Proof. 
For simplicity, we write c 3 = c min [ c 1 , v ] and c 4 = c max [ c 1 , v ] . By definition of c min and c max , we have (1) d C ( c 1 , c 3 ) d C ( c 1 , c 4 ) , and (2) for every c C satisfying v C c , we have c c 3 , c 4 { c 3 , c 4 } . Starting from Property (1), Corollary 7 implies c 1 c 4 , c 3 { c 4 } . As a consequence, for every c c 3 , c 4 { c 4 } , we have d C ( c 1 , c ) = d C ( c 1 , c 3 ) + d C ( c 3 , c ) . Since d C ( c 1 , c 3 ) is just a constant, d C ( c 1 , a ) d C ( c 1 , b ) implies d C ( c 3 , a ) d C ( c 3 , b ) for all a , b c 3 , c 4 { c 4 } .
First, assume d C ( c 2 , c 3 ) d C ( c 2 , c 4 ) . Then, Corollary 7 implies c 2 c 4 , c 3 { c 4 } . The same arguments as for c 1 show that d C ( c 2 , a ) d C ( c 2 , b ) implies d C ( c 3 , a ) d C ( c 3 , b ) , which in turn implies d C ( c 1 , a ) d C ( c 1 , b ) for all a , b c 3 , c 4 { c 4 } . Because of Property (2), this implication can be used in particular for every c C for which v C c might hold. Therefore, the same two vertices minimize and maximize d C ( c 1 , . ) and d C ( c 2 , . ) , and thus, we arrive at c min [ c 2 , v ] = c min [ c 1 , v ] and c max [ c 2 , v ] = c max [ c 1 , v ] .
Now, suppose d C ( c 2 , c 4 ) < d C ( c 2 , c 3 ) . Then, c 3 c 4 (otherwise, the distances would be equal), and Corollary 7 implies c 2 c 3 , c 4 { c 3 } . Since c 1 c 4 , c 3 { c 4 } , we obtain d C ( c 1 , c 3 ) d C ( c 1 , c 2 ) < d C ( c 1 , c 4 ) . By Lemma 9, B : = { c 1 , c 4 , c 2 , c 3 } is a one-vertex cover of C. □
The use of Lemma 18 is that it allows either to use the c min [ c 1 , v ] and c max [ c 1 , v ] values also for c 2 , or we obtain a one-vertex C -cover, which immediately provides us with a legitimate root according to Lemma 11. Thus, we need to continue the computation of c min [ c i , v ] and c max [ c i , v ] only until we encounter a one-vertex cover. Up to this point, the values of c min [ c i , v ] and c max [ c i , v ] are independent of c i by Lemma 18.
The difficulty is to compute the c min [ c i , v ] and c max [ c i , v ] for all v V ( T c ) correctly. We have already seen above how to handle tree edges. Forward-edges in T c do not effectively contribute, because the same information (minimization or maximization over values of d C ( c , . ) ) is also propagated stepwise along the tree-edges. Cross edges, on the other hand, could add information. Postorder traversal ensures, however, that the pertinent information at their starting points is already computed in time to include them to compute the correct value, i.e., we simply have to include the cross-edges in the minimization/maximization step.
Back edges are problematic when they belong to the same strongly-connected component S as C, i.e., $C \subseteq S$. In this case, they can be reached from a cycle vertex $c \in C$ and themselves reach a cycle vertex $u \in C$. Such back edges therefore influence which cycle vertices are reachable. To handle this information, S is split into parts that are strongly-connected components with respect to C-reachability. More precisely, we define a C-SCC as a strongly-connected component of the induced subgraph $G[V(G) \setminus C]$.
Consider the auxiliary graph $G_c$ with vertex set $(V(G) \setminus C) \cup \{c\}$ and all edges of $G[V(G) \setminus C]$, as well as all edges $(c,u)$ with $u \in V(G) \setminus C$. Then, c is not contained in a cycle of $G_c$, and thus, the SCCs of $G_c$ are exactly the C-SCCs and the single vertex c. By construction, $T_c$ is also a DFS-tree for $G_c$. Thus, Tarjan’s DFS-based SCC-detection algorithm (see Lemma 15) on $T_c$ identifies the C-SCCs as the SCCs of $G_c$. To mimic the traversal on $G_c$ instead of on $G[(V(G) \setminus C) \cup \{c\}]$, the graph on which $T_c$ was originally defined, it suffices to ignore the back edges leading to the root, i.e., edges of the form $(u,c)$ for $u \in V(G) \setminus C$. It is thus not necessary to construct the graph $G_c$ explicitly.
The definitions of c min and c max imply:
Corollary 13.
Let C be a cycle in the digraph G; let T c be a modified DFS-tree rooted at c C ; and let S be a C -SCC with S V ( T c ) . Then, c min [ c 1 , v ] and c max [ c 1 , v ] are independent of v for every v S .
This raises the question of whether the v-independent values of $c_{\min}[c_1,v]$ and $c_{\max}[c_1,v]$ can be obtained while traversing G. A partial answer is provided by:
Corollary 14.
Let C be a cycle in the digraph G; let T c be a modified DFS-tree rooted at c C ; and let v be the root of a C -SCC. Suppose the values of c min [ c 1 , w ] and c max [ c 1 , w ] are known for w V ( T c [ v ] ) . Then, c min [ c 1 , v ] and c max [ c 1 , v ] are obtained correctly by postorder traversal of T c considering all tree and cross edges.
Proof. 
The only missing information could be a back edge ( u , w ) with u V ( T [ v ] ) and v w . Such a back edge cannot exist because v is by assumption the root of a C -SCC, and thus, there is no cycle including u, v, and w G [ V ( G ) \ C ] . □
This observation yields a simple solution to obtain the correct entries for $c_{\min}[c_1,v']$ and $c_{\max}[c_1,v']$ for every $v' \in S$: determine the C-SCC and its root v, and set $c_{\min}[c_1,v'] \leftarrow c_{\min}[c_1,v]$ and $c_{\max}[c_1,v'] \leftarrow c_{\max}[c_1,v]$.
Tarjan [8] showed that SCCs can be found efficiently by DFS. Below, we will modify the approach slightly to operate on a given DFS-tree. We therefore briefly outline Tarjan’s SCC algorithm; for full details, we refer to [8]: First, the vertices are enumerated in preorder $\rho$. Then, a postorder traversal is used to compute, for each v, the lowlink $\ell(v)$, which is recursively defined as:
$$\ell(v) := \min\bigl( \{\ell(w) \mid (v,w) \text{ is a tree or unfinished cross edge}\} \cup \{\rho(w) \mid (v,w) \text{ is a back edge}\} \cup \{\rho(v)\} \bigr)$$
A cross edge is only included if it is “unfinished”, i.e., if its endpoint w has not been reported as part of a previously-completed SCC. A vertex v is the root of an SCC if $\ell(v) = \rho(v)$. Tarjan’s SCC algorithm now uses a stack to iterate over every vertex of the SCC S to mark them as finished. This cannot be done in the same way in a predefined DFS-tree.
The stack can be replaced, however, by an equally-efficient iterative method: Starting from v with $\ell(v) = \rho(v)$, simply traverse $T[v]$ starting at v; report all “unfinished” vertices as members of the SCC; and omit every subtree rooted in a “finished” vertex. To see that this is correct, note that $\ell(w) < \rho(w)$ for all $w \in S \setminus \{v\}$, and hence, w is “unfinished” when the postorder traversal encounters v. Lemma 15 implies that there is a path $(v, w_1, \dots, w_h = w)$ from v to w in T, with $w_i \in S$ and thus also “unfinished”. Thus, if u is “finished”, so are all its descendants, and the subtree $T[u]$ does not need to be considered. The only difference from Tarjan’s SCC algorithm is the tree traversal used to retrieve S, which considers every edge of T once and thus runs in a total time of $O(|V|)$. We summarize the discussion above as:
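The stack-free variant described above can be sketched in Python. The sketch first fixes a DFS-tree and then, in a separate postorder pass, computes the lowlinks and collects each SCC by traversing the subtree of its root while skipping finished vertices. It assumes a digraph given as a dictionary of successor lists and considers only the part reachable from `root`; the function name `sccs_on_dfs_tree` is hypothetical, and the code is an illustration rather than the LSD implementation.

```python
def sccs_on_dfs_tree(succ, root):
    """Strongly-connected components of the part of the digraph reachable from root."""
    rho, last, children, postorder = {}, {}, {}, []

    def build(v):                          # ordinary DFS: fixes the tree once
        rho[v] = len(rho)                  # preorder number
        children[v] = []
        for w in succ.get(v, ()):
            if w not in rho:
                children[v].append(w)
                build(w)
        last[v] = len(rho) - 1             # largest preorder number inside T[v]
        postorder.append(v)

    build(root)

    low, finished, sccs = {}, set(), []
    for v in postorder:
        low[v] = rho[v]
        for w in succ.get(v, ()):
            if rho[w] <= rho[v] <= last[w]:        # w is v or an ancestor of v: back edge
                low[v] = min(low[v], rho[w])
            elif w not in finished:                # tree, forward, or unfinished cross edge
                low[v] = min(low[v], low[w])
        if low[v] == rho[v]:                       # v is the root of an SCC
            component, todo = [], [v]
            while todo:                            # traverse T[v] instead of popping a stack
                u = todo.pop()
                if u in finished:
                    continue                       # omit the subtree of a finished vertex
                finished.add(u)
                component.append(u)
                todo.extend(children[u])
            sccs.append(component)
    return sccs


G = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
components = sorted(sorted(c) for c in sccs_on_dfs_tree(G, 1))
```

Forward edges are treated like tree edges here, which does not change the minimum because the same values are already propagated along the tree path.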
Lemma 19.
The modified version of Tarjan’s SCC algorithm correctly identifies all strongly-connected components in T in O ( | E | + | V | ) time.
Since the correct values of c min [ c 1 , u ] and c max [ c 1 , u ] are computed by postorder traversal of T c , they are already available when the root v of a C -SCC is encountered. Thus, identification of the C -SCC and the computation of c min [ c 1 , u ] and c max [ c 1 , u ] can be combined in the same tree traversal. The same tree traversal also guarantees that for every cross edge ( u , w ) , we have either (i) u and w in the same C -SCC or (ii) the values of c min [ c 1 , w ] and c max [ c 1 , w ] are computed correctly.
Now, consider the vertex $c_j$ along C, and suppose we have not encountered a one-vertex C-cover so far. Let $T_j$ be the DFS-tree rooted in $c_j$ that ignores all vertices already included in a previous DFS-tree. As for $c_i$, we can compute $c_{\min}[c_j,v]$ and $c_{\max}[c_j,v]$ with $v \in T_j$ along this tree. Then, $c_{\min}[c_j,v]$ either equals $c_{\min}[c_j,v]$ computed on $T_j$ or $c_{\min}[c_i,u]$ for some u such that $(v,u) \in E(G)$, depending on which has the smaller value of $d_C(c_j,\cdot)$, and $c_{\max}[c_j,v]$ either equals $c_{\max}[c_j,v]$ computed on $T_j$ or $c_{\max}[c_i,u]$, depending on which has the larger value of $d_C(c_j,\cdot)$. Note that $c_{\min}[c_i,v]$ and $c_{\max}[c_i,v]$ do not actually depend on i. In a practical implementation, they are simply stored as a function of v. The index $c_i$ is used only to keep track of the individual, disjoint DFS-trees $T_i$ rooted in $c_i$ in our arguments.
After processing all vertices of C, we have either found a one-vertex C-cover of C, or we know, for every $c_j \in C$, the largest C-covered interval $C(c_j, c_{\max}(c_j))$. Thus, we directly conclude:
$$L(C) := \{\, C(c_j, c_{\max}(c_j)) \mid c_j \in C \text{ and } c_{\max}(c_j) \neq \emptyset \,\}$$
In particular, we have shown that for each C, L ( C ) or a one-vertex cover can be constructed in linear time.
To detect a quasi-legitimate root, it is necessary to first decide whether C has a total C-cover or whether a non-empty set $K(C)$ of C-cut vertices exists. To this end, a clean C-cover $\mathcal{B}$ can be used efficiently. Recall that by Lemma 8, every interval in a total clean C-cover is extended by at least one other interval from the C-cover. Since a clean C-cover contains at most $|C|$ intervals, it is easy to check in linear time whether a C-cut vertex exists: starting from an arbitrary $C(u,v) \in \mathcal{B}$, we initialize the upper bound of the C-covered part of C that starts at the successor of u by $x := d_C(u,v)$. For every $C(u',v') \in \mathcal{B}$ with $d_C(u,u') < x$, we check whether $d_C(u,u') > d_C(u,v')$, in which case a total cover is found, and otherwise, we update x with $\max(x, d_C(u,v'))$. If no total cover is found when the intervals are exhausted, then the vertex at distance x from u along C is a C-cut vertex (see the proof of Lemma 8). With the intervals $C(u,v)$ stored, e.g., as array $a[u]$, a total cover or a C-cut vertex is found in $O(|C|)$ operations.
In practice, we do not have access to a clean C-cover. However, $L(C)$ can be computed in linear time. By Corollary 9, there is a clean C-cover $\mathcal{B} \subseteq L(C)$. We can thus use the same procedure. The redundant intervals in $L(C)$ are, by definition, contained within intervals belonging to $\mathcal{B}$, and thus, they do not change the results provided the initial interval $C(u,v)$ is contained in the clean cover $\mathcal{B}$. By Corollary 10, this is true for the longest interval $C(u,v) \in L(C)$. Since $L(C)$ contains at most $|C|$ intervals, the longest interval and a cut point or the validation of a total cover can be computed in $O(|C|)$. When $L(C)$ is a total C-cover, the longest interval $C(u,v)$ is contained in a total clean cover, and thus, v is a legitimate root by Lemma 11. Thus, a quasi-legitimate root can be retrieved in $O(|C|)$ time. The entire procedure is summarized in Algorithm 1.
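The sweep over $L(C)$ can be sketched as follows. This Python example assumes that $L(C)$ is given as a dictionary mapping a start vertex to the end of its longest covered interval, that start vertices without an entry only reach their cycle successor (an empty interval), and that $d_C(u,u)$ is the cycle length. Instead of comparing distances measured from u, the sketch tracks how far each interval reaches in absolute terms, which is a small variation chosen here so that intervals wrapping around the cycle are handled uniformly. The function name `quasi_legitimate_root` is hypothetical.

```python
def quasi_legitimate_root(cycle, longest):
    """Return (root, kind): the end of the longest interval if the cover is total,
    otherwise a cut vertex of the cycle."""
    n = len(cycle)
    pos = {v: i for i, v in enumerate(cycle)}

    def length(c):
        # Length of the longest covered interval starting at c (1 means "empty").
        end = longest.get(c, cycle[(pos[c] + 1) % n])
        return (pos[end] - pos[c]) % n or n

    u = max(cycle, key=length)            # start of the overall longest interval
    x = length(u)                         # distances 1 .. x-1 from u are covered so far
    for k in range(1, n + 1):
        c = cycle[(pos[u] + k) % n]
        if k == x:                        # nothing covers the vertex at distance x
            return c, "cut vertex"
        reach = k + length(c)             # the interval from c covers k+1 .. reach-1
        if reach > n:                     # it wraps around and covers u as well
            return longest[u], "total cover"
        x = max(x, reach)


cycle = [0, 1, 2, 3, 4, 5]
total = quasi_legitimate_root(cycle, {0: 4, 3: 1})   # intervals cover the whole cycle
cut = quasi_legitimate_root(cycle, {0: 3, 2: 5})     # vertices 5 and 0 stay uncovered
```

In the first call, the intervals starting at 0 and 3 extend each other, so the end point 4 of the longest interval is returned; in the second call, the sweep stops at the uncovered vertex 5.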
Lemma 20.
Given a cycle C in the digraph G, Algorithm 1 identifies a quasi-legitimate root in C in linear time w.r.t. the size of G [ V [ C ] ] , the induced subgraph of G reachable from C.
Proof. 
The correctness of the algorithm follows from the discussion in the previous paragraphs. The construction of DFS-trees T j together is linear in the size of G [ V [ C ] ] since each edge in G [ V [ C ] ] is considered once. The recursive computation along each T j is also linear. Since the T j are disjoint, the total effort is still linear. □
Finally, we note that by construction, no vertex in G [ V [ C ] ] reaches any cycle C disjoint from G [ V [ C ] ] . Hence, when processing the next cycle C , the vertices (and edges) already visited in the context of processing C are irrelevant, and thus, G [ V [ C ] ] can be disregarded. In other words, the DFS for the next cycle can be performed in the same digraph G, with all previously processed induced subgraphs marked as finished. This ensures an overall linear running time for the identification of starting points for all cycles C i as in Lemma 17.
Algorithm 1 get_root(C, G) computes a C-cover and determines $Q(C)$, as well as a quasi-legitimate root in C.
  • Require: digraph $G = (V, E)$ and cycle C
  • for $c \in C$ do
  •   create DFS-tree $T_c$ with root c by ignoring finished and cycle vertices with preorder $\rho$.
  •   while v traverses $T_c$ in postorder do
  •     $\ell(v) \leftarrow \rho(v)$
  •     for $(v, u) \in G$ do
  •       if $u \in C$ then
  •         Update $c_{\min}[c, v]$ with u
  •         Update $c_{\max}[c, v]$ with u
  •       else if $(v, u)$ is a back edge then
  •         Update $\ell(v)$ with $\rho(u)$
  •       else
  •         if $d_C(c, c_{\min}[c, v]) > d_C(c, c_{\max}[c, v])$ then
  •           return legitimate root $c_{\min}[c, v]$
  •         Update $c_{\min}[c, v]$ with $c_{\min}[c, u]$
  •         Update $c_{\max}[c, v]$ with $c_{\max}[c, u]$
  •         if u is unfinished then
  •           Update $\ell(v)$ with $\ell(u)$
  •     if $\ell(v) = \rho(v)$ then
  •       for u in C-SCC with root v do
  •         $c_{\min}[c, u] \leftarrow c_{\min}[c, v]$
  •         $c_{\max}[c, u] \leftarrow c_{\max}[c, v]$
  •         Set u as finished
  • Set u such that $d_C(c, c_{\max}[c, c]) \le d_C(u, c_{\max}[u, u])$ for every $c \in C$
  • $x = d_C(u, c_{\max}[u, u])$
  • for $c \in C$ in cycle order starting from the successor of u do
  •   if $d_C(u, c) = x$ then
  •     return quasi-legitimate root c
  •   if $d_C(u, c) > d_C(u, c_{\max}[c, c])$ then
  •     return legitimate root $c_{\max}[u, u]$
  •   $x = \max(x, d_C(u, c_{\max}[c, c]))$

2.7. Putting It All Together

Theorem 3.
Algorithm 2 correctly identifies the superbubbles of a digraph G in linear time.
Proof. 
Theorem 2 ensures that for every digraph G, there is a set R of quasi-legitimate roots such that, given R, the algorithm Superbubble# identifies all superbubbles of G in linear time. Every vertex in $V(G)$ is reachable from a source or a cycle in G. By Lemma 5, all sources are legitimate roots. Lemma 17 shows that a set of cycles can be constructed in linear time from which all vertices of G can be reached by DFS. Algorithm 1 identifies a quasi-legitimate root in a cycle (Lemma 20). As discussed in the text following Lemma 20, the effort for this step is again linear in the size of G. Algorithm 2 therefore correctly identifies the superbubbles of a digraph G and does so in $O(|E| + |V|)$ time. □
Algorithm 2 Identification of all superbubbles in an arbitrary digraph G.
  • Require: Digraph G
  • R ← all sources in G
  • generate a random DFS-forest F ^
  • find set W of ≺-maximal vertices with a back edge in F ^
  • generate set C of cycles from W with F ^
  • for all cycles C k ∈ C do
  •   run get root ( C k , G ) to identify quasi-legitimate root r k
  •   add r k to R
  • generate DFS-forest F with root set R
  • run Superbubble# on F
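
The DFS-related steps of Algorithm 2 can be illustrated with a short, self-contained Python sketch. It is not the LSD implementation: the digraph is assumed to be a dictionary mapping every vertex to its list of successors, the function names `find_sources` and `dfs_forest` are hypothetical, and neither the root selection of Algorithm 1 nor Superbubble# is reproduced. The sketch only shows how sources serve as initial roots, how a DFS-forest with a prescribed root order yields a postorder and the back edges that witness cycles, and which vertices still need a root chosen from a cycle.

```python
def find_sources(graph):
    """Return all vertices without incoming edges, in insertion order."""
    has_incoming = {w for v in graph for w in graph[v]}
    return [v for v in graph if v not in has_incoming]


def dfs_forest(graph, roots):
    """Iterative DFS started from each root in turn.

    Returns the postorder of all visited vertices and the list of back edges
    (edges pointing to a vertex that is still on the DFS stack).
    """
    state = {}  # vertex -> "open" (on the stack) or "done"
    postorder, back_edges = [], []
    for root in roots:
        if root in state:
            continue
        state[root] = "open"
        stack = [(root, iter(graph[root]))]
        while stack:
            v, successors = stack[-1]
            for w in successors:
                if w not in state:
                    state[w] = "open"
                    stack.append((w, iter(graph[w])))
                    break
                if state[w] == "open":
                    back_edges.append((v, w))
            else:
                # All successors of v are handled: v is finished.
                state[v] = "done"
                postorder.append(v)
                stack.pop()
    return postorder, back_edges


# Toy digraph: a bubble 2 -> {3, 4} -> 5, a cycle 6 <-> 7 reachable from
# the source 1, and a cycle 8 <-> 9 that no source can reach.
graph = {1: [2], 2: [3, 4], 3: [5], 4: [5], 5: [6], 6: [7], 7: [6], 8: [9], 9: [8]}

roots = find_sources(graph)
postorder, back_edges = dfs_forest(graph, roots)
uncovered = [v for v in graph if v not in set(postorder)]
```

In this toy example the only source is vertex 1, and vertices 8 and 9 remain uncovered. They lie on a cycle that is not reachable from any source, which is exactly the situation in which Algorithm 2 asks Algorithm 1 for a quasi-legitimate root.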

3. Results

We extended the “Linear Superbubble Detection” software LSD [7] (https://github.com/Fabianexe/Superbubble) with the new algorithm presented in the previous section. LSD is written in Python and uses the NetworkX package [10] to handle graph data structures. Since the same data structures are used, benchmarking the different algorithms provided in LSD allows a fair comparison of running times.
In the implementation, we deviated from the presentation above in two minor details. First, instead of using the reverse postorder of the DFS-tree, we directly used postorder and the corresponding (trivial) redefinitions of the helper functions OutChild ( ) and OutParent ( ) . Second, we did not completely separate the determination of the cycles, the identification of the roots, and the identification of the superbubbles. Instead, we performed cycle search, root detection, and superbubble identification immediately for each DFS-tree. Since cycles and superbubbles are necessarily completely contained within the DFS-trees, this does not affect the correctness of the algorithm. As a by-product, we obtained a speedup by a constant factor because cycles reachable within a given DFS-tree were marked as “already processed” in the superbubble detection step and hence were not (superfluously) considered as candidate additional roots.
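
Why the switch from reverse postorder to postorder is trivial can be seen from a small sketch: the two numberings are mirror images, so every comparison of positions simply flips. The vertex list and the helper names below are hypothetical and serve only as an illustration; the actual redefinitions of OutChild and OutParent in LSD are not shown here.

```python
def index_of(order):
    """Map each vertex to its position in the given order."""
    return {v: i for i, v in enumerate(order)}


# Hypothetical DFS postorder of a small DFS-tree rooted at vertex 1.
postorder = [7, 6, 5, 3, 4, 2, 1]
reverse_postorder = postorder[::-1]

po = index_of(postorder)
rpo = index_of(reverse_postorder)
n = len(postorder)

# The two numberings mirror each other.
mirrored = all(rpo[v] == n - 1 - po[v] for v in postorder)


def precedes_in_reverse_postorder(u, v):
    """True if u comes before v in reverse postorder."""
    return rpo[u] < rpo[v]


def precedes_via_postorder(u, v):
    """The same test expressed with postorder positions only."""
    return po[u] > po[v]
```

Any helper that is defined through minima or maxima over reverse-postorder positions can therefore be restated on postorder positions by exchanging the roles of minimum and maximum.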
In order to benchmark the direct detection algorithm in comparison to other linear-time superbubble detection algorithms, we used the same datasets as in our previous work [7]. In order to guarantee comparability, performance data for all algorithms were computed with the same version of LSD on the same hardware. The results are summarized in Table 1.
For most datasets, we observed an approximately three-fold speedup of Directbubble compared to LSD. The exception is the Slashdot dataset for which no performance gain was observed.
To understand this outlier, it is necessary to understand the source of the speedup in the other test cases. In a typical case, both Directbubble and LSD perform three depth-first searches. LSD uses them to determine the SCCs, create the auxiliary graphs, and detect the superbubbles; Directbubble uses them to identify the cycles, the quasi-legitimate roots, and finally the superbubbles. Both need to handle exceptional cases. LSD requires the construction of the Sung graph if an SCC coincides with a connected component of the input graph (rather than being just part of it). Since the Sung graph is twice the size of the SCC, this roughly doubles the running time. Directbubble behaves exceptionally for vertices that are reachable from a source: in this case, the detection of cycles and of quasi-legitimate roots in cycles is skipped, which yields a substantial speedup. When a graph has neither an SCC that is also a connected component nor large subgraphs reachable from a source, LSD and Directbubble perform essentially the same computations and thus perform very similarly. The Slashdot dataset is such a case. Typically, however, directed graphs have some sources, so that Directbubble outperforms its competitors on most real-life graphs.
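
The speedup factors discussed above follow directly from the Db and LSD columns of Table 1. The short sketch below recomputes them; the dictionary layout and the function name `speedup` are illustrative choices, and since the table reports whole seconds, the resulting ratios are necessarily coarse.

```python
# Running times in seconds as (Db, LSD), read from Table 1.
times = {
    "Yeast": (1, 3),
    "EU Mail": (5, 14),
    "Slashdot": (16, 16),
    "Amazon": (13, 59),
    "Google": (26, 95),
    "Wikipedia": (52, 160),
}


def speedup(db_seconds, lsd_seconds):
    """Factor by which Directbubble is faster than LSD."""
    return lsd_seconds / db_seconds


speedups = {name: round(speedup(db, lsd), 2) for name, (db, lsd) in times.items()}

for name, factor in speedups.items():
    print(f"{name:10s} {factor:.2f}x")
```

Slashdot is the only dataset with a factor of 1, in line with the explanation that it offers neither large source-reachable subgraphs for Directbubble to skip nor SCCs that force LSD to build a Sung graph.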

4. Conclusions

In this contribution, we extended the body of results describing properties of superbubbles, a particular class of induced subgraphs of a digraph. The analysis presented here was motivated by the observation that, in principle, all superbubbles in G can be identified in linear time in a single depth-first search, provided the roots of the individual DFS-trees are known beforehand. Our main result is that a suitable set of starting points, which we call quasi-legitimate roots, (1) exists in every digraph and (2) can be identified in linear time using two additional DFSs. In the first pass, a suitable set of cycles is constructed such that every vertex in G is reachable from a source vertex or from one of these cycles. In the second pass, a peculiar structure of “detours” in a cycle C is used to identify a quasi-legitimate root in that cycle. To this end, we defined a notion of C -reachability that may also be interesting in its own right to characterize (short) cycles.
A comparison of the running times of Directbubble and previous approaches shows that practically useful performance gains are obtained essentially from two sources: (1) we dispense with the construction of auxiliary graphs and (2) we can avoid most of the processing for all vertices reachable from a source in G. In practice, we observed a speedup of about a factor of three on most, but not all, benchmark cases. In all cases, Directbubble performed at least as well as all competing algorithms for superbubble detection.

Author Contributions

F.G. and P.F.S. designed the study, developed the theoretical results, and wrote the manuscript. F.G. implemented the algorithm and evaluated its performance.

Funding

This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B). The authors acknowledge support from the German Research Foundation (DFG) and Universität Leipzig within the program of Open Access Publishing.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Paten, B.; Eizenga, J.M.; Rosen, Y.M.; Novak, A.M.; Garrison, E.; Hickey, G. Superbubbles, Ultrabubbles, and Cacti. J. Comput. Biol. 2018, 25, 649–663.
  2. Onodera, T.; Sadakane, K.; Shibuya, T. Detecting superbubbles in assembly graphs. In Proceedings of the International Workshop on Algorithms in Bioinformatics, Sophia Antipolis, France, 2–4 September 2013; Darling, A., Stoye, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8126, pp. 338–348.
  3. Simpson, J.T.; Pop, M. The Theory and Practice of Genome Sequence Assembly. Annu. Rev. Genomics Hum. Genet. 2015, 16, 153–172.
  4. Baichoo, S.; Ouzounis, C.A. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment. Biosystems 2017, 156–157, 72–85.
  5. Sung, W.K.; Sadakane, K.; Shibuya, T.; Belorkar, A.; Pyrogova, I. An O(m log m)-time algorithm for detecting superbubbles. IEEE/ACM Trans. Comput. Biol. Bioinf. 2015, 12, 770–777.
  6. Brankovic, L.; Iliopoulos, C.S.; Kundu, R.; Mohamed, M.; Pissis, S.P.; Vayani, F. Linear-time superbubble identification algorithm for genome assembly. Theor. Comput. Sci. 2016, 609, 374–383.
  7. Gärtner, F.; Müller, L.; Stadler, P.F. Superbubbles revisited. Algorithms Mol. Biol. 2018, 13, 16.
  8. Tarjan, R. Depth-First Search and Linear Graph Algorithms. SIAM J. Comput. 1972, 1, 146–160.
  9. Acuña, V.; Grossi, R.; Italiano, G.F.; Lima, L.; Rizzi, R.; Sacomoto, G.; Sagot, M.F.; Sinaimeri, B. On Bubble Generators in Directed Graphs. In Graph-Theoretic Concepts in Computer Science, 43rd ed.; Bodlaender, H.L., Woeginger, G.J., Eds.; Lecture Notes in Computer Science; Springer: Heidelberg, Germany, 2017; Volume 10520, pp. 18–31.
  10. Hagberg, A.; Schult, D.A.; Swart, P. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy 2008), Pasadena, CA, USA, 19–24 August 2008; pp. 11–16.
  11. Gärtner, F.; Höner zu Siederdissen, C.; Müller, L.; Stadler, P.F. Coordinate Systems for Supergenomes. Algorithms Mol. Biol. 2018, 13, 15.
  12. Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. Available online: http://snap.stanford.edu/data (accessed on 26 November 2018).
Figure 1. Illustration of the algorithm Superbubble on a digraph G with cycles. The top panel shows the input digraph. The DFS-tree T is rooted at vertex 1 and covers V [ r ] = V ( G ) \ { 15 } . The table below gives the values of OutParent and OutChild as a function of the reverse postorder π ¯ of T. In the final line, matching pairs of parentheses indicate entrances and exits of the weak superbubbles in V [ r ] . These correspond to the intervals that fulfill (F1) and (F2).
Figure 2. Relationships of distinct C-intervals. The four possibilities for the relative location of two distinct C-intervals are shown on a linear layout of a cycle C with five vertices ( 0 , 1 , 2 , 3 , 4 ). Top left: the C-intervals from 0 to 2 and from 2 to 4 are disjoint. Top right: the C-interval from 0 to 4 includes the one from 1 to 3. Bottom left: the C-interval from 1 to 3 extends the one from 0 to 2, but not vice versa. Bottom right: the C-intervals from 0 to 4 and from 3 to 1 extend each other. Together, the two C-intervals cover C.
Figure 3. C -covers. (a) The green cycle C in the top panel has five C -paths indicated in red. In the middle panel, C is laid out linearly to emphasize the C -covered intervals. Below, the clean C -cover obtained by removing all C -intervals that are contained in longer ones. Note that every C -interval is extended by another one; hence, the C -cover is total. (b) Again, the top panel highlights C in green and the C -paths in red. The linear layout below highlights that Vertex 1 is not C -covered. Thus, it is a C -cut vertex.
Figure 4. One-vertex cover. As in Figure 3, the cycle C and the C -paths are highlighted in green and red, respectively. The paths ( 0 , 5 , 4 ) and ( 3 , 5 , 1 ) imply that the C-intervals from 0 to 4 and from 3 to 1 are C -covered. This is a one-vertex cover conforming to Lemma 9.
Figure 5. A digraph G without any legitimate root. G contains 16 isomorphic cycles, each comprising eight of the 12 vertices and all of them containing { 1 , 3 , 5 , 7 } . The four superbubbles, with (entrance, exit) pairs (1, 3), (3, 5), (5, 7), and (7, 1), cover G entirely, i.e., every entrance of a superbubble is also the exit of another one, and all other vertices are interior vertices of a superbubble.
Table 1. Comparison of running times. Five combinations of algorithms are compared. Db (Directbubble) refers to the new approach described in this contribution. LSD (using the auxiliary graphs G ^ ( C ) and the stack-based superbubble detector) refers to the algorithm proposed in [7]. S + LSD combines the Sung graphs as auxiliary graphs [5] with the LSD stack-based detector plus a post-filter for the false positives. LSD + B uses the LSD graph construction with the range-query-based detector of [6], and S + B uses Sung graphs together with the range-query-based detector, as well as the necessary post-filters; see [7] for full details. All computations were performed on a 2.5-GHz quad-core Intel Core i7 processor (Turbo Boost up to 3.7 GHz) with 6-MB shared L3 cache and 16 GB of 1600-MHz DDR3L onboard memory. Test datasets were taken from [11] and from the Stanford Large Network Dataset Collection [12]. For each test graph, we list the number of vertices N, the number of edges M, and the number S of superbubbles. Running times are given in seconds.
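| Data | N | M | S | Db (s) | LSD (s) | S + LSD (s) | LSD + B (s) | S + B (s) |
|---|---|---|---|---|---|---|---|---|
| Yeast | 49,795 | 130,993 | 325 | 1 | 3 | 4 | 5 | 9 |
| EU Mail | 265,214 | 420,045 | 13,285 | 5 | 14 | 16 | 30 | 32 |
| Slashdot | 82,168 | 948,464 | 0 | 16 | 16 | 30 | 22 | 37 |
| Amazon | 403,394 | 3,387,388 | 3 | 13 | 59 | 93 | 84 | 159 |
| Google | 875,713 | 5,105,039 | 6477 | 26 | 95 | 147 | 152 | 255 |
| Wikipedia | 2,394,385 | 5,021,410 | 4737 | 52 | 160 | 164 | 382 | 418 |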

Share and Cite

MDPI and ACS Style

Gärtner, F.; Stadler, P.F. Direct Superbubble Detection. Algorithms 2019, 12, 81. https://doi.org/10.3390/a12040081
