# Modified Classical Graph Algorithms for the DNA Fragment Assembly Problem


## Abstract

## 1. Introduction

n | Observed | Probability | Expected Value |
---|---|---|---|
15 | 101 | 0.000000475 | 2.528889873 |
16 | 74 | 0.000000147 | 0.784225487 |
17 | 65 | 0.000000046 | 0.242640081 |
18 | 51 | 0.000000014 | 0.074884674 |
19 | 40 | 4.32861E−09 | 0.023046972 |
20 | 39 | 1.32807E−09 | 0.007071095 |
21 | 18 | 4.06052E−10 | 0.002161959 |
22 | 28 | 1.23662E−10 | 0.000658416 |
23 | 14 | 3.74924E−11 | 0.000199622 |
24 | 15 | 1.13089E−11 | 0.000060212 |
25 | 22 | 3.3908E−12 | 0.000018054 |
26 | 15 | 1.00958E−12 | 0.000005375 |
27 | 22 | 2.9809E−13 | 0.000001587 |
28 | 11 | 8.71281E−14 | 0.000000464 |
29 | 15 | 2.51495E−14 | 0.000000134 |
30 | 10 | 7.14487E−15 | 3.80417E−08 |
31 | 9 | 1.98796E−15 | 1.05846E−08 |
32 | 6 | 5.37563E−16 | 2.86217E−09 |
33 | 4 | 1.3946E−16 | 7.42531E−10 |
34 | 4 | 3.38757E−17 | 1.80366E−10 |
35 | 1 | 7.29107E−18 | 3.88201E−11 |

## 2. Use of Graph Theory

#### 2.1. Generalities

#### 2.2. DNA Fragment Assembly as a Graph

Fragment number | Sequence |
---|---|
1 | GTGTACCACGTACTGATGTACTATTTGAAGCTTAT |
2 | CCCAATTCCTAATGTACTATTTGAAGCTTATTCGG |
3 | CATAAGCTTCATGATGAAGCTTATTCGGCCAATCG |
4 | TTTGATTCCTGCTGATGTACTATTTGATGAAGCTT |
5 | ATGTACTATTTGAAGCTTATTCGGCCAATCGTACT |
6 | GAAGCTTATTCGGCCAATCGTACTGATGTACTATT |
7 | CTTATTCGGCCAATCGTACTATTTACTGATGTACA |
8 | TGATGAAGCTTATTCGGCCAATCGTACTGATGTAC |
9 | GGCCAATCGTACTGATGTACTATTTGATGAAGCTT |
10 | CTGATGTACTATTTGATGAAGCTTATTCGGCCAAT |
11 | TGTACTATTTGATGAAGCTTATCAGTACGTGGAAC |
12 | AATCGTACTGATGTACTATTTACTGATGTACAATA |
13 | CTATTTACTGATGTACAATAGTACATCAGTAAAAA |
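As an illustration of how overlap weights between fragments like these can be obtained, the following minimal Python sketch computes the longest suffix-prefix overlap between two fragments. The function name `overlap` and the brute-force search are assumptions made for this example, not the authors' implementation; practical assemblers use faster indexing schemes.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Hypothetical helper for illustration; returns 0 when no overlap of
    at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


# Fragments 1, 2 and 5 from the table above.
F1 = "GTGTACCACGTACTGATGTACTATTTGAAGCTTAT"
F2 = "CCCAATTCCTAATGTACTATTTGAAGCTTATTCGG"
F5 = "ATGTACTATTTGAAGCTTATTCGGCCAATCGTACT"

print(overlap(F1, F5))  # 20
print(overlap(F2, F5))  # 24
```

Under this reading, an edge from fragment 1 to fragment 5 would carry weight 20 and an edge from fragment 2 to fragment 5 weight 24, which appears to agree with the values listed for fragments 1 and 2 in the overlap matrix.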

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 20 | ||||||||||||

2 | 24 | ||||||||||||

3 | 20 | 24 | |||||||||||

4 | 24 | ||||||||||||

5 | 20 | 20 | |||||||||||

6 | 24 | 20 | |||||||||||

7 | 17 | ||||||||||||

8 | 20 | ||||||||||||

9 | 24 | 20 | |||||||||||

10 | 22 | ||||||||||||

11 | |||||||||||||

12 | 20 | ||||||||||||

13 |

#### 2.3. Objective Function

## 3. Algorithms

#### 3.1. Basic Algorithm for F_{1} and F_{2}

For F_{1}, we propose an algorithm similar to topological sorting. Let G = (V, E) be a directed acyclic graph (DAG), where V is the set of vertices, E is the set of edges, and w(u, v) is the weight of edge u→v. In this graph, a longest path starts at a vertex with no incoming edges (an initial vertex) and continues through vertices reachable from it. Note that there may be several initial vertices and that no single one is privileged. Let $S\subseteq V$ be the set of vertices with in-degree d_{in}(v) = 0 and let Q be a stack. The longest distance from an initial vertex to each of the other vertices is determined as follows (Algorithm 1):

**Algorithm 1.** Longest distance from initial vertices

1. For each vertex $v\in S$ (that is, with d_{in}(v) = 0)

1.1. push(Q,v)

1.2. d(v):=0

1.3. origin(v):=v

2. While Q is not empty

2.1. u:=pop(Q)

2.2. For each vertex v with $(u,v)\in E$

2.2.1. d_{in}(v):=d_{in}(v)−1

2.2.2. If d_{in}(v)=0

2.2.2.1. push(Q,v)

2.2.2.2. $d(v):=\underset{(x,v)\in E}{\mathrm{max}}\left(d(x)+w(x,v)\right)$

2.2.2.3. father(v):=x, where x is the predecessor that attains the maximum in Step 2.2.2.2
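The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the graph is given as a list of `(u, v, w)` triples, the function name `longest_distances` is hypothetical, and the maximum over predecessors is accumulated edge by edge rather than evaluated once when the in-degree reaches zero, which yields the same value on a DAG. Propagating `origin` to non-initial vertices follows Algorithm 5 rather than Algorithm 1.

```python
from collections import defaultdict


def longest_distances(edges):
    """Longest distance from the initial vertices of a DAG.

    `edges` is an iterable of (u, v, w) triples. Returns three dicts:
    dist, father (predecessor on a longest path) and origin (initial
    vertex where that path starts).
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    dist, father, origin = {}, {}, {}
    stack = []
    for v in nodes:                      # Step 1: initial vertices
        if indeg[v] == 0:
            stack.append(v)
            dist[v] = 0
            origin[v] = v

    while stack:                         # Step 2
        u = stack.pop()
        for v, w in succ[u]:
            indeg[v] -= 1
            # Running maximum over the predecessors seen so far.
            if v not in dist or dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                father[v] = u
                origin[v] = origin[u]
            if indeg[v] == 0:            # all predecessors processed
                stack.append(v)
    return dist, father, origin


dist, father, origin = longest_distances(
    [("a", "b", 2), ("a", "c", 1), ("c", "b", 5), ("b", "d", 1)]
)
print(dist["d"], father["b"], origin["d"])  # 7 c a
```

Each edge is examined once, so the running time of this sketch is linear in the number of vertices and edges.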

#### 3.2. Constant Time Heap

The heap uses a stack P_{i} for each possible value to be introduced. The heap operations are:

**Algorithm 2.** Stack heap insert and extract-min

insert x

1. push(P_{x},x)

2. If x < xmin

2.1. xmin:=x

3. If x > xmax

3.1. xmax:=x

extract-min

1. x:=pop(P_{xmin})

2. While the stack P_{xmin} is empty and xmin ≤ xmax

2.1. xmin:=xmin+1

In extract-min, the element on top of the stack P_{xmin} is extracted. If that stack becomes empty, the next non-empty stack, up to the maximum value x_{max}, is used. When the number of stacks is large, it can be useful to locate the next non-empty stack with some other kind of heap, such as a binary or a Fibonacci heap, instead of sequential search. In the example that we provide in Section 4, we use no more than 34 stacks (0 ≤ x ≤ 34), hence sequential search turns out to be faster. The time complexity of the insert operation is constant because Steps 1 through 3.1 always require the same amount of time. In the extract-min operation, Step 1 takes constant time, and Step 2, which only iterates when the stack holding the minimum value has become empty, depends in the worst case on the value of x_{max}: it is linear in x_{max} with sequential search, or logarithmic if a heap is used. In DNA fragment assembly, millions of edges or nodes are inserted into the heap, whereas x_{max} should be on the order of a few hundred, which makes the probability of Step 2 executing more than once negligible. In any case, since the time complexity of our heap is independent of the number of edges or vertices inserted into it, Prim’s algorithm [15] as well as Kruskal’s algorithm [16] become linear. Even though our technique is designed for the particular case of DNA fragment assembly, it could possibly be used in other problems.
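A minimal Python sketch of the stack heap in Algorithm 2 is given below. The class name `StackHeap`, the `max_key` constructor argument, and the choice to store an arbitrary item alongside each integer key are assumptions made for illustration; the sequential search in `extract_min` mirrors Step 2 of extract-min.

```python
class StackHeap:
    """Bucket-style min-heap for small non-negative integer keys (sketch)."""

    def __init__(self, max_key: int):
        self.stacks = [[] for _ in range(max_key + 1)]  # one stack P_x per key
        self.xmin = max_key + 1   # sentinel: no key inserted yet
        self.xmax = -1
        self.size = 0

    def insert(self, key: int, item=None) -> None:
        self.stacks[key].append(item)
        if key < self.xmin:
            self.xmin = key
        if key > self.xmax:
            self.xmax = key
        self.size += 1

    def extract_min(self):
        if self.size == 0:
            raise IndexError("extract_min from an empty heap")
        key = self.xmin
        item = self.stacks[key].pop()
        self.size -= 1
        # Sequential search for the next non-empty stack (Step 2).
        while self.xmin <= self.xmax and not self.stacks[self.xmin]:
            self.xmin += 1
        return key, item


heap = StackHeap(max_key=34)
for key in (20, 24, 17, 20):
    heap.insert(key)
print([heap.extract_min()[0] for _ in range(4)])  # [17, 20, 20, 24]
```

When the heap empties, `xmin` ends at `xmax + 1`, so any later insert lowers it again through the `key < self.xmin` test.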

#### 3.3. MST in Linear Time

**Algorithm 3.** Prim’s algorithm

1. For some vertex u

1.1. Put u in S

1.2. For each $(u,v)\in E$

1.2.1. insert w(u,v) in the heap

2. While the heap is not empty

2.1. Extract the edge (u, v) from the heap

2.1.1. If $u\notin S$

2.1.1.1. Put u in S and (u,v) in F

2.1.1.1.1. For each $(u,x)\in E$ insert w(u,x) in the heap

2.1.2. If $v\notin S$

2.1.2.1. Put v in S and (v,u) in F

2.1.2.1.1. For each $(v,x)\in E$ insert w(v,x) in the heap
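To make the interaction between Prim’s algorithm and a stack-per-weight heap concrete, here is a hedged Python sketch. The function name `prim_bucketed`, the adjacency-dictionary input, and the `max_weight` parameter are assumptions for this example. It grows a single tree from `start` and always extracts a lightest pending edge.

```python
def prim_bucketed(adj, start, max_weight):
    """Prim's algorithm with one stack of pending edges per integer weight.

    `adj` maps each vertex to a list of (neighbor, weight) pairs of an
    undirected graph. Returns the tree edges as (u, v, w) triples.
    """
    buckets = [[] for _ in range(max_weight + 1)]
    in_tree = {start}
    forest = []
    for v, w in adj[start]:
        buckets[w].append((start, v))

    cur = 0                       # smallest weight that may be non-empty
    while cur <= max_weight:
        if not buckets[cur]:
            cur += 1
            continue
        u, v = buckets[cur].pop()
        if v in in_tree:          # edge no longer leaves the tree
            continue
        in_tree.add(v)
        forest.append((u, v, cur))
        for x, w in adj[v]:
            if x not in in_tree:
                buckets[w].append((v, x))
                cur = min(cur, w)  # a lighter edge may have appeared
    return forest


adj = {
    0: [(1, 3), (2, 2)],
    1: [(0, 3), (2, 1)],
    2: [(1, 1), (0, 2), (3, 4)],
    3: [(2, 4)],
}
tree = prim_bucketed(adj, start=0, max_weight=4)
print(sum(w for _, _, w in tree))  # 7
```

Because `cur` can move back when a lighter edge is inserted, the scan over weights is bounded by the number of distinct weights per extraction, which stays small when the weights are bounded overlap lengths.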

**Algorithm 4.** Kruskal’s algorithm

1. Insert E in a heap in increasing order of w(u,v)

2. Create a forest F where each vertex is an independent tree

3. While the heap is not empty

3.1. Extract (u,v) from the heap

3.1.1. If u and v belong to different trees, merge both trees
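Kruskal’s algorithm can be sketched along the same lines. In the Python sketch below, edges are distributed into one list per integer weight, which stands in for the heap, and the trees are tracked with a union-find structure using path halving. The function name `kruskal_bucketed` and the integer vertex labels are assumptions for illustration, and the union-find step is near-linear rather than strictly linear.

```python
def kruskal_bucketed(num_vertices, edges, max_weight):
    """Kruskal's algorithm with edges pre-sorted into per-weight buckets.

    `edges` is an iterable of (u, v, w) with vertices 0..num_vertices-1
    and integer weights 0..max_weight. Returns the chosen edges.
    """
    buckets = [[] for _ in range(max_weight + 1)]
    for u, v, w in edges:
        buckets[w].append((u, v))

    parent = list(range(num_vertices))   # each vertex starts as its own tree

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, bucket in enumerate(buckets):   # increasing weight
        for u, v in bucket:
            ru, rv = find(u), find(v)
            if ru != rv:                   # different trees: merge them
                parent[ru] = rv
                forest.append((u, v, w))
    return forest


mst = kruskal_bucketed(4, [(0, 1, 3), (1, 2, 1), (0, 2, 2), (2, 3, 4)], max_weight=4)
print(mst)  # [(1, 2, 1), (0, 2, 2), (2, 3, 4)]
```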

#### 3.4. Modification of the Basic Algorithm

If a node u belongs to a cycle, d_{in}(u) never goes to zero because at least one of its incoming edges comes from the cycle. When Algorithm 1 is done, we can take one of the nodes that went through Step 2.2.1 but remained with a final value of d_{in}(u) > 0, and insert it into the stack to continue with Step 2 of the algorithm. It is necessary, however, to mark those nodes so that they will not be considered again, which prevents the cycle from being traversed more than once. The total distance from a start node to the last processed node in the cycle is the distance from the start node to the cycle entry point, plus the length of the cycle, minus the weight of the edge that would close the cycle on the entry node. So, if the selected node is s, the previous node in the cycle is t, and the total length of the cycle is C, then:
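Read literally, the preceding sentences correspond to the relation below. It is a reconstruction from the prose, assuming that d(·) denotes the longest distance from the start node as in Algorithm 1 and that w(t, s) is the weight of the edge that would close the cycle on s:

```latex
d(t) = d(s) + C - w(t, s)
```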

**Algorithm 5.** Modified maximum distance algorithm

1. For each vertex $v\in S$ (that is, with d_{in}(v) = 0)

1.1. push(Q,v)

1.2. d(v):=0

1.3. origin(v):=v

2. While there are vertices v with d_{in}(v) > 0

2.1. While Q is not empty

2.1.1. u:=pop(Q)

2.1.2. For every vertex v where $(u,v)\in E$ and v is not marked

2.1.2.1. d_{in}(v):=d_{in}(v)−1

2.1.2.2. If d_{in}(v)=0

2.1.2.2.1. push(Q,v)

2.1.2.2.2. $d(v):=\underset{(x,v)\in E}{\mathrm{max}}\left(d(x)+w(x,v)\right)$

2.1.2.2.3. insert d(v) in a heap

2.1.2.2.4. father(v):=x, where x is the predecessor that attains the maximum in Step 2.1.2.2.2

2.1.2.2.5. origin(v):=origin(x)

2.2. If the heap is not empty

2.2.1. Extract the node v from the heap

2.2.1.1. If d_{in}(v)>0

2.2.1.1.1. Make d_{in}(v)=0

2.2.1.1.2. Insert v in the stack

2.2.1.1.3. Mark the node v

2.2.1.1.4. Go to Step 2.1
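The idea behind Algorithm 5 can be illustrated with the simplified Python sketch below. It is not a line-by-line transcription. It assumes that tentative distances are updated as each edge is processed, that the node chosen to break a cycle is the waiting node with the smallest tentative distance (a plain `min` stands in for the heap), and that vertices lying on cycles unreachable from any initial vertex are simply left without a distance. The function name `longest_distances_with_cycles` is hypothetical.

```python
from collections import defaultdict


def longest_distances_with_cycles(edges):
    """Longest-distance sketch that tolerates cycles (assumptions above)."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))

    dist = {}
    marked = set()    # cycle entry nodes forced into the stack
    done = set()      # nodes already popped
    stack = [v for v in nodes if indeg[v] == 0]
    for v in stack:
        dist[v] = 0

    while True:
        while stack:
            u = stack.pop()
            done.add(u)
            for v, w in succ[u]:
                if v in marked or v in done:
                    continue          # never re-enter a broken cycle
                indeg[v] -= 1
                dist[v] = max(dist.get(v, 0), dist[u] + w)
                if indeg[v] == 0:
                    stack.append(v)
        # Nodes reached by some edge but still waiting on a cycle edge.
        waiting = [v for v in dist if v not in done and indeg[v] > 0]
        if not waiting:
            break
        s = min(waiting, key=lambda v: dist[v])   # assumed selection rule
        indeg[s] = 0
        marked.add(s)
        stack.append(s)
    return dist


# a -> b, then the cycle b -> c -> d -> b
print(longest_distances_with_cycles(
    [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("d", "b", 1)]
))
```

In this example the cycle has length C = 6 and is entered at b with d(b) = 1, so the last node on the cycle gets d(d) = 1 + 6 − 1 = 6, in line with the description in the text.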

#### 3.5. Assembly Algorithm

## 4. Experiments

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 186 | 258 | 575 | 837 | 2046 |
Total number of bases | 26,284 | 102,684 | 314,120 | 788,198 | 2,526,278 |
Contigs found | 184 | 256 | 565 | 823 | 1920 |
Contigs found (%) | 98.9 | 99.2 | 98.3 | 98.3 | 93.8 |
Bases found | 26,024 | 102,249 | 310,965 | 777,583 | 2,463,563 |
Bases found (%) | 99 | 99.6 | 99 | 98.7 | 97.5 |
N50 (contigs found) | 188 | 1247 | 2911 | 4132 | 5250 |
Average length (contigs found) | 141.4 | 399.4 | 550.4 | 944.8 | 1283.10 |
Minimum length (contigs found) | 52 | 52 | 52 | 52 | 52 |
Maximum length (contigs found) | 880 | 7295 | 10,985 | 20,521 | 20,579 |
Contigs not found | 2 | 2 | 10 | 14 | 126 |
Contigs not found (%) | 1.1 | 0.8 | 1.7 | 1.7 | 6.2 |
Bases not found | 260 | 435 | 3155 | 10,615 | 62,715 |
Bases not found (%) | 1 | 0.4 | 1 | 1.3 | 2.5 |
N50 (contigs not found) | NA | NA | NA | NA | 5599 |
Average length (contigs not found) | 130 | 217.5 | 315.5 | 758.2 | 497.7 |
Minimum length (contigs not found) | 99 | 99 | 56 | 56 | 52 |
Maximum length (contigs not found) | 161 | 336 | 2521 | 5941 | 18,158 |

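Algorithm | Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|---|
Edena | Contigs found | 184 | 256 | 565 | 823 | 1920 |
Edena | Contigs found (%) | 98.9 | 99.2 | 98.3 | 98.3 | 93.8 |
Edena | Bases found | 26,024 | 102,249 | 310,965 | 777,583 | 2,463,563 |
Edena | Bases found (%) | 99 | 99.6 | 99 | 98.7 | 97.5 |
Algorithm 1, F_{1} objective function, all overlaps | Contigs found | 74 | 101 | 149 | 204 | 480 |
Algorithm 1, F_{1} objective function, all overlaps | Contigs found (%) | 90.2 | 88.6 | 91.4 | 91.1 | 94.1 |
Algorithm 1, F_{1} objective function, all overlaps | Bases found | 15,822 | 70,250 | 206,826 | 463,582 | 1,102,152 |
Algorithm 1, F_{1} objective function, all overlaps | Bases found (%) | 81.8 | 83 | 86.1 | 83.9 | 83.4 |
Algorithm 1, F_{2} objective function, all overlaps | Contigs found | 74 | 102 | 146 | 189 | 450 |
Algorithm 1, F_{2} objective function, all overlaps | Contigs found (%) | 90.2 | 89.5 | 89.6 | 84.4 | 88.2 |
Algorithm 1, F_{2} objective function, all overlaps | Bases found | 15,515 | 70,843 | 189,927 | 369,001 | 931,343 |
Algorithm 1, F_{2} objective function, all overlaps | Bases found (%) | 81.5 | 83.8 | 79 | 66.8 | 69.9 |
Algorithm 1, F_{1} objective function, no transitive overlaps | Contigs found | 74 | 102 | 149 | 204 | 481 |
Algorithm 1, F_{1} objective function, no transitive overlaps | Contigs found (%) | 90.2 | 89.5 | 91.4 | 91.1 | 93.9 |
Algorithm 1, F_{1} objective function, no transitive overlaps | Bases found | 15,806 | 71,652 | 206,826 | 462,774 | 1,103,587 |
Algorithm 1, F_{1} objective function, no transitive overlaps | Bases found (%) | 83.7 | 84.6 | 86.1 | 83.9 | 82.7 |
Algorithm 1, F_{2} objective function, no transitive overlaps | Contigs found | 75 | 103 | 142 | 181 | 417 |
Algorithm 1, F_{2} objective function, no transitive overlaps | Contigs found (%) | 91.5 | 90.4 | 87.1 | 80.8 | 81.4 |
Algorithm 1, F_{2} objective function, no transitive overlaps | Bases found | 15,599 | 72,245 | 174,244 | 320,502 | 791,376 |
Algorithm 1, F_{2} objective function, no transitive overlaps | Bases found (%) | 84 | 85.4 | 72.5 | 58.1 | 58.9 |
Algorithm 1, F_{1} objective function, MST | Contigs found | 304 | 1827 | 5980 | 15,443 | 46,187 |
Algorithm 1, F_{1} objective function, MST | Contigs found (%) | 91 | 97 | 98.4 | 98.9 | 95.7 |
Algorithm 1, F_{1} objective function, MST | Bases found | 35,980 | 256,641 | 815,994 | 2,081,306 | 6,208,258 |
Algorithm 1, F_{1} objective function, MST | Bases found (%) | 89.8 | 96.6 | 98.1 | 98.7 | 93.9 |
Algorithm 5, F_{1} objective function, no transitive edges | Contigs found | 74 | 102 | 150 | 205 | 480 |
Algorithm 5, F_{1} objective function, no transitive edges | Contigs found (%) | 90.2 | 88.7 | 91.5 | 91.1 | 94.1 |
Algorithm 5, F_{1} objective function, no transitive edges | Bases found | 15,822 | 70,318 | 206,927 | 463,684 | 1,102,152 |
Algorithm 5, F_{1} objective function, no transitive edges | Bases found (%) | 81.8 | 83 | 86.1 | 83.9 | 83.4 |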

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 82 | 114 | 163 | 224 | 510 |
Total number of bases | 19,343 | 84,659 | 240,314 | 552,408 | 1,321,969 |
Contigs found | 74 | 101 | 149 | 204 | 480 |
Contigs found (%) | 90.2 | 88.6 | 91.4 | 91.1 | 94.1 |
Bases found | 15,822 | 70,250 | 206,826 | 463,582 | 1,102,152 |
Bases found (%) | 81.8 | 83 | 86.1 | 83.9 | 83.4 |
N50 (contigs found) | 360 | 1614 | 3326 | 4937 | 4499 |
Average length (contigs found) | 213.8 | 695.5 | 1388.10 | 2272.50 | 2296.20 |
Minimum length (contigs found) | 54 | 58 | 54 | 62 | 53 |
Maximum length (contigs found) | 882 | 7261 | 10,943 | 22,745 | 24,396 |
Contigs not found | 8 | 13 | 14 | 20 | 30 |
Contigs not found (%) | 9.8 | 11.4 | 8.6 | 8.9 | 5.9 |
Bases not found | 3521 | 14,409 | 33,488 | 88,826 | 219,817 |
Bases not found (%) | 18.2 | 17 | 13.9 | 16.1 | 16.6 |
N50 (contigs not found) | 758 | 2308 | 3954 | 7258 | 11,798 |
Average length (contigs not found) | 440.1 | 1108.40 | 2392.00 | 4441.30 | 7327.20 |
Minimum length (contigs not found) | 95 | 95 | 95 | 181 | 255 |
Maximum length (contigs not found) | 826 | 3703 | 6809 | 9542 | 23,614 |

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 82 | 114 | 163 | 224 | 510 |
Total number of bases | 19,028 | 84,584 | 240,280 | 552,509 | 1,331,757 |
Contigs found | 74 | 102 | 146 | 189 | 450 |
Contigs found (%) | 90.2 | 89.5 | 89.6 | 84.4 | 88.2 |
Bases found | 15,515 | 70,843 | 189,927 | 369,001 | 931,343 |
Bases found (%) | 81.5 | 83.8 | 79 | 66.8 | 69.9 |
N50 (contigs found) | 356 | 1527 | 3184 | 4298 | 4201 |
Average length (contigs found) | 209.7 | 694.5 | 1300.90 | 1952.40 | 2069.70 |
Minimum length (contigs found) | 54 | 58 | 54 | 59 | 53 |
Maximum length (contigs found) | 882 | 7261 | 10,943 | 21,185 | 16,135 |
Contigs not found | 8 | 12 | 17 | 35 | 60 |
Contigs not found (%) | 9.8 | 10.5 | 10.4 | 15.6 | 11.8 |
Bases not found | 3513 | 13,741 | 50,353 | 183,508 | 400,414 |
Bases not found (%) | 18.5 | 16.2 | 21 | 33.2 | 30.1 |
N50 (contigs not found) | 758 | 2308 | 6336 | 7620 | 9472 |
Average length (contigs not found) | 439.1 | 1145.10 | 2961.90 | 5243.10 | 6673.60 |
Minimum length (contigs not found) | 95 | 96 | 103 | 181 | 251 |
Maximum length (contigs not found) | 826 | 3703 | 7626 | 22,745 | 24,396 |

Comparing F_{1} and F_{2} using all overlaps, we find that F_{1} is slightly better than F_{2}, except for problem 89,718, where F_{2} finds one more contig and more bases. In terms of multi-objective optimization, the solution obtained for F_{1} does not dominate the solution obtained for F_{2}, but since it is better in most of the tests, we can say that it is almost dominant. In the other experiments, we find a similar situation, where F_{1} is almost dominant with respect to F_{2}.
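The notion of dominance used here can be checked mechanically. The Python sketch below applies the usual Pareto definition (at least as good everywhere, strictly better somewhere) to the contigs-found and bases-found rows reported for Algorithm 1 with all overlaps in Table 5. The helper name `dominates` and the choice of these two rows as the compared criteria are assumptions for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every entry and strictly
    better in at least one (all entries are to be maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Contigs found, then bases found, for the five problems (Table 5, all overlaps).
f1 = [74, 101, 149, 204, 480, 15822, 70250, 206826, 463582, 1102152]
f2 = [74, 102, 146, 189, 450, 15515, 70843, 189927, 369001, 931343]

f1_wins = sum(x > y for x, y in zip(f1, f2))
f2_wins = sum(y > x for x, y in zip(f1, f2))

print(dominates(f1, f2), dominates(f2, f1))  # False False
print(f1_wins, f2_wins)                      # 7 2
```

Neither solution dominates the other, but F_{1} is ahead in seven of the ten entries, which is the sense in which it is described as almost dominant.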

The experiment was repeated without transitive overlaps for F_{1} and F_{2}. The results are given in Table 5, Table 8 and Table 9.

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 82 | 114 | 163 | 224 | 512 |
Total number of bases | 18,879 | 84,646 | 240,314 | 551,600 | 1,333,881 |
Contigs found | 74 | 102 | 149 | 204 | 481 |
Contigs found (%) | 90.2 | 89.5 | 91.4 | 91.1 | 93.9 |
Bases found | 15,806 | 71,652 | 206,826 | 462,774 | 1,103,587 |
Bases found (%) | 83.7 | 84.6 | 86.1 | 83.9 | 82.7 |
N50 (contigs found) | 360 | 1527 | 3326 | 4937 | 4499 |
Average length (contigs found) | 213.6 | 702.5 | 1388.10 | 2268.50 | 2294.40 |
Minimum length (contigs found) | 54 | 58 | 54 | 62 | 53 |
Maximum length (contigs found) | 882 | 7261 | 10,943 | 22,745 | 24,396 |
Contigs not found | 8 | 12 | 14 | 20 | 31 |
Contigs not found (%) | 9.8 | 10.5 | 8.6 | 8.9 | 6.1 |
Bases not found | 3073 | 12,994 | 33,488 | 88,826 | 230,294 |
Bases not found (%) | 16.3 | 15.4 | 13.9 | 16.1 | 17.3 |
N50 (contigs not found) | 539 | 2308 | 3954 | 7258 | 11,657 |
Average length (contigs not found) | 384.1 | 1082.80 | 2392.00 | 4441.30 | 7428.80 |
Minimum length (contigs not found) | 95 | 95 | 95 | 181 | 255 |
Maximum length (contigs not found) | 826 | 3703 | 6809 | 9542 | 23,614 |

F_{1} is better than F_{2} without dominating it. Comparing the results where transitive overlaps are excluded (Table 8 and Table 9, or Table 5) to those where they are not (Table 6 and Table 7, or Table 5), we can see that the results are similar for F_{1}; hence it is not clear whether excluding transitive overlaps improves the assembly. There is, however, an obvious advantage to excluding them: the problem size is reduced and fewer computational resources are required. So, our recommendation is to exclude transitive overlaps.

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 82 | 114 | 163 | 224 | 512 |
Total number of bases | 18,579 | 84,571 | 240,265 | 551,701 | 1,343,637 |
Contigs found | 75 | 103 | 142 | 181 | 417 |
Contigs found (%) | 91.5 | 90.4 | 87.1 | 80.8 | 81.4 |
Bases found | 15,599 | 72,245 | 174,244 | 320,502 | 791,376 |
Bases found (%) | 84 | 85.4 | 72.5 | 58.1 | 58.9 |
N50 (contigs found) | 356 | 1527 | 2984 | 3910 | 4135 |
Average length (contigs found) | 208 | 701.4 | 1227.10 | 1770.70 | 1897.80 |
Minimum length (contigs found) | 54 | 58 | 54 | 59 | 53 |
Maximum length (contigs found) | 882 | 7261 | 10,943 | 10,648 | 16,135 |
Contigs not found | 7 | 11 | 21 | 43 | 95 |
Contigs not found (%) | 8.5 | 9.6 | 12.9 | 19.2 | 18.6 |
Bases not found | 2980 | 12,326 | 66,021 | 231,199 | 552,261 |
Bases not found (%) | 16 | 14.6 | 27.5 | 41.9 | 41.1 |
N50 (contigs not found) | 539 | 2308 | 6336 | 8135 | 7554 |
Average length (contigs not found) | 425.7 | 1120.50 | 3143.90 | 5376.70 | 5813.30 |
Minimum length (contigs not found) | 95 | 96 | 103 | 181 | 251 |
Maximum length (contigs not found) | 826 | 3703 | 10,648 | 22,745 | 24,396 |

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 334 | 1883 | 6080 | 15,618 | 48,263 |
Total number of bases | 40,076 | 265,584 | 831,449 | 2,109,070 | 6,613,747 |
Contigs found | 304 | 1827 | 5980 | 15,443 | 46,187 |
Contigs found (%) | 91 | 97 | 98.4 | 98.9 | 95.7 |
Bases found | 35,980 | 256,641 | 815,994 | 2,081,306 | 6,208,258 |
Bases found (%) | 89.8 | 96.6 | 98.1 | 98.7 | 93.9 |
N50 (contigs found) | 135 | 165 | 154 | 153 | 153 |
Average length (contigs found) | 118.4 | 140.5 | 136.5 | 134.8 | 134.4 |
Minimum length (contigs found) | 62 | 62 | 61 | 57 | 57 |
Maximum length (contigs found) | 446 | 660 | 747 | 894 | 1152 |
Contigs not found | 30 | 56 | 100 | 175 | 2076 |
Contigs not found (%) | 9 | 3 | 1.6 | 1.1 | 4.3 |
Bases not found | 4096 | 8943 | 15,455 | 27,764 | 405,489 |
Bases not found (%) | 10.2 | 3.4 | 1.9 | 1.3 | 6.1 |
N50 (contigs not found) | 154 | 199 | 198 | 194 | 245 |
Average length (contigs not found) | 136.5 | 159.7 | 154.6 | 158.7 | 195.3 |
Minimum length (contigs not found) | 64 | 65 | 65 | 64 | 63 |
Maximum length (contigs not found) | 362 | 481 | 481 | 493 | 1130 |

We then ran Algorithm 1 on the MST with F_{1} (Table 5 and Table 10). In this case, we were able to find many more bases in the genome than in the previous experiments, and more than Edena, but our contigs are short. There appears to be a tradeoff between the number of bases that can be found and contig length.

F_{1} (F_{1} was again better than F_{2}). Analyzing the contigs obtained by the algorithm, we found that if a contig is split at the edge that joins the cycle with the path, we have a better chance of finding both resulting contigs. Therefore, we modified Algorithm 5 by setting d(v) = 0 in Step 2.2.1.1 and repeated Experiment 4 without transitive overlaps. In this case, the objective function F_{1} was better than F_{2} without being dominant. These results are given in Table 5 and Table 11.

Problem (fragments) | 22,448 | 89,718 | 298,194 | 730,201 | 2,278,504 |
---|---|---|---|---|---|
Contigs obtained | 82 | 115 | 164 | 225 | 510 |
Total number of bases | 19,343 | 84,727 | 240,415 | 552,510 | 1,321,969 |
Contigs found | 74 | 102 | 150 | 205 | 480 |
Contigs found (%) | 90.2 | 88.7 | 91.5 | 91.1 | 94.1 |
Bases found | 15,822 | 70,318 | 206,927 | 463,684 | 1,102,152 |
Bases found (%) | 81.8 | 83 | 86.1 | 83.9 | 83.4 |
N50 (contigs found) | 360 | 1614 | 3326 | 4937 | 4499 |
Average length (contigs found) | 213.8 | 689.4 | 1379.50 | 2261.90 | 2296.20 |
Minimum length (contigs found) | 54 | 58 | 54 | 62 | 53 |
Maximum length (contigs found) | 882 | 7261 | 10,943 | 22,745 | 24,369 |
Contigs not found | 8 | 13 | 14 | 20 | 30 |
Contigs not found (%) | 9.8 | 11.3 | 8.5 | 8.9 | 5.9 |
Bases not found | 3521 | 14,409 | 33,488 | 88,826 | 219,817 |
Bases not found (%) | 18.2 | 17 | 13.9 | 16.1 | 16.6 |
N50 (contigs not found) | 758 | 2308 | 3954 | 7258 | 11,798 |
Average length (contigs not found) | 440.1 | 1108.40 | 2392.00 | 4441.30 | 7327.20 |
Minimum length (contigs not found) | 95 | 95 | 95 | 181 | 255 |
Maximum length (contigs not found) | 826 | 3703 | 6809 | 9542 | 23,614 |

For F_{1}, there were 513 contigs, of which 414 were found directly, 76 had to be split in two to be found, 14 in three, and 10 in four or more pieces, while two produced pieces that were too small to be considered. With appropriate rules, it should be possible to find the correct split points in most cases, as other assemblers do. We also analyzed the residual paths that remain after removing the paths of maximum length; they suggest that several long contigs could be recovered, which we leave for future work.

## 5. Conclusions and Future Work

Of the objective functions F_{1} and F_{2}, even though neither dominates the other, better results are obtained using F_{1}; hence we recommend its use. We also recommend removing transitive overlaps to reduce resource requirements, even though the assembly results are almost the same for function F_{1}. The use of the MST considerably increases the number of correct bases, but the contigs obtained in this way are too short. The experiments show a tradeoff between contig length and the number of bases detected. Comparing our results to those of Edena is not completely objective, since we do not apply heuristics after the algorithms are executed to refine the solutions, whereas Edena (as well as the Velvet assembler) does so intensively. In any case, compared to Edena, our method produced fewer bases but longer contigs in some cases, and more bases but shorter contigs in others, especially when using the MST. Therefore, the problem should be treated as a multi-objective optimization, with objectives that maximize both the number of bases found and the mean contig length, giving the user a Pareto front from which to choose the most convenient solution according to their own criteria.

- Developing rules to split contigs in such a way that most of the pieces can be found.
- Attempting to recover other contigs after extracting the longest ones.
- Implementing a solution to DNA fragment assembly as a multi objective optimization problem.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Watson, J.D.; Crick, F.H. Molecular structure of nucleic acids. Nature **1953**, 171, 737–738.
2. Sanger, F.; Nicklen, S.; Coulson, A.R. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. **1977**, 74, 5463–5467.
3. Staden, R. A strategy of DNA sequencing employing computer programs. Nucl. Acid. Res. **1979**, 6, 2601–2610.
4. Van Belkum, A.; Scherer, S.; van Alphen, L.; Verbrugh, H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. **1998**, 62, 275–293.
5. Salzberg, S.L.; Phillippy, A.M.; Zimin, A.; Puiu, D.; Magoc, T.; Koren, S.; Yorke, J.A. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Gen. Res. **2012**, 22, 557–567.
6. Shendure, J.; Ji, H. Next-generation DNA sequencing. Nat. Biotechnol. **2008**, 26, 1135–1145.
7. Pevzner, P.A.; Tang, H.; Waterman, M.S. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. **2001**, 98, 9748–9753.
8. Parsons, R.J.; Forrest, S.; Burks, C. Genetic algorithms for DNA sequence assembly. In Proceedings of the First International Conference on Intelligent Systems for Molecular Biology (ISMB), Bethesda, MD, USA, 6–9 July 1993; pp. 310–318.
9. Krause, J.; Cordeiro, J.; Parpinelli, R.S.; Lopes, H.S. A Survey of Swarm Algorithms Applied to Discrete Optimization Problems. In Swarm Intelligence and Bio-inspired Computation: Theory and Applications; Elsevier Science Publishers: Amsterdam, The Netherlands, 2013.
10. Alba, E.; Luque, G. A new local search algorithm for the DNA fragment assembly problem. In Evolutionary Computation in Combinatorial Optimization; Springer: Berlin/Heidelberg, Germany, 2007; pp. 1–12.
11. Luque, G.; Alba, E. Metaheuristics for the DNA fragment assembly problem. Int. J. Comput. Intel. Res. **2005**, 1, 98–108.
12. Firoz, J.S.; Rahman, M.S.; Saha, T.K. Bee algorithms for solving DNA fragment assembly problem with noisy and noiseless data. In Proceedings of the 14th ACM Annual Conference on Genetic and Evolutionary Computation, Philadelphia, PA, USA, 7–11 July 2012; pp. 201–208.
13. Mallen-Fullerton, G.M.; Fernandez-Anaya, G. DNA fragment assembly using optimization. In Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC), Cancun, Mexico, 20–23 June 2013; pp. 1570–1577.
14. Gallant, J.; Maier, D.; Storer, J.A. On finding minimal length superstrings. J. Comput. Syst. Sci. **1980**, 20, 50–58.
15. Prim, R.C. Shortest connection networks and some generalizations. Bell Syst. Tech. J. **1957**, 36, 1389–1401.
16. Kruskal, J.B. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. **1956**, 7, 48–50.
17. Hopcroft, J.E.; Ullman, J.D. Set merging algorithms. SIAM J. Comput. **1973**, 2, 294–303.
18. Bonfield, J.K.; Smith, K.; Staden, R. A new DNA sequence assembly program. Nucl. Acid. Res. **1995**, 23, 4992–4999.
19. Mallén-Fullerton, G.M.; Hughes, J.A.; Houghten, S.; Fernández-Anaya, G. Benchmark datasets for the DNA fragment assembly problem. Int. J. Bio-Inspir. Comput. **2013**, 5, 384–394.
20. Staphylococcus aureus subsp. aureus MW2 DNA, complete genome, GenBank: BA000033.2. Available online: http://www.ncbi.nlm.nih.gov/nuccore/47118312?report=fasta (accessed on 3 June 2015).

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mallén-Fullerton, G.M.; Quiroz-Ibarra, J.E.; Miranda, A.; Fernández-Anaya, G.
Modified Classical Graph Algorithms for the DNA Fragment Assembly Problem. *Algorithms* **2015**, *8*, 754-773.
https://doi.org/10.3390/a8030754
