Construction of Whole Genomes from Scaffolds Using Single Cell Strand-Seq Data

Accurate reference genome sequences provide the foundation for modern molecular biology and genomics, as the interpretation of sequence data to study evolution, gene expression, and epigenetics depends heavily on the quality of the genome assembly used for its alignment. Correctly organising sequenced fragments such as contigs and scaffolds in relation to each other is a critical and often challenging step in the construction of robust genome references. We previously identified misoriented regions in the mouse and human reference assemblies using Strand-seq, a single cell sequencing technique that preserves DNA directionality. Here we demonstrate the ability of Strand-seq to build and correct full-length chromosomes by identifying which scaffolds belong to the same chromosome and determining their correct order and orientation, without the need for overlapping sequences. We demonstrate that Strand-seq exquisitely maps assembly fragments into large related groups and chromosome-sized clusters without using new assembly data. Using template strand inheritance as a bi-allelic marker, we employ genetic mapping principles to cluster scaffolds that are derived from the same chromosome and order them within the chromosome based solely on directionality of DNA strand inheritance. We prove the utility of our approach by generating improved genome assemblies for several model organisms including the ferret, pig, Xenopus, zebrafish, Tasmanian devil and the Guinea pig.


Introduction
The mouse [1] and human [2] genome references have revolutionized biomedical research and facilitated many advances in studies of transcription, epigenetics, genetic variation, evolution, and cancer [3]. However, while both assemblies are of very high quality, they still contain fragments that have not been localized to specific chromosomes, and large regions (typically flanked by unbridged gaps) that are incorrectly oriented with respect to adjacent scaffolds [4,5]. These features highlight the difficulty in finishing genome maps, with typically repetitive or degenerate regions preventing robust overlapping/contiguous sequence across the length of the chromosome. As methods improve, assemblies themselves evolve over time as sequences are added, gaps are closed, and errors resolved.
For example, in the 13 years from the first public release of the complete human genome sequence (NCBI33) [6] to the current assembly (GRCh38), the total number of represented nucleotides has only increased 2.79% (82.27 Mb). While the change in genomic content between these two builds appears relatively modest, the change in the organization of the sequence has been dramatic. Regions with unknown local order and orientation have been corrected and placed, and incorrectly merged artefacts such as pseudo-duplications, misorientations, and chimeras have been repaired. Correctly arranging available sequence data is therefore as important as uncovering new sequences in the process of improving genome references. Indeed, much of the drive to discover additional sequences revolves around the need to physically connect and orient contigs and scaffolds within the assembly, which is especially challenging within tracts of repetitive DNA. The methods involved in gap resolution and reorientation typically involve deeper sequencing of genomic DNA or Bacterial Artificial Chromosome libraries [7], but often also rely on novel methods such as optical mapping [8,9] and long-read sequencing technologies [10-13]. Recent studies have shown that improvements to optical mapping (termed whole-genome mapping) can facilitate de novo genome assemblies when used in conjunction with massively parallel sequencing (MPS) [8]. This method involves creating scaffolds from sequencing libraries of genomic DNA and fosmid clones, followed by whole-genome mapping to match sequence patterns between contigs, generating super-scaffolds. While whole-genome mapping reduces misorientation errors and can place scaffolds over a relatively large distance, it is still mainly used as a verification tool, rather than the primary line of evidence used to produce chromosome-level genome references.
With the increased availability and affordability of MPS technologies, there have been efforts to build de novo assemblies from short read data. Ancillary methods to validate and expand these assemblies are becoming increasingly important in this endeavor, as many MPS assemblies show a marked reduction in quality and are dependent on the type of aligners used [14]. Given the relatively short sequence identity available to build contigs from MPS data, any nucleotide ambiguities can impact the alignment and affect the resulting assembly. Therefore, methods to detect incorrectly aligned scaffolds, to aid in creating the assembly, and to provide secondary verification of the assembly are important to improve these strategies. Long read approaches resolve some of the ambiguity in joining overlapping reads into contigs [15] but suffer from a higher nucleotide error rate that can mask overlapping regions between contiguous sequence, and still only cover a local region rather than the whole chromosome.
The single cell MPS technique Strand-seq offers an attractive orthogonal tool to refine and correct reference assemblies [4,16]. Strand-seq involves sequencing parental DNA template strands in single daughter cells and the method preserves the directionality of DNA. This is achieved by culturing cells in the presence of BrdU, a thymidine analogue that is incorporated exclusively into newly formed DNA strands. After cell division, single cell libraries are created and treated with a combination of Hoechst and UV to remove the newly formed strands, resulting in single-stranded library fragments containing template DNA only [17]. As replication is semi-conservative, the DNA template strands that are inherited into daughter cells are either the Watson (W, '-' or 3′-5′) or the Crick (C, '+' or 5′-3′) strand [18]. By maintaining this directionality, we previously showed that Strand-seq locates sister chromatid exchanges (SCEs) at unparalleled resolution, seen as a template strand switching from W to C or vice versa [4,17,19,20]. In addition, Strand-seq has been shown to have many applications including the mapping of polymorphic inversions [16], haplotyping [21,22], and studies of DNA repair in yeast [23] and humans [20]. These applications as well as the principle of genome assembly using Strand-seq data are illustrated in Figure 1. For the latter, the orientation of sequence reads is used to generate scaffolds, with sequence reads in each scaffold having either a WW, WC, or CC state in every cell that is sequenced (Figure 1D) [24].
Figure 1. (A) Strand-seq involves sequencing template strands. Parental homologues (pink and blue) are double stranded; Crick (C) strand in blue, Watson (W) strand in orange. DNA replication occurs in the presence of BrdU, which incorporates into the replicated strand (dotted lines). Sequencing libraries from single daughter cells have the BrdU-containing strand selectively removed to generate directional chromosomes; either CC or WW (top) or WC (bottom) depending on segregation. Histograms of directional reads are plotted on ideograms for each chromosome. (B) When homologues inherit different template strands, haplotypes can be determined. In the example, all C reads map to the maternal homologue so all single nucleotide variants (SNVs) identified (black dots) form the maternal haplotype, and all W reads map to the paternal homologue, so all SNVs identified (white dots) form the paternal haplotype. (C) Structural variation can be identified in Strand-seq libraries. Inversions will align to the opposite strand of the reference assembly and so are identified as a change in template strand state. (D) Strand-seq can be used to create assemblies since contigs from the same chromosome will have the same template inheritance pattern. Grouping based on shared template inheritance patterns determines which fragments belong together. Note in the example contigs from chr1, chr3, and chr5 have the same template pattern (WC) so require additional libraries to establish which contigs belong to which chromosome.
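The WW, WC, and CC states described above can be illustrated with a short sketch that calls a template state for one scaffold in one cell from its directional read counts. This is an assumption-laden illustration rather than the procedure used in the study: the function name, the minimum read count, and the 0.8 cutoff are all hypothetical.

```python
from typing import Optional


def call_strand_state(watson_reads: int, crick_reads: int,
                      min_reads: int = 10,
                      homozygous_cutoff: float = 0.8) -> Optional[str]:
    """Call WW, WC, or CC for one scaffold in one single-cell library.

    min_reads and homozygous_cutoff are illustrative assumptions,
    not values taken from the paper.
    """
    total = watson_reads + crick_reads
    if total < min_reads:
        # Too few reads to make a call; treat the state as missing.
        return None
    watson_fraction = watson_reads / total
    if watson_fraction >= homozygous_cutoff:
        return "WW"   # almost all reads are Watson: both templates are W
    if watson_fraction <= 1 - homozygous_cutoff:
        return "CC"   # almost all reads are Crick: both templates are C
    return "WC"       # mixed reads: one template of each kind


example_calls = [
    call_strand_state(95, 5),    # mostly Watson reads
    call_strand_state(3, 97),    # mostly Crick reads
    call_strand_state(48, 52),   # balanced reads
    call_strand_state(2, 1),     # too few reads
]
```

Returning a missing value for low-coverage scaffolds keeps sparse single-cell data from contributing unreliable states to later comparisons.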
Similarly, any change in strand state within a scaffold represents either an SCE or an error where contigs have been incorrectly fused. When a strand state switch occurs at the same location in all libraries it can be delineated as an error, while SCEs will occur at random locations. SCEs are important elements in creating Strand-seq assemblies, as every scaffold downstream of an event will have a different state to everything upstream of that event (Supplementary Figure S1). Similar to meiotic recombination in genetic mapping approaches, this feature allows ordering of scaffolds along chromosomes.
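The distinction between a recurrent strand-state switch (a likely assembly error) and a randomly located SCE can be sketched as a simple recurrence count across libraries. This is a minimal illustration under stated assumptions: the function name, the 100 kb bin size, and the 0.5 recurrence threshold are hypothetical and not taken from the study.

```python
from collections import Counter
from typing import Dict, List


def classify_switches(switches_per_library: List[List[int]],
                      bin_size: int = 100_000,
                      min_recurrence: float = 0.5) -> Dict[int, str]:
    """Label each binned switch position on one scaffold.

    switches_per_library holds, for every library, the positions (bp)
    at which the template state changes within the scaffold.
    bin_size and min_recurrence are illustrative assumptions.
    """
    n_libraries = len(switches_per_library)
    if n_libraries == 0:
        return {}
    bin_counts: Counter = Counter()
    for positions in switches_per_library:
        # Count each bin at most once per library.
        for bin_index in {pos // bin_size for pos in positions}:
            bin_counts[bin_index] += 1
    calls = {}
    for bin_index, count in sorted(bin_counts.items()):
        recurrent = count / n_libraries >= min_recurrence
        calls[bin_index * bin_size] = "assembly_error" if recurrent else "SCE"
    return calls


# Five hypothetical libraries: four switch near 1.25 Mb, one also at 5 Mb.
libraries = [[1_250_000], [1_240_000, 5_000_000], [1_260_000], [1_255_000], []]
switch_calls = classify_switches(libraries)
```

In this toy input the switch near 1.25 Mb recurs in four of five libraries and is labelled as a candidate error, while the isolated switch at 5 Mb is labelled as an SCE.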
Previously, Strand-seq was used to resolve orientation errors in the GRCm37 assembly to which the data were aligned [4]. In addition, we were able to map many of the remaining unlocalized and unplaced scaffolds from this assembly by matching the template inheritance pattern of the fragments to the inheritance pattern of individual chromosomes [5]. Supporting data verified the presence of the misorientations identified by Strand-seq [4], and the Mouse Genome Reference Consortium incorporated this information into subsequent builds. For the human genome, orienting fragments in the reference assembly is complicated by common polymorphic inversions [16]. Nevertheless, using Strand-seq, we identified 41 reference assembly misorientations and/or minor alleles (allele frequency < 0.05) in GRCh37, which were distinguished from >100 polymorphic inversions found in unrelated individuals [16]. Strand-seq was also used to assemble haplotypes along the entire length of all chromosomes without generational information or statistical inference [21,22]. While we have utilized Strand-seq to correct polished assemblies, it is more complicated to align scaffolds together in the absence of a whole assembly map. However, our ability to successfully improve near complete assemblies motivated us to apply Strand-seq to other species with less complete, draft-quality genome builds.
Many organisms that are important for biomedical research have incomplete genome assemblies. Here, we have applied Strand-seq and the bioinformatics analysis package contiBAIT [24] to aid in refining the assemblies for six such organisms (Table 1). In Table 1, the organism statistics outline the cell line used, the number of Strand-seq libraries used in the study, and the expected number of chromosomes. The chromosome number was adjusted based on the allosomes expected for the sex of the cell line for each organism. The assembly statistics include the assembly that the Strand-seq libraries were aligned to, the (gapped) size of that assembly, and the proportion of scaffolds covered in the data (where applicable). The misorientation and chimera statistics highlight the number, genomic size, and proportion of the assembly affected by misorientations and chimeric fragments respectively.
To demonstrate the ability of Strand-seq to generate robust assemblies by clustering thousands of unconnected contigs, three organisms were selected with scaffold-stage assemblies at different levels of completeness. The ferret (M. putorius furo) assembly consists of 7783 unplaced scaffolds [25] and is an important model for studies of human respiratory diseases, including influenza infection and transmission. The assembly of the Tasmanian devil (S. harrisii) genome has been spearheaded to aid in studies of an atypical transmissible cancer, devil facial tumor disease, which is decimating the population. Currently this assembly contains 35,974 scaffolds placed to chromosomes, but without a specific order [26]. Finally, the Guinea pig (C. porcellus) is an important model organism used in the study of vaccines and the research and diagnosis of infectious diseases. This assembly consists of 3142 large unplaced scaffolds [27].
We further used Strand-seq to correct misorientations and incorrectly placed scaffolds in three chromosome-stage assemblies. The principle of this approach is based on arranging scaffolds into linkage groups (Figure 2) and ordering them along the full length of each chromosome (Supplementary Figure S1). Of the six organisms used for enhancing genome references, the pig (S. scrofa) was selected for its significance in agriculture and in medicine, as well as in understanding evolution during animal domestication. Most of the sequence (92%, 5344 scaffolds) has been ordered into the 20 chromosomes, with a further 4562 scaffolds remaining unplaced. However, this assembly still contains 69,541 spanned and 5323 unspanned gaps [28]. Since there is no underlying information on the orientation of scaffolds separated by unspanned gaps (which have no supporting evidence for the orientation of the contigs they flank), at least some of the scaffolds are likely to be incorrectly oriented. Genome references of many other important model organisms that are also built at the chromosome level contain multiple gaps and unplaced fragments. For example, the zebrafish (D. rerio) is an important model in vertebrate development and gene function, and while the zebrafish assembly [29] (Zv9) is of high quality and mostly complete, it included 1107 unplaced fragments (55.4 Mb) and 3427 unspanned gaps. A further model organism with a large research community, Xenopus (X. tropicalis), has an assembly with more unplaced fragments (6811, 167.9 Mb), but no unspanned gaps [30].

Results
For the six organisms studied, we built and used Strand-seq libraries from between 56 and 242 single cells per species (Methods and Table 1). Data were aligned to their respective assemblies and analyzed using the Bioconductor package contiBAIT [24] (Table 1). We also included previously published mouse [4] and human [16] Strand-seq datasets as positive controls. For all organisms, we were able to correct multiple errors that encompassed large regions of these assemblies (both between and within scaffolds). We achieved this by identifying two distinctive signatures that represent common errors that propagate within assemblies (Figure 3). First, regions that showed consistent and complete reversal in template state for a portion of the scaffold were flagged as a misorientation (or as a polymorphic inversion between the cell line sequenced and the assembly). Next, regions that showed no inheritance similarity with the neighbouring sequence were identified as putative chimeras that arise from contig mis-joins such that portions of scaffolds are placed to the wrong chromosome. For the former, misoriented sequences were reoriented within the fragment and flagged as errors in the assembly (Figure 3). For the latter, chimeras were split at the mis-join site and independently clustered to identify the correct location of these fragments (an example chimeric scaffold is shown in Supplementary Figure S2).
Using the template inheritance as a bi-allelic marker for every scaffold in the respective assemblies, we devised a method to cluster scaffolds based on the expectation that those belonging to the same chromosome will show the same bi-allelic template pattern across multiple Strand-seq libraries [24]. To achieve this, all fragments from a single Strand-seq cell were divided into one of three groups based on the inheritance patterns of their templates: WW, CC, or WC, and then grouped and ordered based on shared inheritance states between all fragments and across all cells (Figure 2). In this way, we were able to assign each scaffold to a linkage group (LG), where all scaffolds within the same LG belonged to the same physical chromosome. The software is able to account for the fact that assembly scaffolds may be in 5′-3′ or 3′-5′ orientation and reorients fragments into the same direction. These LGs are therefore equivalent to a 'super scaffold': they encompass many scaffolds and fragments that cluster together, are oriented in the same direction, and represent a draft chromosome (Figure 2). Moreover, since the strand inheritance pattern is a feature of the entire chromosome, Strand-seq is able to resolve scaffold associations along entire chromosomes rather than at a megabase level.
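The grouping of scaffolds into LGs by shared inheritance pattern, with reorientation of flipped scaffolds, can be sketched as follows. This is a simple greedy illustration and is not presented as the contiBAIT algorithm; the function names and the 0.8 concordance threshold are hypothetical assumptions.

```python
from typing import Dict, List, Optional, Tuple

States = List[Optional[str]]  # one of "WW", "WC", "CC", or None per cell

FLIP = {"WW": "CC", "CC": "WW", "WC": "WC"}


def flip(states: States) -> States:
    """Strand states of the same scaffold in the opposite orientation."""
    return [FLIP[s] if s is not None else None for s in states]


def concordance(a: States, b: States) -> float:
    """Fraction of cells (with calls in both scaffolds) sharing a state."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def cluster_scaffolds(states_by_scaffold: Dict[str, States],
                      min_concordance: float = 0.8
                      ) -> List[List[Tuple[str, str]]]:
    """Greedily assign scaffolds to linkage groups (illustrative only).

    Each scaffold joins the first group whose reference pattern it
    matches, either as given ("+") or after flipping ("-").
    """
    groups: List[dict] = []
    for name, states in states_by_scaffold.items():
        for group in groups:
            if concordance(states, group["reference"]) >= min_concordance:
                group["members"].append((name, "+"))
                break
            if concordance(flip(states), group["reference"]) >= min_concordance:
                group["members"].append((name, "-"))
                break
        else:
            # No existing group matches: this scaffold seeds a new one.
            groups.append({"reference": list(states), "members": [(name, "+")]})
    return [g["members"] for g in groups]


# Hypothetical strand states across six cells.
toy_states = {
    "scaf_a": ["WW", "WC", "CC", "WW", "WC", "CC"],
    "scaf_b": ["WW", "WC", "CC", "WW", "WC", "CC"],
    "scaf_c": ["CC", "WC", "WW", "CC", "WC", "WW"],  # scaf_a, flipped
    "scaf_d": ["WC", "WW", "WC", "CC", "CC", "WW"],  # unrelated pattern
}
linkage_groups = cluster_scaffolds(toy_states)
```

Here `scaf_c` joins the first group in the reverse orientation, which mirrors the reorientation step described above, while `scaf_d` seeds a second group.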
For each scaffold assembly, the majority (>90%) of fragments clustered together into the same number of LGs as there are chromosomes from that organism (Figure 1). For example, for the ferret genome (20,XX), 97.9% of the assembly fragments mapped to the 20 largest LGs (Figure 4, Supplementary Figure S3). Each of these 20 groups represents scaffolds that have been correctly oriented and show co-inherited strand states, consistent with them belonging on the same chromosome. Similarly, 90.9% of Guinea pig (32,XX) scaffolds mapped to the 32 largest LGs (Supplementary Figures S3 and S4), and 90.4% of Tasmanian devil (7,XY) assembly fragments mapped to 7 LGs (Supplementary Figure S5). Since unlocalized and unplaced fragments are not tethered to whole chromosome scaffolds, the orientation of these fragments was expected to be mostly random. Our data supported this, showing there were approximately equal numbers of unlocalized and unplaced fragments represented in each direction (Figure 5B). Using the same methodology, we were able to locate many of the unlocalized fragments present in the chromosome-stage assemblies for the pig, zebrafish, and Xenopus.
Misorientations were identified in all assemblies, though to varying degrees (Figure 5, Table 1). By conventional methodologies, orienting contiguous sequences flanked by gaps has been difficult, with BAC end sequencing being the primary approach to bridging these gaps. It was therefore not unexpected that the majority of misoriented scaffolds we identified occurred between assembly gaps. However, misorients were also identified within contiguous sequences, albeit at a lower rate. For example, we discovered 578 misoriented regions in the zebrafish assembly Zv9 (56.8 Mb, 4.19% of the assembly), but only 22 of these were not flanked by gaps.
Figure 3. The effect of different assembly errors or structural variation on clustering. Different errors will generate characteristic patterns in the clustering data. Consider two scaffolds in close proximity on a chromosome, scaffold_1 and scaffold_2. (A) In a case where both scaffolds are oriented in the same direction, the scaffolds will have the same strand-state patterns. When comparing homozygous patterns (WW scaffolds against CC scaffolds), heterozygous patterns (WW or CC scaffolds against WC scaffolds) or comparing all three strand states against each other, there will be high similarity. (B) In the case of a misorientation (or a homozygous inversion), the strand-state patterns will be antithetical when comparing homozygous states, as whenever scaffold_1 is WW, scaffold_2 will be CC, and as such, these scaffolds will be completely dissimilar. However, since misorientations are not visualized in heterozygous inheritance patterns, when comparing WW or CC states against WC states, the scaffolds are highly similar. When comparing all three states against each other, the similarity seen with WC scaffolds and dissimilarity seen with WW or CC scaffolds will cancel out, resulting in ~50% similarity. (C) In cases of a heterozygous inversion, either scaffold_1 or scaffold_2 may have a homozygous state, but not both. Therefore, no comparisons can be made when only considering the homozygous states, and NA values are generated. There will, however, be a high degree of dissimilarity when comparing homozygous and heterozygous states. It is important to distinguish these natural structural variants from assembly reference errors. (D) In cases where a scaffold is incorrectly located to a chromosome (i.e., a chimera), the inheritance pattern between the two scaffolds will be random, and there will be no significant similarity or dissimilarity between these scaffolds.
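The three comparisons described in the Figure 3 caption (homozygous states only, homozygous versus heterozygous, and all three states) can be made concrete with a small sketch. It is an illustration under assumptions: the function name and return format are hypothetical, and `None` stands in for the NA values mentioned in the caption.

```python
from typing import Dict, List, Optional

HOMOZYGOUS = {"WW", "CC"}


def compare_scaffolds(a: List[Optional[str]],
                      b: List[Optional[str]]) -> Dict[str, Optional[float]]:
    """Similarity of two scaffolds under three comparison modes."""
    # Keep only cells in which both scaffolds have a strand-state call.
    cells = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]

    def fraction(matches) -> Optional[float]:
        matches = list(matches)
        # No informative cells: report NA as None.
        return sum(matches) / len(matches) if matches else None

    return {
        # WW against CC, using only cells where both scaffolds are homozygous.
        "homozygous": fraction(
            x == y for x, y in cells if x in HOMOZYGOUS and y in HOMOZYGOUS),
        # Homozygous (WW or CC) against heterozygous (WC), ignoring direction.
        "heterozygous": fraction(
            (x in HOMOZYGOUS) == (y in HOMOZYGOUS) for x, y in cells),
        # Exact agreement across all three states.
        "all": fraction(x == y for x, y in cells),
    }


scaffold_1 = ["WW", "WC", "CC", "WC"]
same_orientation = compare_scaffolds(scaffold_1, ["WW", "WC", "CC", "WC"])
misoriented = compare_scaffolds(scaffold_1, ["CC", "WC", "WW", "WC"])
het_inversion = compare_scaffolds(scaffold_1, ["WC", "WW", "WC", "CC"])
```

With these toy vectors the misoriented pair scores 0 in the homozygous comparison, 1 in the heterozygous comparison, and 0.5 overall, matching case (B); the heterozygous inversion pair yields no homozygous comparison at all, matching case (C).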
To investigate our ability to correctly orient scaffolds using Strand-seq, we performed BioNano optical mapping and shotgun sequencing on a separate zebrafish cell line and compared scaffolding calls. More than 97% of misorientations identified by Strand-seq were cross-validated by at least one orthogonal technique. Of these, 240 (41%) were identified in shotgun sequenced clones, and 256 (44%) were identified through BioNano optical mapping. Based on these data, our Strand-seq results were included as a validation method, and the misorientations identified were assessed, actioned, and amended in the GRCz10 build of this genome reference. Examples demonstrating the high concordance between optical and Strand-seq data both for misorientations and in localising unlocalized fragments are shown in Supplementary Figure S6.
Similar observations were made with the other chromosome assemblies: the pig reference (Sscrofa10.2) exhibited a greater degree of misorientation than the zebrafish assembly, with 1514 fragments (500.18 Mb, 17.81%) identified within the chromosome scaffolds. In addition, 96 chimeric fragments were discovered (24.73 Mb), split , and relocalized. For the Xenopus assembly, 140 misorientations were found (269.29 Mb, 18.67%) and 63 regions were flagged as chimeric. Using these data, we generated refined versions for each assembly, and after realigning, all Strand-seq reads were in the correct direction (Supplementary Figure S7). The quality of the scaffold-stage assemblies studied varied markedly based on misorientation and chimerism analysis ( Figure 5, Table 1). For the Guinea pig reference, 18 putative chimeras were detected, while 45 misorientations (197.21 Mb, 7.24%) within the scaffolds were found. Fewer misorientations were seen in the ferret assembly, with 35 identified (25.97 Mb, 1.08%), while 61 chimeras were detected. Finally, we identified 1675 putative misorientations in the Tasmanian devil assembly (13.0 Mb, 0.41%) and a further 1484 putative chimeras ( Table 1).
Percentage of assembly fragments classified as misorients or chimeras. Horizontal lines represent the sizes of each error within the assembly. Note that all chromosome-level assemblies displayed multiple orientation errors. The chimeric fragment within zebrafish is derived from an inverted region in the AB strain with respect to the Tübingen assembly [31], while misorients in the mouse were identified previously [4], and chimeras and misorients identified in the human sample correlated with previously identified heterozygous and homozygous inversions, respectively [16]. (B) Barplot of scaffold orientation within each assembly. The predominant orientation of scaffolds within the assembly is set as correct ("+strand", grey), and the frequency of scaffolds that do not match this orientation is calculated. Misorients are subdivided into entire scaffolds that are in the opposite orientation to the majority of assembly scaffolds (dark green), and fragments within contiguous sequence that are in the incorrect orientation (purple). Chimeric fragments (green) are defined as portions of contiguous sequence that display a different template strand inheritance pattern and are therefore likely assigned to an incorrect chromosome. Incorrectly oriented scaffolds constitute half of the scaffold-level assemblies. Chromosome- and complete-level assemblies have fewer scaffolds (higher N50 values), so most assembly errors occur within contiguous sequences.
As a final application of Strand-seq, we were able to organise scaffolds into a relative order within LGs. Because SCEs arise naturally in single libraries and occur randomly during replication [4], the template strand similarity between scaffolds across multiple libraries progressively diminishes the further apart they are in physical distance, as the likelihood of an SCE occurring between them increases. In this way, our approach is similar to classical linkage mapping, where genetic distance can be inferred as a function of the number of SCEs between two fragments (Supplementary Figure S1). Since the chromosomal locations of all scaffolds had already been determined, we ordered these fragments based on SCEs within each chromosome (Supplementary Figure S1). All data are included as bed files, which encompass the distinct LGs and order of fragments for each scaffold assembly, along with the directionality of all fragments for both scaffold and chromosome-level assemblies (Supplementary Materials).
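The ordering principle described above can be illustrated with a minimal Python sketch. It is not the contiBAIT implementation: it assumes that each scaffold already has one strand-state call per library (encoded here with the hypothetical labels "WW", "WC", and "CC"), uses the fraction of libraries in which two scaffolds disagree as a proxy for the number of intervening SCEs, and chains scaffolds with a simple greedy nearest-neighbour walk. The function names, scaffold names, and toy data are invented for illustration only.

```python
from itertools import combinations


def discordance(states_a, states_b):
    """Fraction of libraries in which two scaffolds carry different strand states.

    Libraries where either scaffold has no call (None) are ignored.
    """
    informative = [(a, b) for a, b in zip(states_a, states_b)
                   if a is not None and b is not None]
    if not informative:
        return 1.0  # no shared information: treat as maximally distant
    return sum(a != b for a, b in informative) / len(informative)


def order_scaffolds(strand_states):
    """Greedy ordering of the scaffolds of one linkage group (illustrative only).

    strand_states maps scaffold name -> list of per-library strand states.
    """
    names = sorted(strand_states)
    if len(names) < 2:
        return names
    dist = {}
    for a, b in combinations(names, 2):
        d = discordance(strand_states[a], strand_states[b])
        dist[(a, b)] = d
        dist[(b, a)] = d
    # Start from one member of the most discordant pair, assumed to lie near a chromosome end.
    start = max(combinations(names, 2), key=lambda pair: dist[pair])[0]
    order = [start]
    remaining = [n for n in names if n != start]
    while remaining:
        # Append the scaffold whose strand states best match the current end of the chain.
        nearest = min(remaining, key=lambda n: dist[(order[-1], n)])
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Hypothetical calls for four scaffolds across six libraries.
toy_states = {
    "scafA": ["WW", "WW", "WC", "WC", "CC", "WC"],
    "scafB": ["WW", "WC", "CC", "WC", "CC", "WW"],
    "scafC": ["WW", "WW", "WC", "CC", "CC", "WC"],
    "scafD": ["WW", "WW", "CC", "WC", "CC", "WW"],
}
print(order_scaffolds(toy_states))
```

The direction of the resulting chain is arbitrary, since strand-state concordance alone cannot distinguish an order from its reverse. A greedy walk is used here only for brevity; any method that minimises the total discordance between neighbouring scaffolds would serve the same illustrative purpose.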

Discussion
The quality of genome assemblies is determined by the methods employed to build them, the algorithms used to create contigs and chromosomes, and the complexity of the genome. Genomes with high levels of repetitive elements have the potential to be assembled erroneously, resulting in fused chimeric contigs, and genomes with segmental duplications can be collapsed or overrepresented as multiple copies [32]. Algorithms used to build contigs from overlapping sequences can vary widely [14], often resulting in chimeric contigs which may be retained in future builds.
Our results show that the quality of each original assembly is highly variable, which likely derives from the complexity of the genome, the type of technologies used for sequencing/scaffolding, and the algorithms used to build the assemblies [33]. Moreover, while model organisms often have homogeneous genomes due to inbreeding, the genetic heterogeneity of outbred organisms complicates and confounds assembly strategies. Nucleotide variation can interfere with the joining of contigs, but, more drastically, large polymorphic structural variation can impede the ability to create a reliable assembly (Figure 3). This kind of structural variation is prevalent within the human population, where 1.2% of the genome (34.91 Mb) represents regions in which polymorphic inversions have been detected [16]. It is possible that sequencing a variety of outbred animals and creating a composite assembly will therefore result in conflicting scaffold joins, with inter-animal structural variation confusing the orientation and location of fragments. As such, the hybrid approach used for the pig assembly may explain the large degree of misorientation we observed. Here, the data, primarily derived from a female Duroc sow, were combined with sequences from four other porcine breeds: Large White, Meishan, Yorkshire, and Landrace [28]. The AB zebrafish cell line used in our study was from a different strain than was used for the Tübingen assembly [31], and we identified a previously described [31] polymorphic pericentromeric chromosomal inversion on chromosome 3 (chr3:46,945,080-56,227,809, data not shown). While we are unable to exclude the possibility that the assembly misorientations identified in our study are homozygous polymorphic inversions, other methods, including de novo assembly through sequencing, are also not immune to this issue.
Furthermore, while heterozygous inversions can resemble contig mis-joins, they can be resolved since they display unique patterns within our data (Supplementary Figure S7). By combining this approach with Strand-seq haplotyping [21,22], we will be able to further resolve and phase these structures during the assembly process, although an initial assembly with which to align is still necessary.
Using Strand-seq, we have developed a novel approach to building assemblies that is completely independent of overlapping contigs. This approach can rapidly locate and localize fragments with as little as a single lane of a sequencing run. The ability to improve reference assemblies using common sequencing platforms is an advantage of Strand-seq over orthogonal methods that require specialized equipment, such as long-read sequencing and optical mapping. Furthermore, these results highlight that Strand-seq can assess contiguous sequences using multiple reads spread across fragments, and as such can readily identify contig mis-joins. This approach has more in common with traditional genetic mapping strategies than standard assembly approaches, and can be applied to assemblies at the contig, scaffold, chromosome, or complete stages. By identifying the order of scaffolds, this method will further aid efforts to sequence across gaps using targeted PCR-based or long-read strategies. Collectively, we show that this approach simultaneously stratifies, orients, and corrects assemblies. As the field relies more and more on computational assembly building from shorter massively parallel sequence reads, the opportunity for incorrect dovetail joining of overlaps to introduce chimeric contigs is increased. Taken together, our results show that Strand-seq, in combination with other sequencing methods, is an effective approach for improving genome assemblies because it allows fragments to be immediately corrected, oriented, and linked together.

Cell Culture
All cell lines were obtained from the American Type Culture Collection (ATCC, Manassas, VA 20110, USA), with the exception of the diploid X. tropicalis Speedy cell line [34], which was a kind gift from Nicolas Pollet (Paris, France). The Guinea pig cell line 104C1 was cultured in RPMI1640 (Gibco, Thermo Scientific, Waltham, MA, USA, 02451-02454) supplemented with 10% FCS (Hyclone, Thermo Scientific). Cells were cultured at 37 °C with a media change every 3 days. Cells were passaged using 0.25% (w/v) trypsin and 0.03% (w/v) EDTA for 5 min with a 1:4 split ratio. The ferret cell line Mpf was cultured in Eagle MEM (Gibco) supplemented with Earle's Balanced Salt Solution (Sigma-Aldrich Canada Co., Oakville, ON, Canada) and 15% lamb serum, at 37 °C with media renewal every 3 days. Cells were passaged by rinsing with PBS then dissociating with Trypsin/EDTA solution for 10 min with a 1:5 split ratio. The pig cell line SK-RST was cultured in Eagle's MEM with 10% FCS at 37 °C with media renewal every 2 days. Cells were passaged with Trypsin/EDTA for 10 min and subcultured with a 1:6 split ratio. The zebrafish AB.9 cell line was cultured in DMEM supplemented with 15% heat-inactivated FBS at 28 °C with a media change every 2 days. Cells were passaged using 0.25% (w/v) Trypsin, 0.53 mM EDTA, 0.5% PVP solution for 8 min. Cells were subcultured at a 1:3 ratio. The Xenopus Speedy cell line was grown in 67% (v/v) L15 medium adjusted to amphibian osmolarity by diluting with sterile water, with 10% heat-inactivated FBS at 28 °C with media renewal every 3 days. All cells were grown at constant humidity in normoxic conditions.

Preparation of Strand-Seq Libraries
Strand-seq libraries were prepared according to the previously reported protocol [17]. Sequencing reads were aligned using bwa to the most recent available assembly for each organism (listed in Table 1) and compressed into BAM files [17]. Chromosome ideograms were plotted and misorients were identified using the Bioconductor package ContiBAIT [24].
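To illustrate how read directionality in an aligned library translates into a strand state for a fragment, the following Python sketch classifies one scaffold in one library from the SAM flags of its reads. It is a simplified stand-in and not the ContiBAIT method: the function name, the 0.8 cut-off, and the minimum read count are assumptions chosen for illustration. Only the meaning of SAM flag bits 0x4 (read unmapped) and 0x10 (read on the reverse strand) is taken from the SAM format, and the sketch assumes single-end reads or first-in-pair reads only, with forward-strand reads labelled Crick and reverse-strand reads labelled Watson.

```python
UNMAPPED = 0x4   # SAM flag bit: read is unmapped
REVERSE = 0x10   # SAM flag bit: read aligned to the reverse strand


def call_strand_state(flags, min_reads=10, homozygous_cutoff=0.8):
    """Classify a scaffold as 'WW', 'WC', or 'CC' from the SAM flags of its reads.

    Returns None when too few mapped reads are available to make a call.
    The cut-offs are illustrative assumptions, not published parameters.
    """
    mapped = [f for f in flags if not f & UNMAPPED]
    if len(mapped) < min_reads:
        return None
    forward = sum(1 for f in mapped if not f & REVERSE)
    forward_fraction = forward / len(mapped)
    if forward_fraction >= homozygous_cutoff:
        return "CC"  # almost all reads on the forward (Crick) strand
    if forward_fraction <= 1 - homozygous_cutoff:
        return "WW"  # almost all reads on the reverse (Watson) strand
    return "WC"      # mixed directionality: one template of each kind


# Hypothetical flag lists for three scaffolds in a single library.
print(call_strand_state([0] * 18 + [16] * 2))
print(call_strand_state([16] * 20))
print(call_strand_state([0] * 10 + [16] * 10))
```

Under these assumptions, a fragment that is repeatedly called WW in libraries where the remainder of its chromosome is called CC, and vice versa, shows the pattern expected of a misorientation.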

Orthogonal Curation Methods
Optical maps for the zebrafish genome were produced using the BioNano Irys system [35]. The data were aligned using RefAligner and displayed in the Genome Evaluation Browser (gEVAL) [36], alongside other data types for ease of assembly quality assessment. The Strand-seq observations were validated against the collection of aligned data, and changes were incorporated into the assembly release following the established curation routines of the Genome Reference Consortium [35,37].