Physlr: Next-Generation Physical Maps

While conventional physical maps helped build most of the reference genomes we use today, generating the maps was prohibitively expensive, and the technology was abandoned in favor of whole-genome shotgun sequencing (WGS). However, genome assemblies generated using WGS data are often less contiguous. We introduce Physlr, a tool that leverages long-range information provided by some WGS technologies to construct next-generation physical maps. These maps have many potential applications in genome assembly and analysis, including, but not limited to, scaffolding. In this study, using experimental linked-read datasets from two humans, we used Physlr to construct chromosome-scale physical maps (NGA50s of 52 Mbp and 70 Mbp). We also demonstrated how these physical maps can help scaffold human genome assemblies generated using various sequencing technologies and assembly tools. Across all experiments, Physlr substantially improved the contiguity of baseline assemblies over state-of-the-art linked-read scaffolders.


Introduction
A genome contains genetic instructions for the development, functioning, growth, and reproduction of any known living organism (or virus). Thus, a primary step for many bioinformatics studies is to determine the order of nucleotides in the genome. Despite the rapid technological advances of the past few years, sequencing instruments can generate only comparatively short sequence readouts (sequencing reads) redundantly sampled from long target molecules. De novo genome assembly aims to reconstruct the entire genome sequence from overlapping sequencing reads and enable a wide range of downstream studies [1][2][3].
In the Sanger sequencing era, hierarchical shotgun sequencing was the dominant approach for de novo sequencing and assembly of large genomes (Figure 1a). In this approach, following the initial preparation of large-insert clones, an engineered collection of restriction enzymes was used to cut these molecules into sequence fragments, generating unique fingerprints for each. After measuring these fingerprints by gel electrophoresis, overlaps between molecules were assessed, and a physical map of molecules was created. The resulting map helped distribute the sequencing process and enabled the independent assembly of each molecule. The structure of the physical map then guided the scaffolding of the individually assembled pieces. Due to the use of clones in conventional physical maps, this approach is also known as clone-by-clone or map-based sequencing [4][5][6].

Figure 1.
(a) Hierarchical shotgun sequencing uses physical maps to independently sequence and assemble selected clones and scaffold those assemblies to reconstruct the underlying genome. (b) Whole-genome shotgun sequencing involves a fast and automated library preparation of DNA fragments (molecules) followed by high-throughput sequencing. Physlr introduces next-generation physical maps to this domain and constructs a map of molecules using reads. The resulting map enables various genomic data analyses, including scaffolding of draft assemblies.

At the turn of the century, this approach facilitated the production of high-quality reference genomes for several model organisms [7][8][9][10], enabling a broad range of studies across multiple fields [11,12]. However, this method was labor-intensive, tedious, costly, and time-consuming (13 years for the Human Genome Project [13]). Even in the earlier days of the genomics era, to obtain more affordable human genome assemblies, Venter and colleagues [14] employed whole-genome shotgun sequencing (WGS) [15]. WGS shears the entire genome randomly and combines all the resulting DNA fragments (unordered) for sequencing and assembly (Figure 1b top portion, excluding the Physlr insert). At the crossroads of genome mapping and sequencing technologies, some bioinformatics methodologies combined WGS and the long-range information provided by physical maps to improve the contiguity of WGS genome assemblies at run time [16].
In line with the community's demand for affordable technologies [17], WGS on high-throughput sequencing platforms gradually replaced hierarchical sequencing [18]. While this resulted in rapid growth in de novo sequencing studies, the resulting assemblies were often highly fragmented. To compensate, a diverse range of sequencing technologies has emerged to provide long-range genomic context to genome assemblies, including optical maps [19,20], jumping (mate-pair) sequencing libraries [21], synthetic long reads [22], linked reads [23], long reads [24], and genetic maps [25]. However, not all current pipelines routinely achieve high contiguity [26,27].
This study focuses on leveraging the information content of linked reads for de novo assembly. Linked-read sequencing improves upon short-read sequencing, tagging each short read with a barcode identifier (Supplementary Figure S1). This barcode is identical for all reads originating from the same long molecule (~100 kb), indicating that they come from the same neighborhood in the genome. However, reads from multiple molecules may share the same barcode (barcode reuse) and complicate downstream analyses. Another challenge arises from the fact that molecules are typically sequenced partially (sub-1× coverage). Therefore, the sequence of a single molecule cannot be reconstructed independently. It is feasible, however, to decipher its underlying sequence with redundant molecule sampling.
The error rate of linked reads is much lower than that of long sequencing reads, and linked-read molecules tend to span longer regions.
While an early implementation of the linked-read paradigm developed by 10x Genomics (10xG) has been discontinued, more recent developments by MGI (MGIEasy stLFR [28]) and Universal Sequencing (TELL-Seq [29]) continue to offer great potential for de novo sequencing projects. However, we note that barcode reuse remains a challenge for all these platforms, albeit to varying degrees.
On the 31st anniversary of the Human Genome Project [10], we revisited the concept of physical maps employed in that project. We introduce Physlr, a next-generation physical mapping tool using whole-genome sequencing reads (Figure 1b, Physlr insert). Physlr uses the long-range information provided by linked reads to infer a chromosome-scale physical map represented by an overlap graph of the sequenced molecules. This physical map can be used in downstream applications, such as genome assembly/scaffolding, misassembly correction, structural variant detection, and haplotype phasing. Here, we demonstrate how physical maps can be used to generate more contiguous assemblies compared with those produced by the current state-of-the-art linked-read scaffolding tools.

Overview of the Pipeline
Physlr runs in two stages: (a) constructing a de novo physical map of DNA fragments (molecules) from which linked reads are generated, and (b) scaffolding a draft genome assembly using the physical map (Figure 2a). It accepts linked reads and a draft assembly (from any sequencing technology) as input to its first and second stage, respectively. In the first stage, Physlr builds a molecule overlap graph. However, linked reads provide information about barcodes rather than molecules. Thus, Physlr first constructs a barcode overlap graph, where a vertex represents a barcode and a weighted edge represents sequence similarity between the reads of the two barcodes (weight being the number of shared minimizers). The topology of communities (sets of vertices densely connected internally and loosely connected externally) around each vertex harbors information about the molecules associated with a given barcode (Figure 2b). We employ a novel algorithm to detect communities and transform the barcode overlap graph into a molecule overlap graph. Next, a representative path, a physical map akin to a golden path used in the Human Genome Project [4,10], is computed and output. In the second stage, Physlr uses the physical map to order and orient contigs of the input draft assembly into scaffolds.

Physlr Implementation, Stage 1: Constructing Physical Maps
In generating physical maps, Physlr transforms linked reads to a barcode graph, then to a molecule graph, and finally to the output physical map (Figure 2a). To mask repetitive sequences efficiently, Physlr first uses ntCard [30] to estimate the cardinality of k-mers (subsequences of length k) of the input data and ntHits [31] to generate a Bloom filter of repetitive k-mers. Subsequently, it generates k-mer minimizer sketches from the reads [32], avoiding minimizers from repeat k-mers found in the ntHits Bloom filter (Figure 2a, indexlr). Accordingly, each barcode is associated with a bag of k-mer minimizers derived from reads of that barcode.
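The sketching step can be illustrated with a minimal Python sketch. It is not the indexlr implementation: the function names are hypothetical, a plain Python set stands in for the ntHits Bloom filter, lexicographic order stands in for a hash-based ordering of k-mers, and minimizers are pooled into a set rather than a bag.

```python
def minimizer_sketch(seq, k, w, repeat_kmers=frozenset()):
    """Return the window minimizers of seq, ignoring k-mers flagged as repeats."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        # Candidate k-mers of this window, with repeat k-mers masked out.
        window = [km for km in kmers[start:start + w] if km not in repeat_kmers]
        if window:
            # Assumption: lexicographic order replaces the hash order of a real tool.
            sketch.add(min(window))
    return sketch


def barcode_minimizers(reads, k, w, repeat_kmers=frozenset()):
    """Pool the minimizers of all reads that share a barcode.

    reads is an iterable of (barcode, sequence) pairs.
    """
    pooled = {}
    for barcode, seq in reads:
        pooled.setdefault(barcode, set()).update(
            minimizer_sketch(seq, k, w, repeat_kmers)
        )
    return pooled
```

For example, `minimizer_sketch("ACGTACGT", 3, 2)` selects the k-mers ACG, CGT, and GTA, whereas masking ACG as a repeat shifts the selection to CGT, GTA, and TAC.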
Physlr then constructs a barcode overlap graph wherein vertices represent barcodes and edges connect barcodes sharing a minimum number of minimizers. In an ideal scenario without barcode reuse or repeats, we expect this graph to be composed of multiple connected components, one per chromosome (Figure 2e). We also expect each component to be a long linear graph, or, in graph theory terminology, a graph with a small radius and a large diameter, the former scaling with the average number of molecules covering each base position on the target genome, and the latter scaling with the chromosome length (Figure 2, panels c and e). However, in reality, the graph is more complex due to each set of molecules sharing the same barcode being collapsed into a single vertex (Figure 2, panels b and d).
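As an illustration of this construction, the following Python sketch builds a weighted barcode overlap graph from per-barcode minimizer sets. The function name and the `min_shared` parameter are hypothetical; the inverted index is one simple way to find barcode pairs that share minimizers and is not necessarily how Physlr does it.

```python
from collections import defaultdict
from itertools import combinations


def barcode_overlap_graph(barcode_to_minimizers, min_shared):
    """Return {(barcode_a, barcode_b): shared_minimizer_count} for retained edges."""
    # Invert the index: minimizer -> barcodes whose reads contain it.
    minimizer_to_barcodes = defaultdict(set)
    for barcode, minimizers in barcode_to_minimizers.items():
        for minimizer in minimizers:
            minimizer_to_barcodes[minimizer].add(barcode)

    # Every pair of barcodes listed under a minimizer shares that minimizer.
    shared = defaultdict(int)
    for barcodes in minimizer_to_barcodes.values():
        for a, b in combinations(sorted(barcodes), 2):
            shared[(a, b)] += 1

    # Keep only edges supported by at least min_shared minimizers.
    return {edge: count for edge, count in shared.items() if count >= min_shared}
```

The edge weight is the number of shared minimizers, matching the definition of edge weight given in the pipeline overview.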
To transform the barcode graph into a molecule graph, Physlr iterates over all vertices and extracts a neighborhood subgraph for each barcode (Figure 2b). Molecules associated with each barcode originate from independent genomic loci and overlap with different sets of barcodes, thereby creating new communities for the subgraph. Physlr deconvolutes each barcode into molecules using a novel community detection algorithm.
In summary, we randomly split the neighborhood subgraph of a barcode into bins (Supplementary Figure S2). For each bin, we detect biconnected components, and for each component, we calculate the cosine similarity between pairs of vertices. Edges connecting less similar vertices are removed, and connected components of the subgraph are returned as a community. To avoid over-splitting, we compare these communities and merge them if they share enough edges. The algorithm is discussed in further detail in Section 3.
A small fraction of reused barcodes may remain unresolved. Because we expect a small radius and large diameter for molecule overlap graphs, in a given maximum spanning tree (MST) of such graphs, every vertex can possess many short branches and at most two arms (long branches). However, unresolved barcodes convolute the graph, which manifests as vertices with more than two arms in the MST. Physlr employs a linear-time belief propagation algorithm [33] to inform all vertices in the MST on how many arms they possess.
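For readers unfamiliar with maximum spanning trees, the following sketch computes one with Kruskal's algorithm and a union-find structure. This is a textbook method chosen for illustration; the text does not state which MST algorithm Physlr uses, and the function name is hypothetical.

```python
def maximum_spanning_tree(vertices, weighted_edges):
    """Return the edges (u, v, weight) of a maximum spanning tree (or forest)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the set representative, halving the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    # Visit the heaviest edges first; keep an edge only if it joins two trees.
    for u, v, weight in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.append((u, v, weight))
    return tree
```

If the input graph has several connected components, the result is a spanning forest with one tree per component.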
In general, for a vertex-edge pair u and (u, v) in a given tree, this algorithm calculates a belief b_{u,(u,v)} as a function f() of a defined property of the vertex u and of the beliefs that u receives through its other edges. The algorithm starts from the leaves of the tree and calculates a belief b_{u,(u,v)} only when all beliefs required for the calculation are given; then, it passes the belief to vertex v through edge (u, v). As a result, all beliefs are calculated and propagated by passing two messages (beliefs) through each edge of the tree. As the number of edges in a tree is of the same order as the number of vertices, this algorithm runs in linear time.
In our case, the beliefs inform vertices about the size of their branches, and Physlr flags those with more than two arms as junctions. As they form only a small fraction of barcodes (hundreds out of millions), Physlr reassesses them for community detection with increased sensitivity without heavy computational cost. If unresolved barcodes remain, Physlr removes them from the map. Finally, Physlr calculates an MST of the molecule graph as a representative path (physical map or backbone). This backbone comprises ordered lists of molecules, each containing a set of minimizers.
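The message-passing idea can be sketched as follows. The belief function used here is an assumption made for illustration: the belief sent from u to v is the depth of the branch hanging from u away from v, and a branch counts as an arm when its depth reaches a hypothetical `min_arm_length`. The sketch uses memoized recursion for brevity instead of the leaves-first schedule described above, so it does not reproduce the linear-time guarantee and is limited by Python's recursion depth on very long trees.

```python
from functools import lru_cache


def count_arms(tree_edges, min_arm_length):
    """Return {vertex: number of incident branches at least min_arm_length deep}."""
    neighbors = {}
    for u, v in tree_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    @lru_cache(maxsize=None)
    def belief(u, v):
        # Assumed belief: depth of the branch rooted at u and pointing away from v.
        # It depends only on the beliefs u receives through its other edges.
        incoming = [belief(w, u) for w in neighbors[u] if w != v]
        return 1 + max(incoming, default=0)

    # Each undirected edge carries two beliefs, one in each direction.
    return {
        v: sum(1 for u in neighbors[v] if belief(u, v) >= min_arm_length)
        for v in neighbors
    }
```

Vertices for which the returned count exceeds two correspond to the junctions described above.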

Physlr Implementation, Stage 2: Scaffolding Draft Assemblies
In the second stage, input draft assemblies are scaffolded using a minimizer-based mapping to the backbone. Because minimizers still associate with barcodes rather than backbone molecules, Physlr assigns barcode minimizers to their associated molecules by selecting the common minimizers shared with neighboring molecules in the molecule graph. We then map the input assembly to the backbone by comparing the minimizers of each sequence to the molecule-assigned minimizers of each molecule.
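A minimal sketch of these two steps is given below. The function names and the `min_shared` cutoff are hypothetical, and the backbone is simplified to a list of minimizer sets in backbone order.

```python
def assign_minimizers(barcode_minimizers, neighbor_minimizer_sets):
    """Keep the barcode minimizers that also occur in a neighboring molecule."""
    seen_in_neighbors = set().union(*neighbor_minimizer_sets)
    return barcode_minimizers & seen_in_neighbors


def map_contig(contig_minimizers, backbone, min_shared):
    """Return (backbone position, shared count) for molecules the contig matches.

    backbone is a list of molecule-assigned minimizer sets in backbone order.
    """
    hits = []
    for position, molecule_minimizers in enumerate(backbone):
        shared = len(contig_minimizers & molecule_minimizers)
        if shared >= min_shared:
            hits.append((position, shared))
    return hits
```

A contig that matches consecutive backbone positions can then be anchored to that stretch of the physical map.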
Finally, we employ ARCS (in ARKS mode) [34,35] to order and orient the input assembly contigs relative to the backbone. ARCS orients scaffold targets by tallying the number of barcodes that share k-mers with both contig ends for each scaffold orientation (head-head, head-tail, tail-head, and tail-tail) and selecting the orientation with the highest number of supporting barcodes. Physlr also extracts distance estimations between scaffold targets calculated by ARCS to report the number of undetermined bases in scaffold gaps.
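The orientation vote described above can be sketched as a simple tally. This is an illustration of the voting principle only, not ARCS code; the function name and the input representation (one pair of contig ends per supporting barcode) are assumptions.

```python
from collections import Counter


def pick_orientation(barcode_hits):
    """Return (orientation, support) for the best-supported contig orientation.

    barcode_hits holds one (end_of_contig_a, end_of_contig_b) pair per barcode,
    where each end is "head" or "tail". The input must not be empty.
    """
    tally = Counter(f"{end_a}-{end_b}" for end_a, end_b in barcode_hits)
    orientation, support = tally.most_common(1)[0]
    return orientation, support
```

With three barcodes supporting tail-head and one supporting head-head, the function selects tail-head with a support of 3.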

Evaluations
We ran Physlr with stLFR sequencing data from two human (Homo sapiens) cell lines, NA12878 and NA24143. Reads were downloaded from Genome in a Bottle (Supplementary Table S1) and were reformatted to include barcodes in their headers (standard 10xG linked-read format using the BX:Z tag). Physlr provides default parameter settings for various linked-read technologies with the "protocol" option, which controls the percentage of edges (those with the lowest weights) removed from the barcode overlap graph. For the test runs, we set protocol = stlfr (removes 15% of weak overlaps) and used 48 threads.
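The two operations mentioned above can be sketched in Python. Both functions are hypothetical illustrations: the header layout (read name followed by whitespace-separated tags) is assumed, and the exact rule Physlr applies when removing weak edges, including its handling of ties, may differ.

```python
def barcode_from_header(header):
    """Return the barcode stored in a BX:Z tag of a read header, or None."""
    for field in header.split()[1:]:
        if field.startswith("BX:Z:"):
            return field[len("BX:Z:"):]
    return None


def drop_weakest_edges(edge_weights, percent):
    """Remove the given percentage of edges, starting from the lowest weights."""
    n_drop = len(edge_weights) * percent // 100
    # Sort edges from heaviest to lightest and keep all but the last n_drop.
    ranked = sorted(edge_weights, key=edge_weights.get, reverse=True)
    kept = ranked[:len(ranked) - n_drop]
    return {edge: edge_weights[edge] for edge in kept}
```

For a graph with 20 edges, a setting of 15% removes the three lightest edges.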
Physlr builds physical maps de novo but can accept a reference genome to map the minimizers associated with molecules and calculate the length of its physical map in base-pair coordinates for evaluation purposes. This calibration enabled the generation of ideograms (Figure 3) and the calculation of NG50 values (NG50: the maximum length such that pieces of at least this length contain at least 50% of the target genome length). In our tests, we used the human genome build GRCh38 (excluding chromosome Y because both cell lines originated from female individuals) as the reference.
We benchmarked Physlr against ARCS (setting the -arks option) and SLR-Superscaffolder (SLR-SS) [36], two state-of-the-art linked-read scaffolding tools. Genomes of the two cell lines (NA12878 and NA24143) were sequenced using four technologies: MGI stLFR, Illumina paired-end and mate-pair (PE+MPET), Oxford Nanopore Technology (ONT), and Pacific Biosciences (PacBio) (Supplementary Table S1). We assembled the stLFR data (same data as in the first stage of Physlr) using Supernova [37] and the PE+MPET data using ABySS 2 [38]. We used Shasta [39] assemblies of the ONT data and Falcon [40] assemblies of the PacBio data, available from Genome in a Bottle (GIAB). We used QUAST [41] to calculate quality metrics for the genome assemblies, from which we visualized the number of misassemblies against the NGA50-NG50 range (NGA50: similar to NG50, but considers alignment blocks instead of contig lengths) as proxies for correctness and contiguity, respectively (showcased in Figure 4). We used GRCh38 as a reference for QUAST; thus, some reported misassemblies were likely genome-specific structural variants for given individuals. More information about assemblies, software versions, and parameters is provided in Supplementary Tables S2-S4.
Genome assembly consistency (Jupiter) plots [42], based on Circos [43], enable a visual evaluation of genome assemblies. The tool plots a draft assembly against a reference assembly on a circle circumference and connects aligned blocks with ribbons. A high-quality assembly results in well-ordered and well-oriented syntenic blocks, while large-scale misassemblies are apparent as crossing ribbons. Figure 5 shows a set of Jupiter plots for the NA12878 assemblies. Here, for each assembly, we calculated the N75 (the maximum length such that pieces of at least this length contain at least 75% of the total assembly length) and L75 (number of scaffolds with length at least N75). Next, for each ternary comparison (each row in Figure 5 comparing baseline, ARCS, and Physlr), we found the minimum of L75s, min-L75, and we plotted only the min-L75 longest scaffolds for all. For example, Physlr minimized the L75 of all assemblies for the ONT (third row) at 37; hence, the 37 longest scaffolds for all ONT experiments were shown. In other words, we plotted the same number of longest pieces for all assemblies in comparison, and thus a higher proportion of the circle is covered with ribbons for an assembly with higher N75. In the middle of each plot, we also show the percentage of the genome covered by the plotted pieces. As a result, one can quickly compare the assemblies by considering (a) the ribbon coverage, (b) the size of assembly pieces on the right side, and (c) the extent of crossing ribbons (misassemblies) for each circle.

Figure 4.
Assembly quality metrics for scaffolding eight human assemblies. Each pair of horizontal lines connected vertically shows a range for NGA50-NG50 of an assembly. Each column corresponds to a sequencing technology (and genome assembly tool, indicated in parentheses) used to generate the baseline assembly, and each row corresponds to a human individual. For each experiment, we evaluated a baseline assembly against scaffolding outputs of Physlr, ARCS [34], and SLR-Superscaffolder [36]. (MPET: Illumina mate-pair).

Figure 5.
Jupiter plot visualizations for NA12878 assemblies against the reference. Each row contains various assemblies (baseline, ARCS [34], and Physlr) for a specific technology and illustrates only a fixed number of the largest scaffolds: the minimum L75 (min L75) of the assemblies in the row (labelled under Physlr, which had the minimum L75 for each row). Physlr consistently presented larger pieces and higher ribbon coverage while keeping crossing ribbons at a low rate. We show ribbon coverage (the percentage of reference covered with min L75 scaffold sequences) in the middle of each plot.
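The contiguity statistics used in these evaluations (NG50, N75, and L75) can be computed from a list of piece lengths, as in the following sketch. The function name is hypothetical; QUAST computes these metrics in practice, and NGA50 additionally requires alignment blocks, which are not modelled here.

```python
def nx_length(lengths, fraction, total=None):
    """Return (length, count) for an Nx-style contiguity statistic.

    length is the largest value such that pieces at least that long sum to at
    least fraction * total; count is the number of pieces needed to get there.
    total defaults to the assembly length (N statistics); pass the genome size
    to obtain NG statistics. Returns (0, number of pieces) if never reached.
    """
    total = sum(lengths) if total is None else total
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= fraction * total:
            return length, count
    return 0, len(lengths)
```

For pieces of 50, 30, 10, and 10 units, `nx_length(lengths, 0.75)` gives an N75 of 30 reached with two pieces (L75 = 2), while `nx_length(lengths, 0.5, total=200)` gives an NG50 of 10 against a genome of 200 units.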

Constructing Physical Maps
We generated physical maps for two human cell lines using stLFR linked reads (Section 2.4. Evaluations). A physical map of molecules comprises multiple connected components, one per chromosome in the best-case scenario. In assessing the contiguity of the maps, we found that the minimum number of graph components containing at least 75% of the vertices (molecules) was 32 and 44 for NA12878 and NA24143, respectively. This suggests that, at this length cutoff, Physlr produced nearly two components per chromosome, on average. Next, we converted the coordinates to base pairs and calculated the NGA50 values of the maps as 52.4 Mbp and 70.49 Mbp for NA12878 and NA24143, respectively.
As seen in the ideograms generated from the NA24143 physical map (Figure 3), five chromosomes were covered by a single piece, with the physical maps spanning over centromeres for chromosomes 6 and 19. Nine other chromosomes were covered with two backbone pieces each (one backbone per chromosome arm). The most fragmented chromosome was chromosome 5, covered with five backbone pieces. Similar results were observed for the physical map for NA12878, as shown in Supplementary Figure S3.
To demonstrate that Physlr is robust to changes in the linked-read technology, we also built physical maps using 10xG Chromium data for NA24143. The resulting physical map had an estimated NGA50 of over 70 Mbp (Supplementary Results).

Scaffolding Draft Assemblies
We evaluated the potential value of physical maps generated by Physlr in scaffolding draft assemblies (Section 2.4. Evaluations). Physlr increased the NG50 and NGA50 across all experiments (Figure 4). For example, it improved the NG50 and NGA50 values of two short-read assemblies (PE+MPET) by over 45-fold and 12-fold, respectively, and increased the number of misassemblies by less than 1-fold on average. In another example, Physlr improved a long-read (ONT) assembly of NA12878: 7.9/2.8-fold change (54.4/17.3 Mbp) in NG/NGA50 with less than 0.3-fold increase in misassemblies. For the same experiment, Physlr outperformed ARCS with a 2.1/1.2-fold change in NG/NGA50 values.
Overall, Physlr-scaffolded assemblies reached NGA50s of up to 21.7 Mbp and 22.3 Mbp for NA12878 and NA24143, respectively. The NG50 improved consistently for all Physlr assemblies and ranged between 54-60 Mbp and 78-80 Mbp, respectively. However, the NG50s of ARCS and SLR-SS scaffolded assemblies varied in wider ranges for different technologies. While Physlr improved the contiguity of assemblies over other tools, the number of misassemblies was either lower or only marginally higher relative to the substantial increase in contiguity. All tools only slightly enhanced the NGA50 of the Supernova baseline assemblies; these baseline assemblies contained more errors compared to other baseline assemblies.
To better visualize the metrics in Figure 4 for the NA12878 assemblies, we generated Jupiter plots (Figure 5). Each row shows a baseline assembly, its scaffolding with Physlr, and its scaffolding with ARCS. For all rows, Physlr achieved the lowest L75s, maximized the ribbon coverage, contributed larger contigs, and produced only a few inconsistent ribbons, all of which suggest a better performance as discussed above (Section 2.4. Evaluations).
We also successfully scaffolded the same set of baseline assemblies for NA24143 using a physical map of 10x Genomics linked reads and presented outcomes in Supplementary Results (Physlr for 10xG Chromium data) and Supplementary Table S6.

Deconvoluting Barcodes via Community Detection
Barcode reuse is a fundamental challenge for linked-read technologies. While all other reference-free tools ignore barcode reuse (except for Minerva [44], only applicable for metagenomics), Physlr uses long-range information aggregated in the graph to split barcodes into molecules. Each vertex b_i in the barcode graph comprises multiple hidden molecules mol_{i,1} to mol_{i,M} (Supplementary Figure S4). Each constituent molecule mol_{i,m} originates from a different region in the genome. Thus, each molecule tends to overlap with a different set of molecules, all of which originate from the same genomic site. Consequently, mol_{i,m} connects vertex b_i to barcodes C_{i,m}, a set (community) of barcodes that are highly connected internally because they contain molecules from the same genomic region. In other words, C_{i,m} is a strong community of barcodes adjacent to b_i through mol_{i,m}. Ultimately, b_i is connected to multiple communities C_{i,1} to C_{i,M}, which are strongly connected internally and weakly connected to one another, as each tends to originate from a different region. Thus, we can deconvolute a barcode-vertex b_i by detecting community patterns (C_{i,1} to C_{i,M}) in the neighborhood subgraph. Figure 2b illustrates one real-world example of a barcode's neighborhood in the presence of barcode reuse; four distinct communities imply the barcode contains at least four molecules.
Following this logic, Physlr iterates over all vertices and deconvolutes each barcode into its constituent molecules one at a time (Supplementary Figure S5). To achieve this deconvolution, we inspect each barcode's neighborhood subgraph, the vertex-induced subgraph of a barcode's immediate neighbors. We expect this subgraph to contain multiple communities, one per molecule (Figure 2b). Communities are detected (explained below), and the focal barcode is split into multiple molecule vertices, one per community; we connect each molecule-vertex to all vertices in its respective community.
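The splitting step can be illustrated with a small Python sketch. Plain connected components are used here as a stand-in for community detection, which is a deliberate simplification, and the function names and the molecule naming scheme (barcode plus index) are hypothetical.

```python
def neighborhood_subgraph(adjacency, barcode):
    """Vertex-induced subgraph of the immediate neighbors of barcode."""
    neighbors = adjacency[barcode]
    return {v: adjacency[v] & neighbors for v in neighbors}


def connected_components(adjacency):
    """Stand-in for community detection: plain connected components."""
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components


def split_barcode(adjacency, barcode):
    """Return one molecule vertex per community, linked to that community."""
    subgraph = neighborhood_subgraph(adjacency, barcode)
    return {
        f"{barcode}_{i}": community
        for i, community in enumerate(connected_components(subgraph))
    }
```

In this simplified setting, a barcode whose neighbors fall into two disconnected groups is split into two molecule vertices, each connected to one group.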
Physlr detects communities in millions of subgraphs, each of which comprises hundreds to thousands of vertices. Although community detection is a well-studied topic, current state-of-the-art algorithms [45,46] failed to scale up for Physlr subgraphs (Supplementary Table S8). Hence, we devised a novel algorithm for community detection.
First, we randomly split larger subgraphs into smaller (sub-)subgraphs (Supplementary Figure S2). Next, we detect biconnected components. For each component, we connect every vertex to its second-order neighbors to increase communities' interconnectivity, increasing the signal-to-noise ratio. This is implemented by squaring the adjacency matrix. We then calculate a 2-dimensional cosine similarity matrix CS of the adjacency matrix (implemented in the same manner). The value of each element CS_{i,j} reflects the extent of shared neighbors between nodes i and j. We adopt a threshold to remove weak connections and report connected components as subcommunities. Finally, we merge the resulting subcommunities if the merging increases the modularity (a measure of the strength of division of a network into modules) [47].
In summary, we use a divide-and-conquer approach: we divide the subgraphs, detect the communities in each division, and re-join related communities. As a result, despite processing millions of subgraphs, Physlr runs comparatively fast (Supplementary Figure S6 and Supplementary Tables S7 and S8).
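The squaring and cosine similarity steps can be sketched in plain Python as follows. This is one plausible reading of the description, not Physlr's implementation: the rows of the squared adjacency matrix (two-step path counts) are used as the vectors being compared, only first- and second-order neighbors are candidate links, and the random splitting, biconnected component, and modularity-based merging steps are omitted. The function name and `threshold` parameter are hypothetical.

```python
import math


def cosine_communities(vertices, edges, threshold):
    """Return communities as sets of vertices after cosine-similarity filtering."""
    n = len(vertices)
    index = {v: i for i, v in enumerate(vertices)}
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[index[u]][index[v]] = adj[index[v]][index[u]] = 1

    # Squared adjacency matrix: entry (i, j) counts the two-step paths i -> j.
    sq = [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    norm = [math.sqrt(sum(x * x for x in row)) for row in sq]

    def cosine(i, j):
        dot = sum(a * b for a, b in zip(sq[i], sq[j]))
        return dot / (norm[i] * norm[j]) if norm[i] and norm[j] else 0.0

    # Keep first- and second-order links whose endpoints are similar enough.
    kept = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if (adj[i][j] or sq[i][j]) and cosine(i, j) >= threshold:
                kept[i].add(j)
                kept[j].add(i)

    # Connected components of the filtered graph are the reported communities.
    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        members, stack = set(), [start]
        while stack:
            i = stack.pop()
            if i not in members:
                members.add(i)
                stack.extend(kept[i] - members)
        seen |= members
        communities.append({vertices[i] for i in members})
    return communities
```

On two five-vertex cliques joined by two edges, a threshold of 0.7 separates the cliques, whereas a permissive threshold of 0.3 leaves them merged, which illustrates how the threshold controls the removal of weak connections.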
Physlr also implements a divide-and-conquer version of other community detection algorithms, including tri-connected components, k-clique percolation [45], and Louvain community detection [46], and allows for a customized combination of all these choices in an iterative manner. A performance comparison between some potential combinations is provided in the Supplementary Results (Supplementary Table S8).

Discussion
Sequencing technologies are rapidly evolving, providing longer-range information. However, it is not yet a routine task to achieve assemblies with contiguity comparable to studies that benefit from the long-range information inherent in conventional physical maps. We revived the concept and presented a tool, Physlr, that constructs next-generation physical maps based on linked-read data. We showed that Physlr maps can almost cover human chromosomes in 1-5 pieces (<2.5 pieces on average). Furthermore, we used these generated physical maps to scaffold various draft human genomes assembled using four different sequencing platforms, including short-, linked-, and long-read technologies, and showed substantial contiguity gains in each scenario. This suggests that Physlr can substantially boost assembly projects with linked-read data. Physlr may also be used to improve recently published linked-read genome assemblies [48][49][50][51][52] reusing the same data.
While traditional maps were mainly used in genome assembly and scaffolding projects (due to high mapping cost), their modern alternatives such as optical maps, long reads and, more recently, linked reads are used in a broader range of applications: personal genome assembly [53], structural variation and recombination detection [54][55][56][57], haplotyping assemblies and variants [23,58,59], assembly correction and evaluation [60,61], etc. In the same manner, next-generation physical maps would be applicable to this analysis spectrum. Physlr compiles long-range information from separated molecules into a unified physical map, enabling "longer-range", more robust inference, as demonstrated for scaffolding. Due to the high contiguity of Physlr physical maps, they have great potential to provide a big picture of the genome structure and to serve various downstream genomic studies.
We based our work on linked reads since their molecules tend to span longer genomic loci than long reads and are closer in size to molecules in conventional physical maps. Linked reads are sequenced through short-read sequencing instruments and thus have a very low error rate, which enables decent overlap detection in Physlr. As long-read technologies gradually close in on short linked reads in terms of error rate, cost, and especially molecule (fragment) size, Physlr may be adapted to build a physical map of long reads.
The contiguity proxy NG50 can increase by solely introducing new misassemblies. To validate that Physlr increases the contiguity, we also looked at NGA50, which considers alignment blocks instead of contig lengths and thus accounts for the effect of misassemblies. Additionally, Figure 5 visually confirms that the contiguity of Physlr scaffolds was not due to large-scale misassemblies. Moreover, only a fraction of the joins that Physlr made were flagged as QUAST misassemblies (some of which are likely individual-specific structural variants).
Physlr currently trusts the input assembly over its physical map where they contradict. Thus, an initial assembly containing numerous misassemblies may restrain physical maps' potential by preventing Physlr from correcting the assembly and making many potential joins; in two of our experiments, Supernova generated numerous misassemblies in baseline assemblies and Physlr (and other tools) increased the contiguity only slightly. Thus, in devising an assembly pipeline, we suggest including an assembly correction tool like Tigmint [61] prior to Physlr scaffolding. Sequencing technology-specific assemblers are available to use upstream in the pipeline, as needed [37][38][39][62][63][64].
Physlr requires at least two physical map nodes (molecules) mapping to a contig of the input assembly to anchor, orient, and scaffold the contig. Thus, we recommend that the input assembly contains contigs larger than the size of one molecule (>100 kbp).
To the best of our knowledge, Physlr is the only scalable tool that can deconvolute reused barcodes into their associated molecules de novo. For this purpose, we devised a novel community detection algorithm based on a divide-and-conquer approach, cosine similarity, and k-clique percolation, while its customizable pipeline works with other well-known algorithms as well. The performance of community detection algorithms relies heavily on the topological nature of the network/graph [65]. Our algorithm outperformed all others on our overlap graphs; thus, we promote it as a potential community detection algorithm that would suit other studies.

Conclusions
With Physlr, we introduced next-generation physical maps based on linked reads and demonstrated the potential of the physical maps to benefit genomic studies by showcasing improvements in scaffolding genome assemblies. Physlr outperformed state-of-the-art linked-read scaffolders and substantially increased the contiguity (both NG50 and NGA50, for all eight human assemblies considered) while performing well in keeping misassemblies comparatively low.
Physlr may be used to scaffold genome assemblies in linked-read or hybrid projects or to generate physical maps and empower other downstream applications and studies.