A Multireference-Based Whole Genome Assembly for the Obligate Ant-Following Antbird, Rhegmatorhina melanosticta (Thamnophilidae)

: Current generation high-throughput sequencing technology has facilitated the generation of more genomic-scale data than ever before, thus greatly improving our understanding of avian biology across a range of disciplines. Recent developments in linked-read sequencing (Chromium 10 × ) and reference-based whole-genome assembly o ﬀ er an exciting prospect of more accessible chromosome-level genome sequencing in the near future. We sequenced and assembled a genome of the Hairy-crested Antbird ( Rhegmatorhina melanosticta ), which represents the ﬁrst publicly available genome for any antbird (Thamnophilidae). Our objectives were to (1) assemble sca ﬀ olds to chromosome level based on multiple reference genomes, and report on di ﬀ erences relative to other genomes, (2) assess genome completeness and compare content to other related genomes, and (3) assess the suitability of linked-read sequencing technology for future studies in comparative phylogenomics and population genomics studies. Our R. melanosticta assembly was both highly contiguous (de novo sca ﬀ old N50 = 3.3 Mb, reference based N50 = 53.3 Mb) and relatively complete (contained close to 90% of evolutionarily conserved single-copy avian genes and known tetrapod ultraconserved elements). The high contiguity and completeness of this assembly enabled the genome to be successfully mapped to the chromosome level, which uncovered a consistent structural di ﬀ erence between R. melanosticta and other avian genomes. Our results are consistent with the observation that avian genomes are structurally conserved. Additionally, our results demonstrate the utility of linked-read sequencing for non-model genomics. Finally, we demonstrate the value of our R. melanosticta genome for future researchers by mapping reduced representation sequencing data, and by accurately reconstructing the phylogenetic relationships among a sample of thamnophilid species.


Introduction
Organismal biology has been revolutionized over the past decade by the 'omics' era, in which the rapid development of high-throughput sequencing technologies has enabled the acquisition of more genetic data than ever before, including for non-model organisms [1]. For example, collaborative endeavours to sequence thousands of bird (Bird10K project [2]) and other vertebrate genomes (Genome10K project [3]) across many countries and research groups have been launched in the past decade, and have produced promising results [4][5][6][7]. Thus, high-throughput sequencing has rapidly improved our ability to make robust inferences in various fields, including avian systematics [7][8][9], (2) assess genome completeness and content relative to other published genomes, and (3) assess the suitability of linked-read sequencing assemblies in mapping reduced-representation markers that are broadly implemented in comparative phylogenomics and population genomics studies.

Genome Sequencing and De Novo Assembly
The study specimen was a wild caught adult female Reghmatorhina melanosticta from San Martin, Peru (Museum of Southwestern Biology voucher MSB:Birds:36483; http://arctos.database.museum/guid/ MSB:Bird:36483). Muscle, heart and liver samples were frozen in liquid nitrogen and stored at −80 • C. Muscle tissue was transported to the sequencing facilities on dry ice for preservation of DNA quality. DNA extraction, linked-read library preparation and sequencing were carried out at the HudsonAlpha Genome Sequencing Center facilities (Huntsville, Alabama; https://hudsonalpha.org/sequencing/). High molecular weight DNA was extracted with Qiagen's MagAttract Kit (Qiagen, Valencia, California). Fragment lengths were verified to be over the minimum ideal length for linked-read sequencing libraries (>50 Kb) with pulsed field gel analysis. The library for the 10× Chromium platform sequencing implements bead-in-emulsion barcoding to add location-barcodes to fragments that originated from a single long DNA molecule [57]. This barcode is then used to re-assemble the short reads into pseudo long-reads post sequencing [57]. The paired-end library was sequenced with the HiSeq X Illumina platform, with sequence read length of 150 bp and average insert size of 350 bp.
We implemented the Chromium Genome Software Suite package for raw read processing, scaffold level genome assembly and structural variant mapping. Raw reads processing and de novo genome assembly were done with the software Supernova version 2.1 [58,59], which includes adapter trimming within its pipeline. The raw reads were demultiplexed and assembled to scaffolds with default settings of the mkfastq and run functions, respectively. We ran Supernova version 2.1 assembler on 40 threads and 1Tb RAM on the Sackler Institute for Comparative Genomics private server at the American Museum of Natural History for three days. The final genome sequences were generated with the mkoutput function under the "pseudohap2" style [36]. The "pseudohap2" option generates two parallel fasta files corresponding to the paternal and maternal haplotypes of the sequence. This option flattens bubbles in variant regions by randomly selecting an allele and assigning it to one of the two haplotypes, resulting in two final genome sequences composed of scaffolds with mixed occurrence of paternal and maternal haplotypes.
We used the Longranger software version 2.2.2 to map and phase structural variants in the R. melanosticta genome [59]. Longranger implements the linked-read barcode information to enhance the performance of external variant calling software by mapping the 10× raw reads to a reference genome. Longranger performs optimally when using references with a reduced number of scaffolds (preferably under 1000 scaffolds). We mapped single nucleotide polymorphisms (SNPs) and variants only to scaffolds over 150 Kb. This length was chosen after an exploratory analysis showed a good trade-off; optimally reducing the number of scaffolds in the reference genome without losing a significant amount of sequence. The 150 Kb cutoff reduced the number of scaffolds by 13%, while only reducing the total genome length by one percent (96 excluded scaffolds with a total of 11.7 Mb). We mapped variants by running the wgs function of Longranger with the Genome Analysis ToolKit (GATK) [60] as the variant caller, and used the 619 scaffolds of the de novo assembled genome from the previous step as reference. Longranger filters variants that have VCF standard phred-scaled quality score lower than 15 (QUAL < 15) or 50 (QUAL < 50) if they are heterozygous or homozygous, respectively. Heterozygous sites with allele fraction under 15% are also excluded.

Single Reference Assisted Chromosome Level Assembly
We first assembled the R. melanosticta scaffolds to chromosomes using the software Chromosomer [40]. This method generates draft chromosome level assemblies based on a BLAST  [61] between the reference genome and target genome scaffolds [39]. The algorithm considers a scaffold to be anchored to a specific position on the reference genome if the ratio between the first and second highest alignment score is higher than a predefined ratio threshold (default is 1.2). If the ratio is lower than the threshold, the scaffold is not mapped; it is listed as unplaced (on a given chromosome) if the two best hits map to the same chromosome, or unlocalized, if the best hits are on different chromosomes [40]. Although Chromosomer is ideally implemented for genomes from closely related taxa, the conserved nature of avian genomes [62] likely reduces the rate of insertion errors associated with an increase in phylogenetic distance between the reference and target taxa. We mapped scaffolds to all available chromosome-level genomes of the order Passeriformes found on GenBank and to relatively high-quality genomes representing four other avian orders ( Table 1). The total sample, in order of increasing relatedness (all reference passerines are of the suborder Passeri and are therefore equidistant to R. melanosticta), included (1)  We converted the masked repeat regions from GenBank assemblies to BLAST readable masks with the convert2blastmask function from the NCBI BLAST+ package [68], implementing the "repeatmasker default" option. We created BLAST databases from each reference with makeblastdb and aligned R. melanosticta scaffolds to reference genome databases with blastn [68]. We mapped the target scaffolds to the reference genomes with the fragmentmap function from the Chromosomer package [40]. We set the gap size between non-overlapping scaffolds to 500 bp, which is higher than our maximum read insert-size [40]. Finally, the chromosomes were assembled with default options of the assemble function.

Multiple Reference Assisted Chromosome Level Assembly
The consistency of sequence adjacency across multiple genomes adds powerful information to referenced-based chromosome assembly [39,41]. We used the software Ragout 2 [39] to assemble the R. melanosticta scaffolds into chromosomes. Ragout uses phylogenetic information to reconstruct the most likely chromosome rearrangements for the target genome [39]. First we assembled the W chromosome separately based on Ficedula albicollis (ENA accession code PRJEB7359) [69], Calypte anna (Table 1) and Gallus gallus (GenBank accession number NC_006126.5) [70]; these were the only assembled W chromosomes we found to be publicly available. As our sample is female, we took this approach to (a) assemble the W chromosome of R. melanosticta and (b) exclude confounding W chromosome scaffolds that would not be correctly mapped to any scaffolds of the reference genomes, given that none of the genomes used for multireference assembly had the W chromosome. Then, we mapped the remaining scaffolds to the five available genomes of the taxa closest to R. melanosticta; all four passerine genomes used in the previous step and to S. habroptila, representing the sister group to Passeriformes (Table 1). We created the input genome alignments in hal format for the Ragout runs with Cactus, using default options [71]. The phylogenetic topology used as reference for the alignment was based on a recently published tree for passerines [18] with no branch length information (all branch lengths = 1, Supplementary Figure S1), and for the W chromosome the topology was based on another study [8]. Given the high contiguity and sequencing depth of the scaffolds from the Supernova de novo assembly, we then ran Ragout version 2.2 with the solid-scaffolds option. The W chromosome of F. albicollis was set to "draft" because it is assembled only to the scaffold level. All other settings were left to default options. Final chromosome names in the R. melanosticta genome were based on F. albicollis chromosomes (randomly selected by Ragout).
Finally, we visually assessed synteny between the R. melanosticta genome and the nine genomes used as references by creating synteny plots between the multiple-reference-based assembly (Ragout) and each single-reference-based assembly (Chromosomer). Instead of performing regular analyses of synteny, which usually involve anchoring homologous sites from target to reference genomes [72], we compared the scaffold-to-chromosome assignment for the R. melanosticta genome assemblies based on single references as an assessment of synteny across Aves. This approach allowed us to assess synteny while comparing the performance of the multireference and single-reference assembly methods. We used the single-reference assemblies of the antbird to make graphical representations of the genomes of the reference taxa because these assemblies are highly constrained to the reference genomes' structure, although this approach loses any portion of the reference genomes that were not mapped to the R. melanosticta scaffolds. This strategy underestimates the amount of intra-chromosomal rearrangements in R. melanosticta in relation to other taxa, because the order of scaffold placement is guided by the sequence of the reference taxon. However, the high contiguity of the de novo assembled scaffolds (see Results) guarantees that some of the inherent arrangement of R. melanosticta genome remains represented within the scaffolds and allows for some assessment of synteny with other genomes (minimum suggested N50 for synteny representation is 1 Mb [72]). We plotted synteny maps in R version 3.5 [73] with the package 'circlize' [74], and chromosome ideograms with 'karyoploteR' [75]. Gaps inserted between scaffolds were removed from both types of plots for visualization purposes. Pseudo chromosome fragments (PCF) were concatenated for genome visualization with the orientation of the output fasta sequences (i.e., their concatenation does not represent actual sequence orientation or order in the R. melanosticta genome).

Evaluation of Genome Completeness
We evaluated genome completeness through direct assessment of assembly metrics such as expected and observed genome length, number of scaffolds and gap length, as well as content of well-known genomic regions of interest, such as target-capture markers and conserved single-copy genes. We estimated the proportion of sequence missing from our assembly by subtracting total gap length and final genome length from the expected genome length [76]. We used the haploid DNA content (in pg) based on flow cytometry of Rhegmatorina melanosticta (from the same specimen we used [77]) converted to Gb assuming 1 pg = 0.978 Gb [78] as an independent estimate of genome size, as well as Supernova's default genome size estimate based on kmer distribution. We estimated within-scaffold gap lengths (number of bases marked as "N") with the function comp from the seqtk package [79].
To evaluate the completeness of our genome assembly in relation to sequence content, we used the software Benchmarking Universal Single-Copy Orthologs version 3 (BUSCO) [80,81]. BUSCO measures genome completeness by quantifying the proportion of known genes from compiled datasets that are only present in genomes as single copies and are highly conserved (i.e., they are evolving under "single-copy control" and so conserved that they should be detectable in a variety of organisms [82]). BUSCO genes are good candidates for assessing genome completeness because the expectation that they are present in a given genome is reasonable from an evolutionary perspective [80][81][82][83]. We ran BUSCO on the R. melanosticta genome, plus nine related (Eufalconimorphae sensu [7,84]) genomes.
We first chose five genomes that were assembled to chromosome level and used as outgroups in our chromosome mapping approach: Falco peregrinus, Strigops habroptilus, Passer domesticus, Taeniopygia guttata, and Parus major. We then chose an additional four genomes from a recent study that sequenced dozens of bird genomes [7]: Nestor notabilis, Acanthisitta chloris, Corvus brachyrhynchus, and Manacus vitellinus. BUSCO outputs were summarized in three metrics: (1) percent complete BUSCOs (complete sequence matches), (2) percent fragmented BUSCOs (partial sequence matches), and (3) percent missing BUSCOs (unmatched BUSCO sequences). We finally compared missing genes across all species to those missing from R. melanosticta in order to understand whether missing genes were consistent or variable among all assemblies.
To evaluate the efficacy of harvesting target-capture data from linked-read sequencing genomes, we also mapped the Tetrapods-UCE-5kv1 probeset, which targets 5060 ultraconserved elements (UCEs; https://www.ultraconserved.org/). UCEs are genome-wide markers that are informative at both deep and shallow evolutionary timescales, and which have become widely used for phylogenomic and population genomic studies [12,13,85]. To do this, we used the phyluce pipeline for harvesting UCEs from genomes [86]. We converted the de novo assembled R. melanosticta genome from fasta to twoBit and extracted sequence length information from it with the faToTwoBit and twoBitInfo tools from the Kent Source Archive [87] We then aligned and harvested the UCE loci from the genome using scripts in the phyluce package.

Genotyping-by-Sequencing (GBS) Reference Mapping
In order to demonstrate the efficacy of our genome for potential future research, we mapped genotyping-by-sequencing data for six species in the family Thamnophilidae. Specimens were provided by two institutions, the American Museum of Natural History (AMNH) and the Museu Paraense Emilio Goeldi (MPEG) and included Thamnophilus aethiops (AMNH LJM 225), Myrmotherula menetriesii (AMNH GT104), Myrmotherula longipennis (AMNH GDR 275), Willisornis poecilinotus (AMNH GDR239), Hypocnemis rondoni (AMNH LJM325), and Phlegopsis nigromaculata (MPEG T15868). Library prep and sequencing was undertaken at the University of Wisconsin Biotechnology Center (Madison, WI) using Pstl and Mspl enzymes, with only the latter as the cutter. The 150 bp paired-end sequencing was performed on an Illumina NovaSeq 6000. We then used ipyrad 0.7.30 [88] to trim low-quality bases (minimum quality score = 20) from the raw Illumina reads, map the cleaned reads to the R. melanosticta reference genome at a 70% clustering threshold, and identify GBS loci with minimum statistical read depth of six. Mapped loci with more than ten ambiguous base calls, 15 heterozygous sites, or 12 SNPs were discarded to eliminate possibly erroneous and non-orthologous alignments. Because sequence divergence across these species is expected to be about 0.01-0.02 substitutions per site given a highly conserved coding gene [48], we chose these settings to allow for the somewhat higher levels of divergence expected from GBS loci. Sequences for all loci were then concatenated.
To determine the utility of the R. melanosticta assembly as a reference genome, we reconstructed the phylogenetic relationships of the six Thamnophilid species using RAxML version 8.2.4 [89], assuming a root demonstrated in previous works [42,48]. We applied 20 maximum likelihood searches assuming a GTR + gamma model of nucleotide substitution across the entire dataset. We additionally applied 500 bootstrap replicates to evaluate the robustness of each node given our data. Our expectation in employing these analyses is that a more complete genome will yield thousands of mapped GBS loci resulting in accurate, robust phylogenetic reconstruction, whereas a less complete genome would not.

De Novo Assembly
We generated 560.02 million reads with a mean length of 140 b. The de novo assembly size was 1.03 Gb with raw and effective coverage of 62× and 38×, respectively. The fraction of sequence duplication was 7.5%, and the GC content of the assembly was 42.2%. The contig N50 was 136.8 Kb. The final genome had 715 scaffolds ranging from 13.8 to 0.1 Mb in length, with a N50 of 3.3 Mb. There were 5.3 million SNPs in the R. melanosticta genome, out of which 99.8% were phased. The longest phaseblock and the phaseblock N50 were 6.9 Mb and 1.9 Mb, respectively. Out of the 257 large structural variants that were called (over 30 Kb in size), 163 were deletions, 4 were sequence inversions, 22 were sequence duplications and 68 were distal (of at least 500 Kb) sequence translocations (Figure 1,  Supplementary Figure S2). In addition to large structural variants, 3797 short deletions (from 50 bp to 30 Kb size range) were detected.

De Novo Assembly
We generated 560.02 million reads with a mean length of 140 b. The de novo assembly size was 1.03 Gb with raw and effective coverage of 62× and 38×, respectively. The fraction of sequence duplication was 7.5%, and the GC content of the assembly was 42.2%. The contig N50 was 136.8 Kb. The final genome had 715 scaffolds ranging from 13.8 to 0.1 Mb in length, with a N50 of 3.3 Mb. There were 5.3 million SNPs in the R. melanosticta genome, out of which 99.8% were phased. The longest phaseblock and the phaseblock N50 were 6.9 Mb and 1.9 Mb, respectively. Out of the 257 large structural variants that were called (over 30 Kb in size), 163 were deletions, 4 were sequence inversions, 22 were sequence duplications and 68 were distal (of at least 500 Kb) sequence translocations (Figure 1, Supplementary Figure S2). In addition to large structural variants, 3797 short deletions (from 50 bp to 30 Kb size range) were detected. The green rectangle represents two scaffolds that were homologous to Chr5 and Chr1 in multireference assembly, and the yellow rectangle represents scaffolds placed on Chr4 on single-reference assemblies.

Reference-Based Assembly
The number of scaffolds mapped to chromosomes based on a single-reference ranged from 577 to 695 and did not vary exclusively due to phylogenetic distance from reference. While the range of Figure 1. Chromosome ideogram. The black vertical lines represent UCE (Ultra Conserved Elements) placement in chromosome and the gray links represent large structural variants (insertions/deletions, inversions, duplications and rearrangements) over 30 Kb in size. The green rectangle represents two scaffolds that were homologous to Chr5 and Chr1 in multireference assembly, and the yellow rectangle represents scaffolds placed on Chr4 on single-reference assemblies.

Reference-Based Assembly
The number of scaffolds mapped to chromosomes based on a single-reference ranged from 577 to 695 and did not vary exclusively due to phylogenetic distance from reference. While the range of scaffolds mapped was similar from passerines to G. gallus (most distantly related bird used), the fewest number of scaffolds mapped were in the C. anna and C. livia assemblies ( Table 2). The number of chromosomes created ranged from 29 to 38, and closely reflected the number of chromosomes in reference assemblies. The length of gaps added to final assemblies ranged from 0.11 to 0.18 Mb (Table 2). Table 2.
Comparison of assemblies of R. melanosticta based on multiple references (first column: Ragout [39], references marked by an asterisk) and single reference genome assembly (Chromosomer [30]) corresponding to column names. Mapped refers to number of scaffolds mapped to genome (Ragout) or the sum of mapped and unlocalized (mapped to chromosome but with final position undefined) scaffolds in Chromosomer. The references are listed from furthest to closest in phylogenetic distance (passerines are listed alphabetically as they are equidistant to suboscine). The genome assembled based on multiple reference genomes (henceforth referred to as the multireference genome or assembly) was composed of 46 pseudo-chromosome fragments (PCFs), which formed 27 chromosomes (Figure 1). Fifteen chromosomes were formed by a single PCF, nine chromosomes were formed by two separate PCFs, three were formed by three PCFs and the Z chromosome was formed by four disjointed PCFs. The PCFS ranged in length from 0.29 to 103.33 Mb, which correspond to the whole chromosome 22 PCF and one of the two PCFs of chromosome 3 PCFs, respectively. The genome assembly placed 595 of the 715 de novo assembled scaffolds, while 120 scaffolds (56.17 Mb) remained unplaced. Although the number of scaffolds that were unplaced in the multireference assembly was an order of magnitude larger than those unplaced in single-reference assemblies, the unplaced scaffolds represented only five percent of the original de novo assembly. The total assembly length was 990.29 Mb in 163 scaffolds (unplaced and PCFs combined), including gaps introduced between placed scaffolds (42.89 Mb, 4.3% of the assembly). The multireference assembly N50 was 53.31 Mb.

Genome Completeness
The genome size estimated by Supernova and flow cytometry were very similar: 1.36 Gb and 1.3 Gb [77], respectively. The total length of gaps within scaffolds was 12.2 Mb (Table 3). Based on the estimated genome size, 282.2 Mb (21%) were either not sequenced or were assembled as unidentified nucleotides ("N"). The length of gaps in scaffolds were shorter in R. melanosticta than three of the five long-read assembled genomes (Table 3). In contrast, the number of scaffolds in the R. melanosticta was higher than all but one long-read assembled genome (but see [76] for example of an alternative Gallus gallus genome not used here).
Our analysis of conserved genetic marker content showed that the R. melanosticta genome was relatively complete. After extracting UCE markers from the R. melanosticta genome, we found that 87% of 5060 UCEs were present in the genome. BUSCO analysis revealed that our R. melanosticta genome was relatively complete, with 89.2% (n = 4384) of BUSCOs detected as complete sequences, just 3.5% (n = 172) as fragmented sequences, and only 7.3% (n = 355) missing in the assembly (Figure 2). This number was very similar to that of the other evaluated species, with a lower proportion of fragmented genes, but slightly higher levels of missing genes. Of the 355 BUSCOs that were missing in R. melanosticta, 16.3% (n = 58) were also missing from C. brachyrhynchos, 15.2% (n = 54) from P. major, 17.5% (n = 62) from P. domesticus, 26.5% (n = 94) from T. guttata, 14.1% (n = 50) from M. vitellinus, 14.9% (n = 53) from A. chloris, 13.8% (n = 49) from N. notabilis, 25.4% (n = 90) from S. habroptilus, 12.7% (n = 45) from F. peregrinus, and 6.5% (n = 23) were missing in all six genomes. Table 3. Comparison of the Rhegmatorhina melanosticta linked-read de novo assembly with PacBio long-read bird assemblies. Expected genome size is based on chromosome density from flow cytometry (from [76,77,90]). Missing (Mb) is an estimate of unsequenced genome (expected assembly size subtracted from the assembly size [76]). Gaps is the total number of "N"s in the assembly. The percentage of missing sequence is the sum of unsequenced genome and gap length relative to expected genome size. Corvus cornix GenBank accession number: GCA_002023255.2 [91].  Table 3. Comparison of the Rhegmatorhina melanosticta linked-read de novo assembly with PacBio long-read bird assemblies. Expected genome size is based on chromosome density from flow cytometry (from [76,77,90]). Missing (Mb) is an estimate of unsequenced genome (expected assembly size subtracted from the assembly size [76]). Gaps is the total number of "N"s in the assembly. The percentage of missing sequence is the sum of unsequenced genome and gap length relative to expected genome size. Corvus cornix GenBank accession number: GCA_002023255.2 [91].  [8,18]. Bars represent the proportion of complete (blue), fragmented (pink), and missing BUSCOs (gold) for each genome.

Synteny
Our mapping of synteny recovered relatively consistent results among species: the chromosome placement of sequences within the R. melanosticta multireference genome was similar to the placement of homologous sequences throughout Aves (Figure 3). However, two species assemblies (S. habroptila and F. peregrinus) were structurally different ( Figure 3E,F). Out of the two taxa, F. peregrinus had the most rearranged genome in relation to R. melanosticta. Chromosomes 5 and 6 in S. habroptila and R. melanosticta multireference genomes were highly syntenic. Four scaffolds of chromosome 1 in the R. melanosticta multireference genome consistently mapped to chromosome 4 in all assemblies but in F. peregrinus, where it was placed in chromosome 2 (Figure 1, yellow rectangle). This difference reflects the homology of chromosome 2 in F. peregrinus to chromosome 4 in other birds (Figure 3). One of the PCFs created by Ragout2 (2 scaffolds with a total of 520.2 Kb) was homologous to both chromosomes 5 and 1 of F. albicollis. This was the only case in which a PCF was ambiguously placed in the multireference assembly. As the PCF had more sequences homologous to chromosome 5 in the five reference genomes, we decided to represent it as part of chromosome 5 in the genome plots (highlighted in green in Figure 1). One of the scaffolds that were part of the translocation from chromosome 4 to 1 also had a distal structural variation detected by  [8,18]. Bars represent the proportion of complete (blue), fragmented (pink), and missing BUSCOs (gold) for each genome.

Synteny
Our mapping of synteny recovered relatively consistent results among species: the chromosome placement of sequences within the R. melanosticta multireference genome was similar to the placement of homologous sequences throughout Aves (Figure 3). However, two species assemblies (S. habroptila and F. peregrinus) were structurally different ( Figure 3E,F). Out of the two taxa, F. peregrinus had the most rearranged genome in relation to R. melanosticta. Chromosomes 5 and 6 in S. habroptila and R. melanosticta multireference genomes were highly syntenic. Four scaffolds of chromosome 1 in the R. melanosticta multireference genome consistently mapped to chromosome 4 in all assemblies but in F. peregrinus, where it was placed in chromosome 2 (Figure 1, yellow rectangle). This difference reflects the homology of chromosome 2 in F. peregrinus to chromosome 4 in other birds (Figure 3). One of the PCFs created by Ragout2 (2 scaffolds with a total of 520.2 Kb) was homologous to both chromosomes 5 and 1 of F. albicollis. This was the only case in which a PCF was ambiguously placed in the multireference assembly. As the PCF had more sequences homologous to chromosome 5 in the five reference genomes, we decided to represent it as part of chromosome 5 in the genome plots (highlighted in green in Figure 1). One of the scaffolds that were part of the translocation from chromosome 4 to 1 Diversity 2019, 11, 144 10 of 18 also had a distal structural variation detected by Longranger: a region of this scaffold is translocated to the PCF homologous to both chromosomes 1 and 5 described above (Figure 1). Longranger: a region of this scaffold is translocated to the PCF homologous to both chromosomes 1 and 5 described above (Figure 1).

GBS Reference Mapping
Reference mapping of the GBS data resulted in 55,753 loci total, and 4870 orthologous loci that met our conservative criteria for data retention. Our concatenated matrix of retained loci contained a total of 605,952 bp. RAxML resulted in a topology consistent with previous studies, with 100% bootstrap support across all nodes. Assuming our root was correct based on previous studies, we specifically recovered Hypocnemus to be sister to Willisornis plus Phlegopsis, and these (Tribe Pyithyini) to be sister to Myrmotherula plus Thamnophilus (Tribe Thamnophilini) [48,92]. However, we represent our results as an unrooted network to avoid this assumption (Supplementary Figure S3).

General Genome Structure, Contiguity and Content
We herein present a reference-based chromosome-level genome assembly for the obligate ant-following Rhegmatorhina melanosticta. This genome represents one of the few publicly available genomes of a suboscine (suborder: Tyranni), and the first of any species of the infraorder, Furnariides [48,92]. The proportion of the original assembly that was effectively placed in the final assembly (Ragout) is comparable to that of three other chromosome-level bird genomes assembled based on multiple references [41,93]. Our assembly of the R. melanosticta genome is both highly contiguous and relatively complete in terms of BUSCO scores ( Figure 2). In terms of completeness, we showed that our genome is similar to those of other published assemblies ( Figure 2) [4,7,94]. Although our genome was missing a slightly higher proportion of BUSCOs than most of the other genomes we evaluated, it also contained a lower number of fragmented BUSCOs, consistent with our assessment of high contiguity (i.e., high N50). BUSCO completeness measures correlate poorly with scaffold N50 making the two metrics good independent assessments of genome quality [80]. Additionally, we successfully mapped reduced representation genetic markers from both UCEs and GBS data, further demonstrating the high completeness of our assembly.
We found that the contiguity of short linked-read scaffolding has a comparable performance to that of long-read assemblies, albeit with more gap within sequences. The high scaffold number in R. melanosticta in relation to long-read assemblies likely reflects the concentration of repeat regions towards the scaffold breaks when using short-read sequencing [91], but mapping repeat regions was beyond the scope of this paper. The large number of scaffolds left unplaced to any given chromosome in the final assembly consisted mostly of short scaffolds (<1 Mb). These unplaced scaffolds were likely regions that are difficult to assemble and are therefore highly fragmented, such as highly repetitive genome regions [69]. The high occurrence of "short" scaffolds in the sex chromosomes, which have high densities of repetitive sequence in relation to autosomes, corroborate the likely high density of difficult-to-assemble repeat regions in these scaffolds (Supplementary Figure S4). Furthermore, the concentration of structural variants in these chromosomes (Figure 1, Supplementary Figure S2) in relation to autosomal chromosomes is likely due to the propensity of repeat regions to undergo insertions/deletions [95]. Additionally, some unplaced scaffolds likely correspond to whole microchromosomes, and perhaps even to pathogenic DNA [94]. Given that most bird genomes, including suboscines, have a 2n of 80 on average [76], the final R. melanosticta assembly is likely missing around 13 microchromosomes.

Analysis of Synteny
Our synteny plots reflect the relative stability of avian chromosome structure relative to other groups of organisms [62,64]. These results expand upon the high syntenic relationship found for the genomes of other passerines [67,96], as well as for non-passerine birds (e.g., Struthio camelus [93] and G. gallus [67]). It was expected that the level of synteny between F. peregrinus and S. habroptilus with the R. melanosticta genome would be lower than for other genomes; Falconiformes and Psitacifformes have highly rearranged genomes within Aves [64,93]. Regardless of their high level of rearrangement, the chromosome 1 fragment of the R. melanosticta genome that mapped to chromosome 4 in all birds was located in chromosome 2 of F. peregrinus, which is homologous to chromosome 4 in other birds. This pattern highlights the conserved nature of sequences among avian groups [67]. Given that the R. melanosticta chromosome assemblies were reference based, the synteny plots likely reflect mostly the synteny between the references. However, the syntenic relationship between R. melanosticta de facto chromosomes and the nine genomes used as reference were likely detected, as we have identified idiosyncratic R. melanosticta structure in the multireference genome. However, teasing these influences apart would only be possible with additional efforts in placing de novo assembled scaffolds, such as optical mapping or chromosome conformation capture. The translocation from chromosome 4 to 1 could represent an actual translocation from the ancestral chromosome 4 to both chromosomes 1 and 5 in R. melanosticta; the structural variant map created with LongRanger showed that a sequence fragment on the alternate haplotype of one of these scaffolds is translocated to a scaffold mapped to chromosome 5 or 1 of the multireference assembly ( Figure 1). Alternately, the two scaffolds homologous to both chromosomes 5 and 1 could actually be in chromosome 1 in R. melanosticta. A third possibility is that these two difficult to place scaffolds are a whole microchromosome that split off from the ancestral chromosomes 1 and 5. Whichever the case, R. melanosticta harbors variation in placement of at least a part of the translocated sequence, which could help unravel the path through which the split and/or merging of these genomic regions took place. Sequencing other suboscine genomes as well as using different sequencing tools, such as PacBio long reads or Hi-C mapping will clarify the placement of these sequences in the R. melanosticta genomes and when this event happened since the split from oscine.
The translocations identified in our synteny plots likely underestimate the actual genomic translocations in the R. melanosticta genome in relation to other avian genomes, given that our chromosome level assemblies were contingent on reference sequences. However, we have demonstrated that using information of multiple genomes in conjunction to place scaffolds to chromosomes allows for the emergence of patterns intrinsic to the target genome: the multireference assembly placed a group of scaffolds that were mapped to chromosome 4 in single-reference assemblies on chromosome 1 of R. melanosticta. The occurrence of rearrangements in this genomic region was detected by Longranger independently and without the use of other avian genomes as reference in the distal translocation of sequences between scaffolds in the same region ( Figure 1). This performance is likely possible due to the conservation of sequence adjacency information in the highly contiguous de novo assembled scaffolds, which is an important source of information for reference-based scaffold-to-chromosome placement [39,72].

Linked-Read Genome Applicability in Comparative Phylogenomics
We successfully extracted nearly all (87%) UCE loci from the genome, a number comparable to that of phylogenomic studies of birds that employed reduced representation genomic libraries with the same probeset [97][98][99][100]. This finding not only demonstrates the genome's completeness but also its potential for future incorporation into the growing number of studies utilizing UCEs for phylogenomic research [18,97,99,101]. The use of allele information adds power to phylogenetic and demographic reconstructions based on target-capture libraries [102], but phasing procedures of short-read non-model genomes are prone to errors [103]. In addition to recovering a high number of UCE loci in the genome, the SNPs recovered with the aid of linked-read barcodes are phased with high accuracy into long phaseblocks.
As a more in-depth assessment of the genome's utility for comparative phylogenomics, we additionally mapped GBS data and used the identified loci to reconstruct the phylogenetic relationships of a sample of antbirds. In doing so, the resulting RAxML phylogeny agreed with the well-accepted taxonomic relationships that have been recovered in multiple previous studies [42,48]. Therefore, we have demonstrated that our R. melanosticta genome is valuable for a range of potential future uses. These results also corroborate the idea that GBS data-typically considered most useful at population-genetic scales-can be highly informative even at relatively deep timescales [100,104,105].

Conclusions
Overall, we have demonstrated that linked-read technologies are valuable resources for generating high contiguity genomes, with relatively complete sequencing of putatively conserved genomic regions. Because of this high contiguity, we were able to phase SNPs in relatively long phaseblocks, a result that indicates significant informativeness of this genome for a range of studies. Similarly, the high completeness of our assembly relative to other publicly available genomes means that new data are available for incorporation into ongoing phylogenomic and population genomic work on avian evolution. We additionally showed that multireference-based assembly methods areuseful for assembling scaffolds and comparing genomic structure across the avian tree of life and found relatively stable chromosomal structure spanning at least 65 my of evolutionary history. Notably, we detected a single sequence transfer from chromosome 4 to chromosome 1 in R. melanosticta in relation to the other genomes we evaluated. Future work should determine whether this translocation is synapormorphic of Tyranni, Thamnophilidae, or another node in the R. melanosticta history. We demonstrated that genomic structure unique to the target taxon can be detected with the guidance of reference genomes in assembling scaffolds to chromosomes.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1424-2818/11/9/144/ s1, Figure S1: the phylogeny used to inform the whole-genome alignment. The topology was extracted from [18]. Figure S2: number of structural variants (insertions/deletions, duplications, rearrangements) mapped to Rhegmathorina melanosticta chromosomes in relation to chromosome length. The red points represent the sex chromosomes. Figure S3: phylogenetic relationships for some birds in the family, Thamnophilidae. The topology was inferred using RAxML based on 4,870 GBS loci mapped to the R. melanosticta genome. The relationships are consistent with well-accepted phylogenetic hypotheses [34,40]. Figure S4: distribution of lengths of scaffolds that form each chromosome (thin lines) in the Rhegmathorina melanosticta genome. The red and green lines correspond to the W and Z chromosomes, respectively. The bold line corresponds to the length distribution of all scaffolds combined. The blue dashed line represents the scaffold N50 (3.2 Mb).
Author Contributions: L.A.C. and J.C. conceived of and designed the study. L.A.C. and L.J.M. designed and carried out all analyses and wrote the paper. All authors edited the manuscript for intellectual content.