Delineating the Tnt1 Insertion Landscape of the Model Legume Medicago truncatula cv. R108 at the Hi-C Resolution Using a Chromosome-Length Genome Assembly

Legumes are of great interest for sustainable agricultural production as they fix atmospheric nitrogen to improve the soil. Medicago truncatula is a well-established model legume, and extensive studies in fundamental molecular, physiological, and developmental biology have been undertaken with the aim of translating findings into trait improvements in economically important legume crops worldwide. However, the M. truncatula reference genome was generated in the accession Jemalong A17, which is highly recalcitrant to transformation. M. truncatula R108 is more attractive for genetic studies due to its high transformation efficiency and Tnt1-insertion population resource for functional genomics. The need to perform accurate synteny analysis and comprehensive genome-scale comparisons necessitates a chromosome-length genome assembly for M. truncatula cv. R108. Here, we performed in situ Hi-C (48×) to anchor, order, and orient scaffolds and to correct misjoins of contigs in a previously published genome assembly (R108 v1.0), resulting in an improved genome assembly containing eight chromosome-length scaffolds that span 97.62% of the sequenced bases in the input assembly. The long-range physical information generated using Hi-C allowed us to obtain a chromosome-length ordering of the genome assembly, better validate previous draft misjoins, and provide further insights into accurately predicting synteny between the A17 and R108 regions corresponding to the known chromosome 4/8 translocation. Furthermore, mapping the Tnt1 insertion landscape on this reference assembly presents an important resource for M. truncatula functional genomics by supporting efficient mutant gene identification in Tnt1 insertion lines. Our data provide a much-needed foundational resource that supports functional and molecular research into the Leguminosae for sustainable agriculture and feeding the future.


Introduction
Sustainable agricultural production involves growing food with low fertilizer input without damaging the underlying soil [1]. Legumes are of great interest for sustainable agriculture because they fix atmospheric nitrogen through symbiosis, improving soil health [2,3]. Most legumes, however, have large or complex genomes and are outcrossing species, making genetic studies difficult. Medicago truncatula was chosen as a model legume due to its small genome [4], diploidy, autogamy, and short life cycle. In the past two decades, extensive studies have been undertaken in plant-bacterial symbioses and fundamental molecular, physiological, and developmental biology of M. truncatula to translate and improve traits in economically important legume crops [5][6][7][8][9]. The release of the M. truncatula accession Jemalong A17 reference genome sequence and generation of the Tnt1-based insertion mutant population for accession R108 have greatly accelerated functional genomics studies in M. truncatula [10][11][12]. The M. truncatula reference genome was generated in A17, which is highly recalcitrant to transformation, whereas the Tnt1 mutant population was generated in R108, which has a much higher transformation efficiency and is therefore more attractive for genetic studies [10]. Phylogenetically, R108 is one of the most distant M. truncatula accessions from A17 [13]. Recently, R108 has become popular in legume research communities with its near-saturated Tnt1-insertion population, which is widely used in most areas of legume functional genomic analysis [10,12]. The Tnt1 insertion population comprises 21,700 regenerated lines, encompassing more than a half-million randomly distributed Tnt1 insertions [12]. Due to the lack of high-quality pseudomolecules (chromosomes) in R108, all Tnt1 insertions have been mapped to the A17 genome. However, compared to R108 and other M. truncatula genotypes, A17 has a large (~30 Mb) reciprocal translocation between chromosomes 4 and 8 [4], resulting in inaccurate synteny analysis between M. truncatula and other legume genomes and aberrant recombination in genetic crosses, including crosses between A17 and R108 [14]. In addition, evolutionary whole-genome duplications [13,15] and frequent local duplications make genome assembly difficult. Therefore, having two high-quality references in M. truncatula will allow us to perform more accurate synteny analysis and comprehensive genome-scale comparisons, which calls for a high-quality M. truncatula cv. R108 genome sequence.
In 2017, the first draft assembly of M. truncatula cv. R108 was constructed using a combination of PacBio, Dovetail, and BioNano technologies, as described by Moll et al. [20]. Recently, we and others significantly improved draft genomes using data derived from in situ Hi-C [16][17][18][19]. As Hi-C can estimate the relative proximity of loci in the nucleus, Hi-C contact maps can be used to correct misjoins and to anchor, order, and orient contigs and scaffolds. This process improves contig accuracy and typically yields chromosome-length scaffolds. To broaden the range of genetic resources available for the model legume M. truncatula, we used Hi-C to improve the R108 v1.0 draft assembly, producing a genome assembly for M. truncatula cv. R108 with chromosome-length scaffolds. Approximately 392,000 flanking sequence tags (FSTs), identified from approximately 21,000 Tnt1 insertion lines of M. truncatula cv. R108, were mapped onto the pseudo-chromosomes of R108.

Assembly of M. truncatula Accession R108 with Chromosome-Length Scaffolds
The first draft assembly of M. truncatula cv. R108 was constructed using a combination of PacBio, Dovetail, and BioNano technologies [20]. The resulting assembly (R108 v1.0) comprised 402 Mb of sequence (contig N50 length: 5.93 Mb) partitioned among 909 scaffolds. Here, we generated in situ Hi-C data [16,18] from M. truncatula cv. R108 leaves to improve this initial draft assembly [19,21]. Scaffolds/contigs shorter than 1 Kb in the R108 v1.0 assembly were not anchored; the remaining scaffolds were anchored, ordered, oriented, and corrected for misjoins using the Hi-C data. After manual refinement using Juicebox Assembly Tools [21], as shown in Figure 1, the resulting assembly, named MedtrR108_hic, was represented by 801 scaffolds, of which eight were chromosome-length scaffolds (N50 length of 51.86 Mb), ranging from 37.80 to 55.90 Mb. The chromosome-length scaffolds spanned 97.62% of the sequenced bases in the entire assembly. The remaining 793 scaffolds (N50 length of 18.96 Kb) constituted the remaining 2.38% of the total assembly. The circular snail plots describing the assembly statistics of MedtrR108_hic and R108 v1.0 are shown in Figure 2a,b. These results are further summarized in Table 1. Additional assembly statistics can be found in Tables S1-S5.
Figure 1. Hi-C map of the draft and chromosome-length assemblies of Medicago truncatula cv. R108 genome. Contact matrices were generated by aligning the same Hi-C data set to the R108 v1.0 draft genome (left) and MedtrR108_hic genome assembly generated using Hi-C (right). Loci in MedtrR108_hic are assigned a linear color gradient; the same colors are then used for the corresponding loci in R108 v1.0 (left). The draft scaffolds are ordered by sequence name. Gridlines highlight the boundaries of eight chromosome-length scaffolds in MedtrR108_hic (right). Scaffolds smaller than 10 kb in R108 v1.0 are not included.
Note the larger values for the longest scaffold, N50, and N90 in MedtrR108_hic than in R108 v1.0. The plots were generated using https://github.com/rjchallis/assembly-stats.
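The scaffold N50 values quoted above follow the usual definition: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal Python sketch of that calculation is given here for orientation; the function name and the toy lengths are illustrative assumptions and are not taken from the assembly-stats tool or from the assembly itself.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of scaffold (or contig) lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no scaffold lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached if every length is zero or negative.
    return ordered[-1]


# Toy lengths (hypothetical, for illustration only).
toy_lengths = [50, 40, 30, 20, 10]
print(n50(toy_lengths))  # prints 40
```

Sorting once and accumulating from the longest scaffold keeps the calculation simple; N90 can be obtained the same way by replacing the one-half threshold with nine-tenths of the total.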

Genome Annotation and Functional Characterization
Reannotation of the MedtrR108_hic genome assembly predicted 39,027 high-confidence, protein-coding genes, which is lower than the 55,706 and 44,623 protein-coding genes annotated in the R108 v1.0 (GenBank accession no. GCA_002024945.1) and A17 Mt5.0 assemblies (GenBank accession no. GCA_003473485.2), respectively [14,20]. However, assessment of gene space completeness via a Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis [23] showed that the MedtrR108_hic assembly harbored a higher percentage of complete BUSCOs (96.73%) than the R108 v1.0 assembly (91.94%) among the 2326 BUSCO groups searched (Table S7). The percentages of fragmented and missing BUSCOs were also lower in the MedtrR108_hic assembly than in the R108 v1.0 assembly. Further, the number of complete BUSCOs (single copy and duplicated) was more comparable between MedtrR108_hic and A17 Mt5.0 than between R108 v1.0 and A17 Mt5.0.
The MedtrR108_hic protein-coding genes were also mined for protein domains and annotated with gene ontology (GO) terms. A total of 29,504 (75.60%) and 19,086 (48.90%) genes had at least one protein domain and GO term assigned to them, respectively (Table S10). All publicly available RNA-Seq accessions used for annotation are presented as supplementary information (Table S11).
Overall, the A17 Mt5.0 versus MedtrR108_hic syntenic genes could be arranged in a smaller number of larger blocks than the A17 Mt5.0 versus R108 v1.0 syntenic genes. A total of 25,548 syntenic genes were identified between A17 Mt5.0 and MedtrR108_hic (Table S12), which could be arranged in 59 collinear blocks. The largest block (no. 54) contained 2574 genes, while the smallest block (no. 49) contained four genes (Table S13). In contrast, 26,348 syntenic genes were identified between A17 Mt5.0 and R108 v1.0 (Table S14), which could be arranged into 121 collinear blocks. The largest block (no. 66) contained 1535 genes, while the smallest block (no. 73) contained four genes (Tables S12-S14).
Of the 25,548 syntenic genes identified in A17 Mt5.0 versus MedtrR108_hic, 2676 (10.47%) were found in the translocated regions (Table S14); of these, 1143 were found in the 12 Mb region, with the remaining 1533 genes in the 17 Mb region (Table S14). Of the 26,348 syntenic genes identified between A17 Mt5.0 and R108 v1.0, 1590 (6.03%) were found in the translocated regions; most of these (1159) were found in the 12 Mb region, while the remaining 431 genes were in the 4 Mb region (Table S14). The GO terms commonly enriched in the translocated regions between A17 Mt5.0 and MedtrR108_hic (Table S15) and A17 Mt5.0 and R108 v1.0 (Table S16) comprised the following stress-response-related terms: response to water deprivation, plant-type hypersensitive response, response to ethylene, response to abscisic acid, and response to jasmonic acid (Table S17).

Mapping Tnt1 Insertion Sites in the M. truncatula R108 Hi-C Genome Assembly
From the 21,741 Tnt1 insertion lines generated in M. truncatula cv. R108, 392,396 FSTs were recovered using TAIL-PCR and Sanger or Illumina sequencing [12]. The average sequence length of these FSTs is 363 bp. To identify the signature sequence in FSTs, we processed all FSTs and obtained 221,275 high-confidence FST sequences. The remaining 171,121 FSTs that lacked the signature sequence likely resulted from AD primer end sequencing. Of the 221,275 FSTs, 202,788 (92%) were successfully mapped to the M. truncatula R108 reference genome MedtrR108_hic with an identity greater than 90% (Table 2). A total of 201,427 Tnt1 insertions were mapped to the eight chromosomes, with an average of 25,178 insertions per chromosome (Table 2). The largest number of Tnt1 insertions (27,902; 12.6%) mapped to chromosome 1 and the smallest (16,433; 7.4%) to chromosome 6 (Table 2). The low number on chromosome 6 is expected, as it is the smallest of the eight chromosomes. In addition, 1361 Tnt1 insertions were mapped onto the unanchored scaffolds. The distribution of Tnt1 insertions across all R108 chromosomes confirms previous results that showed random Tnt1 insertions based on the M. truncatula A17 genome [12]. All Tnt1 insertions mapped to chromosomes in the Hi-C assembly of the R108 genome were plotted by physical chromosome location using Circos genome plots (Figure 4).
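The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the text; the variable names are ours, and the choice of the 221,275 high-confidence FSTs as the denominator for the per-chromosome percentages is an assumption inferred from the quoted values rather than something the text states.

```python
# Counts quoted in the text (Table 2 itself is not reproduced here).
high_confidence_fsts = 221_275
mapped_fsts = 202_788
on_chromosomes = 201_427
on_unanchored = 1_361
n_chromosomes = 8
chr1_insertions = 27_902
chr6_insertions = 16_433


def percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


mapping_rate = percent(mapped_fsts, high_confidence_fsts)
mean_per_chromosome = on_chromosomes / n_chromosomes

# Assumption: per-chromosome percentages are relative to all
# high-confidence FSTs, not only to the mapped ones.
chr1_share = percent(chr1_insertions, high_confidence_fsts)
chr6_share = percent(chr6_insertions, high_confidence_fsts)

print(f"anchored + unanchored = {on_chromosomes + on_unanchored}")
print(f"mapping rate = {mapping_rate:.0f}%")
print(f"mean per chromosome = {mean_per_chromosome:.0f}")
print(f"chr1 share = {chr1_share:.1f}%, chr6 share = {chr6_share:.1f}%")
```

Under that assumption the quoted 92%, 25,178, 12.6%, and 7.4% are all reproduced, and the chromosome and unanchored counts sum to the 202,788 mapped FSTs.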

Comparison of Tnt1 Insertions Using M. truncatula R108 Hi-C or A17 v5.0 Genic Regions and Functional Annotation of Genes with Insertions
In the M. truncatula A17 v5.0 reference genome [14], 44,624 genes were predicted and annotated. Our Hi-C assembly predicted 39,027 genes in M. truncatula R108. From the 202,788 high-confidence FSTs (Table S18), 24,052 genes (61.62%) had exact Tnt1 insertion sites in the R108 Hi-C assembly (Tables S19-S20). A similar percentage of genes (60%; 26,717 genes) had at least one Tnt1 insertion in the M. truncatula A17 v5.0 reference genome (Table S21). A list of the GO annotations analyzed for genes with Tnt1 insertions and the gene groups are summarized in Table S22. In the R108 Hi-C assembly, there were at least 19,008 genes (48.7%) with more than one Tnt1 insertion, contrasting with 18,352 genes (41.12%) in the M. truncatula A17 v5.0 reference genome (Tables S20-S21). There were at least 12,746 genes (32.65%) with at least four Tnt1 insertions in the R108 Hi-C assembly, contrasting with 9949 genes (22.29%) in the M. truncatula A17 v5.0 reference genome (Tables S20-S21). An average of 4.07 Tnt1 insertions per gene was observed in the MedtrR108_hic assembly compared to 4.33 insertions per gene in M. truncatula A17 v5.0 (Tables S20-S21).
The most frequently hit gene when the M. truncatula A17 v5.0 genome was used for analysis is MtrunA17Chr5g0441701 (putative peptidyl prolyl isomerase), with 135 Tnt1 insertions (Table S20), while the two genes with the most Tnt1 insertions when the MedtrR108_hic assembly was used for analysis are MedtrR108_hic. Hi-C_scaffold_8.3452 (Eukaryotic and viral aspartyl proteases active site protein) and MedtrR108_hic. Hi-C_scaffold_2.2064 (RHN73856.1 putative FAS1 domain-containing protein), with 143 and 139 Tnt1 insertions, respectively (Table S21). The genes that did not have insertions were also identified (Tables S20-S21). It is reasonable to assume that Tnt1 in the existing insertion population disrupts the majority of genes in the M. truncatula genome. GO annotation was performed for all genes, both those with frequent insertions and those with less frequent insertions (Table S22).
AgriGO v2.0 [28] was used for GO enrichment analysis of the 7737 frequently inserted genes, defined as genes with more insertions than the average insertion number (i.e., 4.33 insertions per gene). The results showed that these frequently inserted genes fall into the following five pathways: stress, signaling, secondary metabolism, transport, and nucleotide metabolism (Table S22 and Figure S1A). The significant GO terms under biological processes are response to stress, response to stimulus, defense response, protein phosphorylation, and transmembrane transport (Table S22). The significant GO terms under molecular functions are ATP binding, active transmembrane transporter activity, protein tyrosine kinase activity, and transporter activity (Figure S1B). The GO enrichment analysis revealed similar results to the pathway analysis and corresponded with the previously reported data [12].
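The selection rule described above (keep genes whose insertion count exceeds the average number of insertions per gene) can be sketched as follows. This is only an illustration: the function name and gene identifiers are hypothetical, and taking the average over genes with at least one insertion is our assumption, since the text does not say how the average was computed.

```python
from statistics import mean
from typing import Dict, List


def frequently_inserted(insertions_per_gene: Dict[str, int]) -> List[str]:
    """Return genes with more insertions than the per-gene average.

    Assumption: the average is taken over genes that carry at least
    one insertion; genes with zero insertions are ignored.
    """
    hit = {gene: n for gene, n in insertions_per_gene.items() if n > 0}
    if not hit:
        return []
    average = mean(hit.values())
    return sorted(gene for gene, n in hit.items() if n > average)


# Hypothetical counts, for illustration only.
counts = {"geneA": 1, "geneB": 2, "geneC": 9, "geneD": 0}
print(frequently_inserted(counts))  # prints ['geneC']
```

The resulting gene list is what would then be passed to an enrichment tool such as AgriGO; the enrichment step itself is not reproduced here.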

Tnt1 Insertions in Genes in the Syntenic Regions
Syntenic regions between the A17 v5.0 and R108 v1.0 genomes were obtained from the publicly released v1.0 [14]. The syntenic genes between A17 v5.0 and MedtrR108_hic could be arranged into a smaller number of larger blocks than the A17 Mt5.0 versus R108 v1.0 syntenic genes. A total of 25,548 syntenic genes were identified between A17 Mt5.0 and MedtrR108_hic (Table S12), which could be arranged in 59 collinear blocks. The largest block (no. 54) contained 2574 genes, while the smallest block (no. 49) contained four genes. We identified 17,766 genes present in all syntenic blocks combined between A17 and R108 (Table S23). Each of the Tnt1 genic insertions in the syntenic regions and the GO annotation is presented in Table S22. Individual gene numbers from each block are identified and presented as a supplemental table (Table S24). Six syntenic blocks (54, 25, 9, 36, 12, and 31) have more than 1000 genes with Tnt1 insertions (Table S24). The highest number of genes with Tnt1 insertions is in Block 54, with 1787 genes (Tables S23-S24).

Discussion
The MedtrR108_hic assembly is a significant improvement on the R108 v1.0 assembly, with its smaller number of larger scaffolds, higher scaffold N50 value, and improved CEGMA results. While fewer genes were annotated in the Hi-C assembly, the gene content appeared to be more complete than the R108 v1.0 annotation, as reflected in the BLAST and BUSCO results for MedtrR108_hic. The R108 v1.0 assembly [20] was processed through the MAKER-P pipeline [29] for annotation; only ab initio gene predictions from RNA-Seq alignments were used as the source of evidence. In the current study, a combination of ab initio gene predictions from RNA-Seq alignments and evidence from protein homology studies was used for annotation via the BRAKER2 [30] pipeline and EvidenceModeler [31]. The latter tool primarily leverages the ab initio predictions as its source of gene model components, and then leverages the protein and transcript alignment data to guide its choice of best models. Therefore, any ab initio predictions not supported by the protein/transcript alignments are discarded. This strict filtering could explain why we observed a reduction in the number of genes. Additionally, the RNA-Seq libraries used to annotate the Hi-C assembly were derived from root tissue [20] and leaf tissue (data generated in-house). In contrast, the R108 v1.0 assembly was annotated using RNA-Seq data from root tissue only [20]. This could explain why the current annotation is more complete.
The abnormal conformation of chromosomes 4 and 8 in genotype A17 is well-known [14,27]. The smaller number of larger collinear blocks identified between A17 Mt5.0 and MedtrR108_hic, coupled with the larger 17 Mb translocation between A17 Mt5.0 chromosome 4 and MedtrR108_hic chromosome 8, reflects the more contiguous nature of the Hi-C assembly compared with R108 v1.0. Furthermore, the three additional breakpoints (BKPT 2, 3, and 4) that Pecrix et al. identified when comparing A17 Mt5.0 and R108 v1.0 are absent from the Hi-C assembly, which suggests that these breakpoints arose from the more fragmented nature of the R108 v1.0 assembly or from errors in that assembly. Therefore, it is unlikely that these breakpoints represent true structural variations in A17. On the other hand, the inversion seen in A17 Mt5.0 relative to both MedtrR108_hic and R108 v1.0 indicates that this structural variation is real and constitutes a second distinctive structural feature of the A17 genotype. This inversion was also visible when A17 Mt5.0 was compared with the genetic maps of Medicago sativa and Pisum sativum, the species most closely related to M. truncatula [14].
Tnt1 insertion lines have become increasingly popular due to their powerful, versatile applications in forward and reverse genetics. The Tnt1 lines were generated in the R108 background due to its high transformation and regeneration efficiency. The A17 and R108 genomes differ significantly due to their phylogenetic distance [32]. Though most genes in both genomes have high similarity, a significant number of genes have only moderate similarity, which causes ambiguity in determining whether a BLAST search result of a gene with the A17 sequence is a true hit in the Tnt1 FST database. Therefore, a high-quality R108 genome assembly was needed. Compared to the R108 v1.0 genome, the assembly quality of MedtrR108_hic has significantly improved, especially in the syntenic translocation regions, where Tnt1 FST mapping is more accurate in the MedtrR108_hic genome.
Genome-editing technology, especially Clustered Regularly Interspaced Short Palindromic Repeat/CRISPR-associated protein 9 (CRISPR/Cas9) technology, has become more powerful and applicable to many plant species, including M. truncatula. CRISPR/Cas9 is an innovative technology, offering excellent opportunities for plant genetics and functional genomics research. Its advantages include target specificity, effectiveness, precision, and feasibility for multiple genome manipulation options [33]. Accurate plant gene sequences are critical for gene editing. The improved genome editing efficiency in M. truncatula [34] should increase CRISPR/Cas9 technology use. Due to significant differences in the transformation efficiencies between A17 and R108, R108 is the first choice for genome editing practices. The improved genome assembly of R108 provides a solid foundation for future genome editing research in the legume community.

Hi-C Library Preparation and Sequencing
In situ Hi-C was performed as described previously [18] using frozen leaves from Medicago truncatula cv. R108. Briefly, frozen leaf tissue was crosslinked, ground, and then lysed, leaving nuclei permeabilized but still intact. DNA was then digested with the MboI restriction enzyme, and the overhangs were filled in, incorporating a biotinylated base. Free ends were then ligated together in situ. Crosslinks were reversed, the DNA was sheared to 300-500 bp, and the biotinylated ligation junctions were recovered with streptavidin beads.
Standard Illumina library construction protocol was used for DNA sequencing. Briefly, DNA was end-repaired using a combination of T4 DNA polymerase, Escherichia coli DNA Pol I large fragment (Klenow polymerase), and T4 polynucleotide kinase. The blunt, phosphorylated ends were treated with Klenow fragment (3' to 5' exo minus) and dATP to yield a protruding 3' 'A' base for ligation of Illumina's adapters, which have a single 'T' base overhang at the 3' end. After adapter ligation, DNA was PCR amplified with Illumina primers for 14 cycles and library fragments of 400-600 bp (insert plus adaptor and PCR primer sequences) were purified using SPRI beads. The purified DNA was captured on an Illumina flow cell for cluster generation. Libraries were sequenced on the NextSeq500 following the manufacturer's protocols. The same R108 lineage used for generating Tnt1 insertion lines was used for Hi-C. The resulting library was sequenced to yield approximately 48× coverage of the M. truncatula genome.

Genome Assembly
The Hi-C library was processed against the R108 v1.0 genome assembly [20] using the Juicer pipeline [35]. The assembly was performed as described [19,21]. Briefly, after excluding scaffolds shorter than 1 Kb, the 3D De Novo Assembly (3D-DNA) pipeline was run using the in situ Hi-C data to anchor, order, orient, and correct misjoins in the R108 v1.0 scaffolds. Lastly, a manual refinement step was performed using Juicebox Assembly Tools [21]. The resulting contact maps were visualized using the 3D-DNA and Juicebox visualization system [19,21,36].
Finally, EvidenceModeler [31] v1.1.1 was used to combine the gene predictions from GeneMark-ET and AUGUSTUS, protein alignments from Exonerate, and the assembled transcripts from StringTie to obtain the final gene set.

Assessment of Genome Assembly and Annotation Quality
Assessment of the Hi-C genome assembly quality and completeness was performed via CEGMA [22] v2.5 to identify the presence of core eukaryotic genes (CEGs). BUSCO [23] v4.14 was run using the eudicotyledons_odb10 dataset in protein mode to evaluate the annotation quality.

Mapping of Tnt1 Insertion Lines and Functional Gene Group Analysis
To accurately identify Tnt1 insertion sites in the M. truncatula genome, we discarded all FST sequences that were shorter than 50 bp, that lacked the Tnt1 signature sequence ('CCCAACA', 'CATCATCA', or 'TGATGATGTCC'), or in which the Tnt1 signature sequence lay outside the 28 bp at the beginning or end of the FST sequence. The preprocessed reliable FST sequences were aligned to the M. truncatula A17 version 4 (Mt4.0) or version 5 (Mt5.0) and R108 Hi-C assembly reference genomes using BLASTN with an e-value threshold of ≤1.00 × 10⁻⁵. The FST sequences with the best hit from the BLAST analysis were further processed for downstream analysis. Only hits with at least 90% sequence identity were considered and used for functional gene group analysis. Functional gene group analysis was performed as described elsewhere [12].
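The filtering steps in this section can be sketched in Python as shown here. This is a minimal illustration under several stated assumptions, not the pipeline actually used: the function names are hypothetical; the second signature is read as CATCATCA (treating the hyphen in the extracted text as a line-break artifact); a signature counts as terminal when it lies wholly within the first or last 28 bases; BLASTN results are assumed to be in the standard 12-column tabular layout (BLAST+ `-outfmt 6`); and the best hit is taken to be the one with the highest bit score.

```python
from typing import Dict, Iterable, Tuple

SIGNATURES = ("CCCAACA", "CATCATCA", "TGATGATGTCC")
MIN_LENGTH = 50  # FSTs shorter than this are discarded
WINDOW = 28      # signature must lie within this many terminal bases

Hit = Tuple[str, float, float, float]  # subject, identity, e-value, bit score


def has_terminal_signature(seq: str) -> bool:
    """True if a signature lies wholly within the first or last WINDOW bases."""
    seq = seq.upper()
    head, tail = seq[:WINDOW], seq[-WINDOW:]
    return any(sig in head or sig in tail for sig in SIGNATURES)


def keep_fst(seq: str) -> bool:
    """Apply the length and signature-position filters to one FST."""
    return len(seq) >= MIN_LENGTH and has_terminal_signature(seq)


def best_hits(tabular_lines: Iterable[str],
              max_evalue: float = 1e-5,
              min_identity: float = 90.0) -> Dict[str, Hit]:
    """Keep the best hit per query, then drop hits below min_identity.

    Columns assumed: qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore.
    """
    best: Dict[str, Hit] = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        # Assumption: "best hit" means the highest bit score.
        if query not in best or bitscore > best[query][3]:
            best[query] = (subject, identity, evalue, bitscore)
    return {q: hit for q, hit in best.items() if hit[1] >= min_identity}
```

Applying the identity cut-off after choosing the best hit mirrors the order in which the text lists the steps; filtering on identity first could retain a weaker secondary hit for some FSTs, so the order is a design choice worth stating explicitly.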

Conclusions
Using in situ Hi-C data, we improved the M. truncatula cv. R108 genome assembly by correcting misjoins and ordering and orienting scaffolds to generate eight chromosome-length scaffolds that correspond to the eight chromosomes in the A17 reference genome. Compared to the previous version (v1.0) of the R108 genome, the newly assembled MedtrR108_hic genome is a significant improvement due to its smaller number of larger scaffolds, higher scaffold N50 value, and improved CEGMA results. MedtrR108_hic also provides insight into how to accurately predict syntenies in the chromosome 4/8 translocation regions between A17 and R108. Furthermore, mapping the Tnt1 insertion landscape onto the current reference assembly provides a much-needed foundational resource for functional genomics studies in the legume community.

Patents
O.D., M.P., C.L., and E.L.A. are inventors on U.S. provisional patent application 62/347,605 filed 8 June 2016, by the Baylor College of Medicine and the Broad Institute, relating to the assembly methods in this manuscript.