Studying the Genetic Diversity of Yam Bean Using a New Draft Genome Assembly

: Yam bean ( Pachyrhizus erosus Rich. Ex DC. ) is an underutilized leguminous crop which has been used as a food source across central America and Asia. It is adapted to a range of environ ‐ ments and is closely related to major leguminous food crops, offering the potential to understand the genetic basis of environmental adaptation, and it may be used as a source of novel genes and alleles for the improvement of other legumes. Here, we assembled a draft genome of P. erosus of 460 Mbp in size containing 37,886 gene models. We used this assembly to compare three cultivars each of P. erosus and the closely related P. tuberosus and identified 10,187,899 candidate single nucleotide polymorphisms (SNPs). The SNP distribution reflects the geographic origin and morphology of the individuals.


Introduction
Yam bean (Pachyrhizus erosus Rich. Ex DC.) is a diploid leguminous crop grown for its starchy tuberous root, which is high in vitamins and minerals, particularly vitamin C, iron, zinc and potassium [1]. The genus Pachyrhizus belongs to the subtribe Glycininae, which includes the major legume crop, soybean (Glycine max) [2]. Yam bean is a regionally important crop in Mexico and Southeast Asia, where it is popular in traditional dishes or eaten raw. It produces a high yield in a small area of farmland (Mexico produces ~160 tons per hectare) and flourishes in humid conditions [3].
Yam bean has not been extensively studied, however, it has significant potential for improvement, both as a durable leguminous food crop for major agro-ecologies [4], as well as a potential donor of traits to related major legumes, such as insect resistance and abiotic stress tolerance [5]. Three species of yam bean have been cultivated for human consumption; Mexican yam bean (P. erosus), Andean yam bean (P. ahipa) and Amazonian yam bean (P. tuberosus). These three species can intercross to produce fertile, interspecific hybrids [6].
P. tuberosus and P. erosus are the most widely consumed species of yam bean and each of these has diverse morphologies within the species and the genus. P. tuberosus plants are generally larger than other Pachyrhizus species, with plump and kidney-shaped seeds, and wing and keel petals covered in minute hairs [7]. The skin colour of the root of P. tuberosus varies from whitish cream to brownish grey depending on the cultivar, and can have elongated buddings or more round offshoots. In contrast, P. erosus has petals that are smooth and free from hair, with flat and square seeds and pods that are either covered in short, stiff hairs or smooth at maturity [8]. P. erosus is primarily beige in colour and different cultivars have varying shapes, ranging from smooth and circular to more irregular, flower-shaped base (Supplementary Figure S1).
An early study on the relationship between cultivated Pachyrhizus taxa used restriction fragment analysis on chloroplast DNA (cpDNA) and Random Amplified Polymorphic DNA (RAPD) molecular markers [9]. This established the separation of the genus into two evolutionary branches, Mesoamerican and South American, reflecting the pattern of the species' distribution. Phylogenies constructed from sequence variants of the internal transcribed spacer (ITS) region of ribosomal DNA supported the cpDNA phylogeny [9]. These phylogenies also suggest that P. panamensis, P. tuberosus and P. erosus may have originated from rapid radiation of a continuously distributed ancestor, diverging due to varying climates, domestication and human selection [9]. P. tuberosus was found to be likely ancestral to P. erosus as a separate lineage, although this was not conclusive [9].
More recently, Delêtre et al. [10] developed simple sequence repeat (SSR) loci to investigate the intraspecific diversity and interspecific relationships between members of the Pachyrhizus genus and found loci across all yam bean species that showed high levels of polymorphism. The markers were able to distinguish varietal groups within each species and showed P. ahipa possessed a relatively low level of genetic variability. Santayana et al. [11] examined correlations between environmental factors and phylogeographic diversity patterns. The study also identified two distinct lineages between the Andean and Amazonian landraces using cpDNA sequences and SSR molecular markers, and the authors came to similar conclusions as Englemann et al. [9] that in Pachyrhizus, climate and genotypes are closely correlated.
In addition to the molecular analyses, Zanklan et al. [12] performed a genetic diversity study in cultivated yam bean using quantitative and qualitative physical traits, including size, weight and yield of various parts of the plant. The study documented the percentage of variation in each of the plants' features, such as storage root, seeds, pods, flowers, stems and leaves, linking some of the observed diversity to geographical origin. Overall, the study identified a high level of diversity on the intraspecific scale but overall lower diversity on the interspecific level. Wide differentiation was found among accessions and lines between the three tested yam bean species. The close relationship among P. erosus, P. tuberosus and P. ahipa indicated that only eight highly heritable characters are needed to describe the diversity within yam bean.
Here, we describe the first published draft genome assembly of Pachyrhizus erosus, one of the Mesoamerican yam bean species and show how it can be used to infer phylogenetic relationships. We re-sequenced two species of yam bean representing each one of the two evolutionary branches [9], P. erosus and P. tuberosus (representing the South American branch) and compared three cultivars of each. We also compared the gene content with related legumes. The assembly produced can be used as a foundation for future work and our results can be used as a stepping stone for describing the genetic content of yam bean.

Plant Material
P. erosus for the reference assembly was grown from seeds obtained from Eden Seeds

DNA Extraction and Sequencing
Leaf tissue was collected and flash frozen for DNA extraction. DNA was extracted from only one plant for the reference as well one plant for each of the 3 cultivars from the CIP P. erosus and P. tuberosus cultivars using the Qiagen DNeasy Plant kit (Qiagen, Hilden, Germany). The DNA concentration was determined with the Qubit fluorometer (Invitrogen, Carlsbad, CA, USA). DNA quality was confirmed using the LabChip GX Touch (PerkinElmer, Waltham, MA, USA) using the HT DNA gDNA reagents and the LabChip GX Touch (PerkinElmer, Waltham, MA, USA) while DNA purity was confirmed using the NanoDrop 1000 (Thermo Fisher Scientific, Waltham, MA 02451, USA).
Library preparation and sequencing of the yam bean reference was carried out by the Kinghorn Centre for Clinical Genomics Sequencing Facility (Darlinghurst, NSW, Australia). Sequencing libraries were prepared using the TruSeq Nano DNA HT Library Preparation Kit (

Data Generation and Validation
The yam bean reference genome was assembled using MaSuRCA v3.2.2 (Maryland Super-Read Celera Assembler) (University of Maryland, MD, USA) [13] and using a Jellyfish hash size of 50,000,000 with a mean insert size of 400 bp and a standard deviation of 100 bp [14]. The data were assembled using a k-mer size of 31 [15]. A custom script was then used to remove contigs that were shorter than 1 kbp. GenomeScope V.2 (Cold Spring Harbor Laboratory, NY, USA) was used to predict the size of the assembly with a k-mer size of 31 bp and an expected coverage of 230× [16].
BUSCO v4 (Benchmarking Universal Single-Copy Orthologs) (SIB, Lausanne, Switzerland) was used to benchmark the quality of the assemblies by calculating the number of orthologues present [17]. The provided script, run_BUSCO.py, was used to compare the genome assembly with the BUSCO plant reference, embryophyta_odb9 using the -m geno flags. Results were visualised using the generate_plot.py script with the -wd flags.

Annotation
AUGUSTUS [18] was used to annotate the genome using the Arabidopsis extrinsic detection files from the program Exonerate [19]. A custom script, peptideripper.pl, made a peptide file from this annotation, which was processed through OrthoFinder [20]. The custom script, orthogroup_filter.py was used on the produced file, Orthogroups.tsv to obtain the designations of the high-quality genes. The high-quality genes were extracted from the orthogroup sequences using awk and compared with the UniRef database [21] using diamond BLASTp with an E-value of 0.05 [22].
OrthoFinder v2.2.7 was used to compare the yam bean annotation with other legume annotations [20]. The orthogroup file was filtered to remove the orthogroups that only contained genes from yam bean, and UpSetR [24] was used to visualise the orthogroup size with respect to species. A randomly chosen orthogroup containing only single copy orthologs was aligned with MUSCLE (multiple sequence comparison by log-expectation) (Mill Valley, CA 94941, USA) [25] and clustered using the pal2nal.pl script. The file was converted into a Phylogenetic Analysis by Maximum Likelihood (PAML) file and displayed using the interactive Tree Of Life (iTOL) program [26] (Figure 1).

Figure 1. A phylogeny tree based on orthologous protein groups identified in OrthoFinder. Yam bean (Pachyrhizus erosus)
is highlighted in green. Out of the legume species selected, yam bean is predicted to be most closely related to Glycine max.
The tools in the package Bcftools were used to perform variant calling (standard settings). Mpileup was used to produce a BCF file with all locations in the genome to call genotypes and produce a list of variant sites. The produced statistics were visualised using the provided script plot-vcfstats. A phylogeny tree was produced from the bedtools vcf file using the R programs gdsfmt and SNPRelate [31] (Figure 2).

Results and Discussion
A total of 145 Gbp of Illumina HiseqX sequencing data were generated for the reference. We estimated the Pachyrhizus erosus genome size to be 560 Mbp based on a k-mer size of 31. This is similar to the 550 Mbp predicted based on flow cytometry measurements [32]. Assembly of the Illumina sequence data resulted in contigs totalling 460 Mbp, with an N50 of 18,412, representing 83.6% the predicted genome size (Supplementary Table  S1). BUSCO analysis identified 1329 (94.9%) of the 1440 embryophyta_obdb9 plant orthologues in the assembly (Supplementary Table S2), suggesting that while the assembly is fragmented, it contains a similar level of completeness as comparable legume assemblies such as pea [33], Glycine max [34], Medicago sativa [35] and Vigna subterranea [36]. Annotation of the assembly predicted 37,886 genes, similar to other related diploid legume genomes, and a total of 27,575 (72.78%) of the encoded proteins share sequence identity with predicted proteins from other legumes.
To assess diversity within and between Pachyrhizus species, we generated a total of 17 Gbp of sequence data each for three P. erosus individuals, Pe-CIP-209016, Pe-CIP-209046 and Pe-CIP-209051 originating from Guatemala, Costa Rica and Mexico respectively, as well as three Peruvian P. tuberosus individuals, Pt-CIP-209013, Pt-CIP-209014 and Pt-CIP-209015 (Table 1). Mapping this data to our reference assembly identified 10,187,899 single nucleotide polymorphisms (SNPs) across the genome. Pachyrhizus is within the Millettioid/Phaseoloid clade from the Faboideae subfamily of the Phaseoleae tribe, which also encompasses members of the genera Glycine, Cicer, Cajanus, Phaseolus and Vigna [37]. We generated a phylogenetic tree comparing the yam bean peptide sequences to other Phaseoleae species (Glycine max, Glycine tomentella, Glycine dolichocarpa, Glycine sydetika, Glycine soja, Glycine gracilis, Phaseolus vulgaris, Vigna unguiculate, Lupinus angustifolious and Sphenostylis stenocarpa). The tree supports Glycine max (and, by extension, the Glycine family) as being the closest major crop relative to yam bean, which had also been shown using chloroplast sequences [38] (Figure 1). A second tree was constructed based on k-mer sketches comparing yam bean to related leguminous species ( Figure 3) and further supported this relationship. This corresponds with previous analyses based on plastid genome and DNA sequences, which supports the assertion that Pachyrhizus is closely related to Glycine [39]. A strict consensus cladogram generated from 7006 equally parsimonious Wagner cpDNA trees placed Pachyrhizus, then in the subtribe Diocleinae, as closely related to the subtribe Glycininae [40] and suggested that it should be placed into the Glycininae subtribe, which is supported by Polhill's [41] cpDNA restriction study. Maximum likelihood phylogeny trees of legume species generated from 71 protein-coding genes also inferred that Pachyrhizus was closely related to Glycine, despite not being monophyletic with it [42]. The SNP cluster tree distinctly grouped the six individuals by the species of yam bean ( Figure 2). SNP-based dissimilarity suggests that the two yam bean species have small but distinct differences. P. tuberosus and its relative P. ahipa are the only two South Americanoriginating species in the Pachyrhizus genus [43]. The clustering performed is similar to the results of Tapia and Sørensen [43], showing that P. tuberosus has been cultivated for thousands of years in South America, with accessions demonstrating little morphological variation.
There is still uncertainty regarding the origin of current landraces and cultivars of P. erosus in Mexico and Guatemala, as P. erosus has no clear subgroups related to its geographical origin [9,44]. Central American and Mexican landraces of P. erosus were cultivated by the Mayans and Aztecs but appear to have different origins based on preliminary molecular analyses [5]. Our analysis suggests that within P. erosus, Pe-CIP-209016 could have diverged from Pe-CIP-209046 and Pe-CIP-209051 ( Figure 2).
Englemann [9] examined cpDNA genetic marker variation within and between Pachyrhizus taxa. The consensus trees suggested that P. tuberosus was involved in the early parentage of P. erosus, likely stemming from an early lineage of P. tuberosus. However, the results are inconsistent with an analysis of the cpDNA and nuclear internal transcribed spacer (ITS) sequence variation [9]. These consensus trees were not supported by our analysis, which does not suggest a direct origin of P. erosus from P. tuberosus (Figure 2).
Wild P. erosus species are found across a wide range of geographical habitats, including wet forests (Mexico), deciduous forests (Guatemala) and dry savannahs (Costa Rica) [9]. The geographic distribution of the P. erosus individuals was reflected in the SNP distribution. Pe-CIP-209046 (Costa Rica) was found to be the most evolutionarily distant of the six yam bean individuals examined, and it was the only species to come from a dry, arid area. However, this was not reflected in the phylogenetic tree ( Figure 3). A potential reason for this is that Pe-CIP-209046 is from Engelmann's [9] proposed Central American groups, where accessions are exclusively from Central America, and has SNPs shared with individuals also stemming from the group. Pe-CIP-209016 is likely from an accession that has both Central American and Mexican origins, making it distant from Pe-CIP-209046 and Pe-CIP-209051. Additional data may help to estimate the extent of the divergence between P. erosus from Costa Rica and P. erosus originating elsewhere.
Domestication of P. tuberosus first took place in the Peruvian Andes, and has a long history of being cultivated in various locations in South America, including Paraguay and Bolivia [5,7]. The Pachyrhizus genus demonstrates high levels of diversity within the species [5,9,12]. The P. tuberosus individuals in this study have been determined to be a more genetically similar group than the P. erosus individuals based on their SNP distribution.
The genetic and phenotypic diversity within P. erosus could be due to geographic isolation and domestication, which is in line with previous studies showing that yam bean genotypes are associated with their respective environment [5,6,10]. P. tuberosus and P. erosus also have genotype-by-environment interactions associated with fresh storage root yield [45] and crop yield [46]. Engelmann's work [9], shows that P. erosus accessions could be divided into four groups based on RAPD analysis: a group of P. erosus that originated from Central America only; two groups that are from Central America and Mexico; and a group that was exclusively made up of accessions from Mexico.
In conclusion, the P. erosus assembly has allowed us to infer phylogenetic links within the Pachyrhizus species and other legumes. The assembly has shown that P. erosus is more genetically distinct than P. tuberosus, and adds support to the Glycine family being the closest major crop relative to the Pachyrhizus species. Our analysis suggests that P. erosus from Costa Rica is not as evolutionarily distinct from other P. erosus species as expected, although more data are needed to support this conclusion.
Pachyrhizus species are mostly self-pollinating, but some crossbreeding does occur (2-4%), depending on the availability of pollinators [5]. While some research has been carried out on its genetic traits, more data are needed to further investigate the evolutionary origins of the plant and to identify genetic loci associated with adaptation and agronomic traits. Here, we have assembled the first draft genome of P. erosus and used it to study the phylogeny of geographically diverse Pachyrhizus lines. We determined the location of the Pachyrhizus genus in relation to other legumes and how geography affects the SNP distribution. We also annotated 37,886 genes and over 10 million SNPs. This draft genome assembly can be used as a foundation for future yam bean genomic analyses of this underutilised crop.
Supplementary Materials: The following are available online at www.mdpi.com/2073-4395/11/5/953/s1, Figure S1: Yam bean root images including the origin and accession numbers of each cultivar, which were grown and investigated at the University of Western Australia. S1) P. tuberosus (Pt-CIP-209013) from Peru. S2) P. tuberosus (Pt-CIP-209014) from Peru. S3) P. tuberosus (Pt-CIP-209015) from Peru. S4) P. erosus (Pe-CIP-209016) from Guatemala. S5) P. erosus (Pe-CIP-209046) from Cartago Costa Rica. S6) P. erosus (Pe-CIP-209051) from Mexico,  Funding: This work was undertaken with the assistance of resources provided at the Pawsey Supercomputing Centre and was partially funded by the Australia Research Council (Projects DP200100762 and LP160100030). C.T.F. was supported by an IPRS awarded by the Australian government. K.P. was supported by an Endeavour Research Fellowship, Australia Awards, Australian Government Department of Education and Training.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Raw sequencing reads are available from BIOPROJ. The assembly and annotation is available from http://www.appliedbioinformatics.com.au/index.php/Yam_bean_assembly (accessed on 1 May 2021).