Genotyping-by-Sequencing to Unlock Genetic Diversity and Population Structure in White Yam ( Dioscorea rotundata Poir.)

: White yam ( Dioscorea rotundata Poir.) is one of the most important tuber crops in West Africa, where it is indigenous and represents the largest repository of biodiversity through several years of domestication, production, consumption, and trade. In this study, the genotyping-by-sequencing (GBS) approach was used to sequence 814 genotypes consisting of genebank landraces, breeding lines, and market varieties to understand the level of genetic diversity and pattern of the population structure among them. The genetic diversity among di ﬀ erent genotypes was assessed using three complementary clustering methods, the model-based admixture, discriminant analysis of principal components (DAPC), and phylogenetic tree. ADMIXTURE analysis revealed an optimum number of four groups that matched with the number of clusters obtained through phylogenetic tree. Clustering results obtained from ADMIXTURE analysis were further validated using DAPC-based clustering. Analysis of molecular variance (AMOVA) revealed high genetic diversity (96%) within each genetic group. A network analysis was further carried out to depict the genetic relationships among the three genetic groups (breeding lines, genebank landraces, and market varieties) used in the study. This study showed that the use of advanced sequencing techniques such as GBS coupled with statistical analysis is a robust method for assessing genetic diversity and population structure in a complex crop such as white yam.


Introduction
White yam or white guinea yam (Dioscorea roundata Poir.) is one of the most important staple tuber crops of West Africa [1], which is strongly associated with the food security, income, and social culture of >300 million people of this region, having a net value internationally of ≈$15 billion [2]. It belongs to the section Enantiophyllum and Dioscoreaceae family consisting of approximately 600 species distributed in the tropical and subtropical regions of the world [3]. D. rotundata is an allogamous, polyploid, dioecious species with a basic chromosome number of n = x = 20. The 'yam belt' consisting of six countries in West and Central Africa including Nigeria, Ghana, Bénin, Côte d'Ivoire, Togo,

Plant Materials
A total of 814 genotypes were used in this study that included 473 genebank landraces selected from the revised yam core collection [36], 314 breeding lines, and 27 popular market varieties or landraces collected from different markets across Nigeria and Ghana (Table S1). The genebank landraces and breeding lines were collected from the Genetic Resources Center (GRC) and from the Yam Breeding Unit (YBU) of the International Institute of Tropical Agriculture (IITA), Ibadan, Nigeria, respectively. Of the 462 genebank landraces collected and conserved at the Genetic Resources Center at IITA-Ibadan, Nigeria, 198 were from Nigeria and 166 were from Togo, while other countries were Benin (32), Ghana (29), Cote d'Ivoire (27), Guinea (7), and 1 accession each were from Burkina Faso, Equatorial Guinea, and Sierra Leone (Table S1).

DNA Extraction and GBS
Genomic DNA was extracted from 100 mg of fresh young leaves using the Qiagen DNeasy Pant Mini kit (Qiagen, Germantown, MD, USA) following the manufacturer's protocol. A Nanodrop 8000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) was used to measure the quality of the DNA by comparing the 260 and 280 nm absorptions. DNA samples were further quantified using the Quant-iT™ PicoGreen®dsDNA assay kit (Invtrogen, Carisbad, CA, USA) and diluted to 50 ng/µL with 1 × TE buffer. About 30 µL of DNA for each genotype was sent in 96-well plates to the Institute for Genomic Diversity (IGD) at Cornell University, Ithaca, New York, where GBS was done using a 96-plex Pst I GBS protocol [29]. In brief, for each library, purified genomic DNA was first digested with the restriction enzyme PstI (New England Biolabs, Whitby, ON, Canada), and the ligation of customized adapters (barcodes) with T4 ligase was subsequently carried out. This was followed by PCR with flow-cell attachment site tagged primers. Single-end sequencing was performed using an Illumina HiSeq2000 (Illumina Inc. San Diego, CA, USA).

Processing of Illumina Raw Sequence Read Data, SNP Calling, and Filtering
The raw sequencing reads (of read length 1 × 100 bp) containing the barcode were sorted, de-multiplexed, and trimmed to the first 64 bases starting from the enzyme cut site. All the reads containing 'N' within the first 64 bases and tags with less than 64 bases were removed. We used the first draft of the Dioscorea rotundata reference genome [37] and Bowtie2 [38] to align the sequencing reads. We implemented the best practices of the GATK pipeline to call SNPs [39] GATK 2.4 version and GATK-UnifiedGenotyper were used in this study for the SNP calling. We used multi-sample variant calling by GATK-UnifiedGenotyper considering the large number of samples involved in the study. The SAM files generated from the alignment were converted to BAM format and sorted by name using SAMtools. The final variant calling was generated through GATK (2.4) (using HaplotypeCaller in the gVCF mode) and joint genotyping (using GenotypeGVCFs). The VCF file developed was filtered for using criteria of MAF (minor allele frequency) >0.05 and missing data >80% both at the genotypes and SNP markers level. Only bi-allelic SNP markers with genotype quality >20 and read depth >5 were retained after using Vcftools v.0.1.12b [40] and PLINK v1.07 [41] for filtering. The resulting SNPs were subjected to linkage disequilibrium (LD) and SNP markers in LD were removed, and a total of 3432 SNP markers were retained for all subsequent analysis.

Population Structure, Genetic Diversity, and Relationships
The filtered SNP data were used to assess the population structure and genetic diversity in white yam. A set of parametric and non-parametric methods including the model-based maximum likelihood estimation of ancestral subpopulations using ADMIXTURE [42], assumption-free discriminant analysis of principal components (DAPC) [43,44] and fixation index (Fst)-based population differentiation were used for analysis. Genetic diversity analysis was carried out using parameters such as minor allele frequency (MAF), polymorphic information content (PIC), expected heterozygosity (He), and observed heterozygosity (Ho) using R [45]. Additionally, genetic diversity was assessed by calculating Shannon-Weaver (H'), Simpson, inverse Simpson, and Pielou's evenness indices [46] for the three genetic categories (genebank landraces, breeding lines, and market varieties) using Vegan package in R [47].
A pairwise identity by state (IBS) genetic distance matrix was calculated from 3432 SNP markers using PLINK, and a critical distance threshold [48] was used to declare whether two genotypes are identical based on the pairwise distances between them. Then, the dissimilarity matrix was used to construct the network relationships among the three genetic groups using QGRAPH [49] implemented in R. The aim was to detect important key nodes representing genetic relationships between two genetic groups (breeding lines-market varieties, genebank landraces-breeding lines, and genebank landraces-market varieties) in the network. The results from ADMIXTURE was Agronomy 2020, 10, 1437 4 of 16 complemented with DAPC analysis using the R package 'adegenet' [50] in a two-step process. First, the optimal number of clusters was inferred using k-means analysis [51,52] of PCA (principal component analysis)-transformed genome-wide SNP data by varying the possible number of clusters from one to 100. Bayesian Information Criterion (BIC) was used to assess the best supported model as well as the number and nature of clusters. DAPC scatter plots were later developed on the clusters identified through k-means using the first 50 principal components. The information based on ADMIXTURE analysis was used to determine the most appropriate K, and accessions with membership proportions (Q-value) ≥70% were assigned to groups, while those with membership probabilities less than 70% were declared as admixtures. Then, the results from DAPC analysis and ADMIXTURE were compared. The coefficient of genetic differentiation among the population was calculated based on pairwise Fst (fixation index) to estimate the genetic distance and the relationships among the three genetic groups of D. rotundata used in the study. Analysis of molecular variance (AMOVA) was also conducted to assess the population differentiation among the three genetic groups used in the study. Then, a phylogenetic tree was built by following the procedure of IBS with 1000 bootstrap replicates in PowerMarker v3.25 [53]. The resulting tree was visualized in Molecular Evolutionary Genetics Analysis (MEGA) software version X [54].

SNP Summary
The FastQ files of the generated sequences were aligned to the D. rotundata reference genome [37], of which 43.4% tags aligned uniquely to the reference, 9.8% aligned to multiple positions, and 46.8% did not successfully align. Uniquely aligned tags were used for calculating the distribution of tag density at each position in D. rotundata genome and for SNP distribution. A total of 137,800 unfiltered SNPs was detected as raw SNP markers. A total of 3432 filtered SNPs were obtained and distributed across the twenty-one pseudo chromosomes of D. rotundata (Table 1; Figure 1). The genome-wide SNP density plot (Figure 1) revealed that the highest number of SNPs was in chromosome 5 (11.1%, 380 SNPs), while the lowest number of SNPs was mapped in chromosome 11 (2.6%, 90 SNPs). Then, the transition and transversion SNPs were calculated. Transition SNPs (66.9%, 2296 SNPs) were more frequent than transversions (33.1%, 1136 SNPs). The C/T transitions (34.3%) accounted for the highest frequency, while C/G transversions (5.1%) occurred at the lowest frequency among all the 3432 SNPs ( Figure 2). The average PIC value across all the markers was 0.135, while the observed heterozygosity ranged from 0.138 to 0.190 with an average of 0.165 (Table 1). The expected heterozygosity ranged between 0.133 and 0.190, and the mean was 0.161. Similarly, the minor allele frequency (MAF) ranged between 0.090 and 0.133 with an average of 0.111.

Population Structure and Genetic Diversity
Based on missing data (>20%), 11 genebank landraces of geographical origin of Nigeria were removed from further analysis. Among the 803 accessions analyzed, 314 were breeding lines that were generated through open pollination and bi-parental crossing over several years by the Yam Breeding Unit at IITA-Ibadan, Nigeria. The details are available in Table S1. The majority of the bi-parental crosses were made using a limited number of parental lines (see Table S1), and this could be attributed to the flowering behavior (dioecious, shy female flowering, limited seed set, among others) in white yam [56]. Being clonally propagated, the progenies of bi-parental crosses in white yam represented a segregating population (F2) and were genetically different from each other. This is evident from the grouping of progenies derived from same bi-parental crosses in different clusters (Table S1).
The genetic distance among 803 genotypes varied between 0 and 0.27 (Table S2). Based on the genetic estimation, two accessions were considered identical or representative of the same clone if their pairwise genetic distance was lower than 0.02. Based on this criterion, a total of 767 unique genotypes were recorded. To understand the pattern of population structure, a Bayesian Information Criterion (BIC) and complementary coordination analysis by DAPC were performed. The BIC results suggested the best clustering at K = 2 (with a probability of cluster membership assignment of 100) based on delta K values ( Figure 3A, Table S3). Clusters 1 and 2 consisted of 739 (309 breeding lines, 403 genebank landraces, and 27 market varieties) and 64 genotypes (five breeding lines and 59 genebank landraces) ( Figure 3B, Table S3), respectively. DAPC analysis was further carried out to assess the subclusters at K = 3 ( Figure 3C), K = 4 ( Figure 3D), K = 5 ( Figure 3E), and K = 12 ( Figure 3F). The summary of DAPC cluster grouping and probability of cluster membership assignment of genotypes at K = 2, 3, 4, 5, and 12 is presented in Table S2. Based on the probability of cluster membership assignment, DAPC clusters both at K = 2 and K = 3 represented a good fit. At K = 3, Cluster 1 consisted of 67 genotypes including five breeding lines, 59 genebank landraces majorly representing two countries such as Nigeria (28 landraces) and Togo (21 landraces), and three market varieties (Tables S1 and S3). Cluster 3 was the largest, consisting of 654 genotypes (genebank landraces: 372; breeding lines: 282) while Cluster 2 consisted of 82 genotypes representing 31 genebank landraces, 27 breeding lines, and 24 market varieties (Table S3). A comparative analysis of genetic diversity among the three genetic groups (genebank landraces, breeding lines, and market varieties) revealed that the PIC (0.141), Ho (0.173), and He (0.168) values were relatively higher for genebank landraces while they were the lowest for the market varieties (Table 2). Among the different genetic groups of D. rotundata, the Shannon-Weaver index and Simpson's index were the highest for genebank landraces, while the highest Pielou's evenness value was for the market varieties (Table 2).
A significant level of population divergence based on pairwise Fst (p < 0.0001) was also observed between different genetic groups (breeding lines, genebank landraces, and market varieties, while strong genetic relationships were observed within each group ( Table 3). The Fst-based population differentiation was highest among breeding lines and genebank landraces (0.038), and it was the minimum between genebank landraces and market varieties (0.024). The AMOVA analysis revealed that the variability was divided into 96% within genetic groups and 4% between the three genetic groups (Table 4).    To elucidate the clustering of 803 genotypes using the ADMIXTURE program, a varying number of subpopulations from K = 2 to 50 was plotted ( Figure S1). This resulted in the most appropriate number of subpopulations at K = 2, 3, 4, and 10. Figure 4 represented the genetic relationships among 803 genotypes as the estimated ancestries (Q) from ADMIXTURE analysis represented as barplot at K = 2, 3, 4, and 10. The summary of ADMIXTURE cluster composition at K = 2, 3, 4, and 10 is presented in Table S4. At K = 2, two major clusters were obtained consisting of 735 genotypes in Cluster 1 (Red) and 59 genotypes in Cluster 2 (Green) (Figure 4). After assessing the number of subpopulations (K) from 2, 3, 4, and 10, the most appropriate number was found to be K = 4 and K = 10, which produced the lowest cross-validation error compared to other K values (Figure 4 and Figure S1). At K = 4, 119 genotypes were found to be admixed consisting of 29 breeding lines, 82 landraces, and 8 market varieties. Similarly, at K = 10, the majority of the market varieties (17) were found to be admixed (Table S4).
Agronomy 2020, 10, x FOR PEER REVIEW 9 of 16 varieties. Similarly, at K = 10, the majority of the market varieties (17) were found to be admixed (Table S4). The results of ADMIXTURE-based clustering at K = 4 was strongly supported by the topology of the distance-based phylogenetic tree ( Figure S2). A major difference between the results of DAPC and ADMIXTURE clustering was the tendency of DAPC analysis to assign all genotypes to a single cluster compared to ADMIXTURE, which assigned admixed genotypes to multiple clusters based on K values (Tables S2 and S4). The results of ADMIXTURE-based clustering at K = 4 was strongly supported by the topology of the distance-based phylogenetic tree ( Figure S2). A major difference between the results of DAPC and ADMIXTURE clustering was the tendency of DAPC analysis to assign all genotypes to a single cluster compared to ADMIXTURE, which assigned admixed genotypes to multiple clusters based on K values (Tables S2 and S4).

Hierarchical Clustering and Network Analysis
A phylogenetic tree was further generated that grouped 803 genotypes into four major clusters ( Figure S1) with several subclusters within Cluster 1 (Green) and Cluster 2 (Red). Cluster 1 was the largest cluster consisting of a mixture of genebank landraces (400), market varieties (26), and breeding lines (311) corresponding to Cluster 3 of DAPC clustering at K = 3 and Clusters 3 and 4 of ADMIXTURE clustering at K = 4. Clusters 3 (Black) and 4 (Blue) were comparatively smaller groups consisting mainly very few breeding lines and market varieties, respectively. In addition, a network analysis was further carried out to assess the genetic relationship between the three genetic groups (breeding lines, genebank landraces, and market varieties). Although the phylogenetic tree grouped 803 genotypes into four distinct clusters, the grouping was unable to highlight any particular pattern for the three genetic groups under study. On the contrary, the network analysis among these genetic groups ( Figure 5) revealed a centralized structure and genetic contribution of each genotype within a genetic group (genebank landrace, breeding lines, and market varieties). The network analysis between breeding lines and market varieties showed no direct genetic relationship among them, although some of the breeding lines may have been used directly as market varieties ( Figure 5A), since there is a tendency with the farmers/traders to re-name the genotypes/breeding lines using their own naming system, which is associated with a popular market variety name. The naming system is also complex and depends upon the region within the country. For example, one of the popular market varieties in Nigeria is Hembakwase, which corresponds to several breeding lines (TDr 09/00023) and is associated with other market varieties such as Makakuasa and Omi_efun. Similarly, market variety Alumaco_1 was found to be similar to genebank landrace TDr 3584 and other market varieties such as Alumaco_2, TDr_Adaka, and TDr_Idu_Ekpeye. In addition, market variety TDr_Ehuru was identical to the breeding line TDr 08/00628. The network analysis between breeding lines and genebank landraces ( Figure 5B) showed a strong genetic relationships indicating the use of genebank landraces in the yam breeding program to generate the selected breeding lines. This can be further elucidated from Table S1, wherein the pedigree information of the breeding lines has been provided. The central core of the QGRAPH ( Figure 5B) represented a set of genebank landraces that were genetically similar and probably not used in the yam breeding program to generate breeding lines. Similarly, the network analysis between genebank landraces and market varieties indicated that these are genetically closer ( Figure 5C). These findings highlighted the genetic relationships among different genetic groups of D. rotundata and the limited use of genebank landraces and market varieties in the breeding program.
(breeding lines, genebank landraces, and market varieties). Although the phylogenetic tree grouped 803 genotypes into four distinct clusters, the grouping was unable to highlight any particular pattern for the three genetic groups under study. On the contrary, the network analysis among these genetic groups ( Figure 5) revealed a centralized structure and genetic contribution of each genotype within a genetic group (genebank landrace, breeding lines, and market varieties). The network analysis between breeding lines and market varieties showed no direct genetic relationship among them, although some of the breeding lines may have been used directly as market varieties ( Figure 5A), since there is a tendency with the farmers/traders to re-name the genotypes/breeding lines using their own naming system, which is associated with a popular market variety name. The naming system is also complex and depends upon the region within the country. For example, one of the popular market varieties in Nigeria is Hembakwase, which corresponds to several breeding lines (TDr 09/00023) and is associated with other market varieties such as Makakuasa and Omi_efun. Similarly, market variety Alumaco_1 was found to be similar to genebank landrace TDr 3584 and other market varieties such as Alumaco_2, TDr_Adaka, and TDr_Idu_Ekpeye. In addition, market variety TDr_Ehuru was identical to the breeding line TDr 08/00628. The network analysis between breeding lines and genebank landraces ( Figure 5B) showed a strong genetic relationships indicating the use of genebank landraces in the yam breeding program to generate the selected breeding lines. This can be further elucidated from Table S1, wherein the pedigree information of the breeding lines has been provided. The central core of the QGRAPH ( Figure 5B) represented a set of genebank landraces that were genetically similar and probably not used in the yam breeding program to generate breeding lines. Similarly, the network analysis between genebank landraces and market varieties indicated that these are genetically closer ( Figure 5C). These findings highlighted the genetic relationships among different genetic groups of D. rotundata and the limited use of genebank landraces and market varieties in the breeding program.

Discussion
The present study dissected the genetic relationships between different genetic groups (breeding lines, genebank landraces, and market varieties) of white yam/white guinea yam (D. rotundata) so that diverse genetic materials are utilized in the yam breeding program to introgress gene(s) of interest for its improvement. Despite several studies to assess the genetic diversity of white yam, little is known about the population structure or genetic diversity in contrast to other crops. A previous study [31] used a total of 94 landrace/gene bank accessions across seven guinea yam species to understand genetic diversity and their evolution. The allelic diversity, admixed patterns, and differential genome-wide population structure assayed by GBS-SNPs in three diverse genetic groups of D. rotundata further implied their efficacy in genomics-assisted breeding applications, which is one of the food security crops in the yam belt of West and Central Africa. The 3432 SNP markers identified in the current study were distributed across 21 pseudo chromosomes as per Tamiru et al. [37]. Sansaloni et al. [57] concluded in their study that the clustering of SNPs within certain regions across the genome is an issue due to the reduced representation method used in developing the probes in GBS sequencing while resulting in low genome coverage.
This study elucidated that the majority of genetic variance exists within countries instead of between countries in the D. rotundata core collection. This was evident in DAPC clustering wherein the landraces from different countries were grouped together (Table S1). Several other studies including cowpea did not observe significant correlation between molecular clustering and geographic origins of genebank landraces [58]. The DAPC method revealed more clusters than ADMIXTURE (Tables S3 and S4), while the latter method provided information on genotypes with ancestries. The DAPC approach relies on discriminant functions and maximizes the diversity between clusters and minimizes within-cluster diversity [44]. However, DAPC-based clustering was found to be less efficient in clonally propagated crops such as white yam because of their continuous and complex population structure. This has been also reported in other clonally propagated crops such as cassava [59]. In general, DAPC cluster membership assignment was in agreement with ADMIXTURE clusters. Admixed ancestry was observed among breeding lines, which probably reflected their complex breeding history involving open pollination and bi-parental crossing coupled with strong adaptive selection pressure [60]. The admixed ancestry observed in the genebank landraces could be due to complex domestication patterns of D. rotundata landraces during evolutionary divergence. It has been explicitly demonstrated that West Africa at the forest/savannah ecotone is the cradle of yam domestication [8,11]. Furthermore, the inclusion of diverse landraces as common parents in the yam improvement program to develop/breed for valuable agronomic traits and higher yield might have influenced their population group assignment, resulting in numerous admixtures among these breeding lines (Table S1). In the present study, we have successfully unraveled the underlying genetic relationships and population structures among different genetic groups through network analysis. In the absence of complete pedigree records (across several years of breeding) or where breeding lines were selected from open-pollinated seeds, the dissection of genetic relationships among different genetic groups through network analysis was a reliable process. This has been successfully elucidated in cassava [59], wherein GBS-SNPs and complementary cluster analysis were used to assess the population structure and variety identification.
In clonal crops such as yams, improved varieties or breeding lines are often generated through intergenerational crosses (mainly bi-parental), which depends again on the flowering behavior of the parents used in those crosses across years. This has necessitated the use of the same parents with known flowering behavior in crossing programs at the Yam Breeding Unit, thus narrowing down the genetic base used. The genetic diversity observed among genebank landraces was high based on the Shannon-Weaver index, observed and expected heterozygosity, and this could be attributed to the extent of diversity captured within the core collection [37] and the percent of unique accessions (94%) ( Table 2) identified in the present study. The genebank landraces were distributed across all the clusters generated through DAPC, ADMIXTURE, and phylogenetic tree, representing a significant amount of genetic diversity in the D. rotundata core collection. Hence, the diverse genebank landraces can be used as parents after the preliminary evaluation of flowering behavior and trait profiling, in white yam breeding programs to broaden the genetic base. Furthermore, this study was able to identify the complex naming system followed by farmers/traders to market varieties that were independently collected from different markets in Nigeria and Ghana, resulting in discrepancies from synonymy and homonymy in the tracking of released breeding lines when relying on use of names alone. Therefore, the is a need for a broader study including market varieties from a wide range of markets from different regions within Nigeria and Ghana to establish the inconsistencies with varietal names and its effect on the formal seed system in yams.

Conclusions
In conclusion, our study on white yam/white guinea yam represented a larger set of genotypes representing different genetic groups such as genebank landraces/breeding lines/market varieties. The genetic relationships dissected among different genetic groups in this study could be further explored in the white yam improvement by identifying diverse parents to generate mapping populations for target traits. Further studies are clearly needed to introgress gene(s) of target traits for white yam improvement. This study confirmed the reliability and accuracy of high-density SNP markers generated from next-generation sequencing-based genotyping coupled with complementary statistical analyses for genetic diversity and population structure.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4395/10/9/1437/s1, Table S1. List of genotypes analyzed with information on their name, respective genetic group, geographic origin/pedigree, and results of DAPC clustering. Table S2. Identity by state (IBS)-based dissimilarity matrix. Table S3. Summary of DAPC cluster groups at different K values. Figure S1. Determination of optimal number of ADMIXTURE clusters using cross-validation error rates for K = 2 to K = 50. Table S4. Summary of ADMIXTURE cluster groups at different K values. Figure S2. Phylogenetic tree consisting of 803 genotypes using 3432 SNPs.

Acknowledgments:
The authors would like to thank Ibukun Ogunleye and other Bioscience Center staff at IITA for the collection of samples and DNA extraction. We acknowledge Andreas Gisel and Yusuf Muideen for their critical advice on statistical analyses. The authors would also like to thank the Boyce Thompson Institute (BTI) Ithaca, Cornell University for providing server space for raw data deposit.

Conflicts of Interest:
The authors declare no conflict of interest.
Data Availability: The datasets generated during and/or analysed during the current study are available on yambase (https://yambase.org).