Genotyping by Sequencing Reasserts the Close Relationship between Tef and Its Putative Wild Eragrostis Progenitors

The genus Eragrostis consists of 350 species, including tef (Eragrostis tef (Zucc.) Trotter), the only cultivated species in this genus. Very little is known about the genetic potential of these species for tef improvement and genomics research. Here, we investigated a germplasm panel consisting of 40 Eragrostis species and 42 tef lines with single nucleotide polymorphism (SNP) data generated using the genotyping by sequencing (GBS) protocol. Thousands of SNPs were identified genome-wide from the germplasm panel. High-quality SNPs were used to assess sequence similarity and/or divergence, genetic diversity, population structure, and phylogenetic relationships. Mapping individual reads to the tef reference genome revealed that of the 40 wild Eragrostis species included in this study, E. pilosa, E. aethiopica, E. obtusa, E. ferruginea, E. lugens, and E. lehmanniana had 92% of their sequences represented in the tef reference genome. In the maximum likelihood phylogenetic analysis, these wild species clearly showed grouping in the clade consisting of the entire tef germplasm. Population structure analysis showed two major clusters consistent with the germplasm class information and the inferred phylogenetic relationships. The wild Eragrostis species were more diverse than the tef cultivars and could therefore potentially be used to enrich the tef gene pool. The SNP dataset and the results documented here are taxonomically the most inclusive to date and could be a useful informational tool for the design of genomics-informed tef breeding and research.


Introduction
Tef is the socially and agriculturally dominant crop in Ethiopia. Its dominance over other major cereal crops such as wheat and barley comes from its resilience to poor growth conditions [1], highest market prices among cereals [2], and use as human food and animal feed. These qualities and uses make tef the iconic national cereal of Ethiopia. It is estimated that injera, the pancake-like fermented flatbread produced from tef flour, is consumed daily by over 70 million people in Ethiopia. Annually, tef is cultivated on over 3 million hectares of land, with an estimated annual production of over 4.5 million tons [3]. In Ethiopia, tef is cultivated mainly during the main growing season, from July/August to October/December. In some areas, tef is also cultivated in the short rainy season, from February to May/June. Arguably, tef is becoming a globally important cereal. Tef contains 3% fat, 11% protein, and 80% complex carbohydrate [4]. Lysine is an indispensable amino acid in humans and animals. In general, lysine is low in all cereal proteins and hence a limiting amino acid in cereal-based diets [5]. The importance and benefits of lysine in the diet include maintenance and growth. Nutritionally, tef contains higher amounts of the essential amino acids, its lysine content is higher than that of all other cereals except rice and oats, and its mineral content is substantial [1]. Tef has been shown to be gluten-free [6], and for this reason the demand for tef products is increasing globally.
The sequencing of the tef genome, the first indigenous Ethiopian crop to be sequenced, was a milestone in the history of genomic studies on Eragrostis species. Understanding the genome of this allotetraploid species will shed light on its evolutionary history. The genome sequence was obtained from libraries constructed from the genomic DNA of the improved variety of tef, Tsedey (DZ-Cr-37), using the Illumina HiSeq 2000 and 454-FLX pyro-sequencing platforms. The tef genome is one of the few genomes sequenced from crops regarded as "orphan", which have tremendous social and economic importance for millions of people living in developing countries. The assembly level of the genome is scaffold, and it contains 11,509 scaffolds and 2548 contigs obtained from a total of 40 Gbp single-and paired-end sequencing reads. The size of the assembly is 672 Mbp, with 54-genome coverage and a scaffold N50 of 85 kbp. This size is 58 Mb smaller than the genome of sorghum (Sorghum bicolor L.), 242 Mb larger than that of rice (Oryza sativa L.), 125 Mb greater than that of foxtail millet (Setaria italica), and about 25 times smaller than that of bread wheat (Triticum aestivum L.).
Species in the genus Eragrostis are primarily tropical and subtropical in distribution and are most commonly found in weedy areas and dry habitats [8]. Except for tef (Eragrostis tef ), which originated and has been domesticated in Ethiopia [9], none of the 349 species in this genus are cultivated for human consumption and hence are considered to have little economic importance. Thus these species have received little research attention. However, this situation seems to have changed in recent years as tef researchers started to harness the potential of the wild Eragrostis species as a source of novel variability and for evolutionary studies targeting the close relatives of tef.
In recent years, analyses of SNPs using next-generation sequencing protocols have commonly been used in genetic and genomic studies such as genome-wide association studies, population genomic analysis, construction of genetic linkage maps, reconstruction of phylogenetic relationships, and identification of quantitative trait loci. Genotyping by sequencing (GBS) is one such NGS-based genotyping protocol, which works by cutting genomic DNA with a methylation-sensitive enzyme and sequencing the genomic regions flanked by the restriction enzyme in hundreds of individual samples simultaneously [10]. Compared to other NGS-based protocols such as the restriction site-associated DNA (RAD-seq) protocol [11], the GBS protocol is a simplified and cheaper alternative technology. This protocol has been adopted for several crop plants and for a range of applications including rice [12][13][14], wheat [15], barley [15,16], sorghum [17], maize [18][19][20], pearl millet [21], chickpea [22], cultivated oat [23], barnyard millet [24], cotton [25], grape [26], common bean [27], and switch grass [28].
To address a range of diversity, breeding, conservation, and evolutionary questions regarding tef, data covering the wild species, preferably including those suggested as close relatives, is needed. Here, we report the application of the GBS protocol to a panel comprising 82 accessions (40 wild Eragrostis species, 31 tef cultivars, 1 mutant line, and 10 improved varieties) ( Table S1). The objectives of this study were (1) to identify single nucleotide polymorphisms (SNPs) genome-wide, (2) to use the SNP dataset to perform genomic diversity analyses in order to determine phylogenetic relationships and the population structure of the panel, and (3) to assess the potential of GBS as a cost-effective alternative platform for tef genomics research. In this work, we report the results of the first application of the GBS protocol on selected species of the genus Eragrostis.

DNA Extraction, Library Preparation, and Genotyping by Sequencing
Seedlings of each germplasm were grown under 12 h light at 24 • C and 12 h dark at 18 • C with 65% relative humidity in the growth room at the Institute of Plant Sciences, University of Bern, Switzerland. After 4 weeks, 100 mg of leaf tissue was harvested and genomic DNA was isolated using the CTAB (Cetyl trimethylammonium bromide) method [32]. DNA concentrations were normalized to 30 ng/µL on a 96-plex PCR plate and shipped to the Institute for Genomic Diversity, Cornell University (Ithaca, NY, USA) for library preparation and sequencing [10]. Genomic DNA from each germplasm was digested using ApeKI restriction enzyme, and unique barcodes were attached to each sequence of the individual germplasm. The ApeKI Eragrostis library on the 96-plex plate (95 samples and a blank) was then sequenced on an Illumina HiSeq 2500 platform.

GBS Raw Data Processing
The sequence reads of the 95 samples along with the key file listing barcodes for the samples and the plate layout were downloaded from the sequencing platform and processed using the TASSEL-GBS pipeline for species with a reference genome [33]. First, the pipeline identified and removed bad-quality reads and reads that did not contain adaptors or Ns in the useful part of the sequence, and retained high-quality reads. The trimmed reads were then used to generate unique tags.

Mapping Reads to the Tef Reference Genome and SNP Calling
The unique tags generated in the previous step were exported in fastq format for mapping to the tef reference genome using the Burrows-Wheeler Aligner (BWA) [34]. The Sequence Alignment Maps (SAMs) generated after read-mapping were further processed to call SNPs using the DiscoverySNPCaller plugin embedded within the pipeline. This plugin called SNPs across the individual samples and produced the SNP dataset in HDF5 format, which was then converted into Variant Call Format (VCF) for quality filtering and extraction of useful statistics using VCFtools [35] and SAMtools [34]. The SNP dataset was filtered by the number of alleles to include only biallelic sites (min-alleles = 2 and max-alleles = 2); max-missing (coverage) 50%, 60%, 70%, 80%, 90%, and 100%; and sites with minor allele frequency (MAF) 0.05. According to SAMtools, when the max-missing flag is specified with a value of 1, for example, it means that no missing data is allowed for all the individuals, corresponding to 100% coverage of all sites.

Mapping Reads to the Tef Pseudo-Chromosomes
To visualize the distribution of SNPs in the tef genome, reads were mapped to each of the 10 individual tef pseudo-chromosome assemblies [7] using the BWA. After mapping, 10 separate VCF files per pseudo-chromosome were generated. These VCF files were processed with custom scripts and R (https://www.R-project.org/). The pseudo-chromosomes were divided into equal-size chunks using the R software packages plyr and dplyr. Using the ggplot2 function [36], the SNPs were plotted against their respective physical positions on each of the 10 pseudo-chromosomes.

Population Structure Analysis
The population structure of the panel was determined using 3 clustering approaches. First, principal component analysis (PCA) was performed. The SNP dataset in VCF format was converted into the genomic data structure (GDS) data storage format using two high-performance computing R/Bioconductor packages, gdsfmt and SNPRelate [37]. Next, the dataset was LD-pruned as recommended in the SNPRelate package so that only SNPs that were in approximate linkage equilibrium with each other were used, to avoid the strong influence of SNP clusters in principal component and relatedness analysis. The LD-pruned SNP data was then used to calculate the genetic covariance matrix from the genotypes, compute the correlation coefficients, and calculate SNP eigenvectors. The names of individual germplasms and the population codes (cultivars, improved variety, mutant, and wild species) were used as input together with the GDS file. Then, the first 2 and 4 principal components were plotted.
Second, multidirectional scaling analysis was performed. For this analysis, an n × n matrix of genome-wide average identity-by-state pairwise distances were generated from the SNP scores in GDS format using the snpgdsIBS function of the SNPRelate R package. The population structure information contained in the n × n distance matrix was plotted to visualize the structure.
Third, for admixture analysis, 2 programs were used. The ADMIXTURE [38] program is a tool for maximum likelihood estimation of individual ancestries from multi-locus SNP genotype datasets. For each K, the number of ancestral populations, the program generates 2 output files, the ancestry fractions and the allele frequencies of the inferred ancestral populations. If the number of ancestral populations is unknown, the program includes a cross-validation procedure that allows the user to identify the value of K for which the model has best predictive accuracy. The value of K that exhibits a low cross-validation error is chosen compared with other K values.
To display the population structure, individual Q-matrices (for the respective Ks) were plotted in R using the bar plot function. We also determined population structure using the fastStructure [39] program, which was developed for inferring population structure from large SNP genotype data. For fastStructure, the plink files were used as input and the expected admixture proportions inferred were plotted with the distruct.py tool provided by the software. Population fixation statistics (F ST ) and nucleotide diversity (π) were calculated using the PopGenome package in R.

Molecular Phylogenetic Analysis
For phylogenetic analysis, a pair of primers were designed from the waxy gene: forward (5 TGCGAGCTSGACAACATCATGC3 ) and reverse (5 CGGCCACGTTCTCCYTGGCGAG3 ). PCR was performed using the DNA isolated from E. aethiopica, E. ferruginea, E. lehmanniana, E. lugens, E. obtusa, E. pilosa #223260, and E. tef cv Tsedey. The PCR condition was 40 cycles at 95 • C for 30 s, 61 • C for 30 s, and 72 • C for 80 s. PCR products were cloned into plasmids and sequenced using a Sanger sequencer.
Phylogenetic analysis was performed using 2 programs, the Randomized Axelerated Maximum Likelihood (RAXML) program [40] with the general time-reversible model of nucleotide evolution and the gamma model of rate variation, and molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods (MEGA) [41]. Trees were visualized using Dendroscope [42] and the MEGA software. The tree presented in the results is from MEGA, for better visualization.

Genotyping by Sequencing of the ApeKI Eragrostis Species Library
We generated sequencing targets within the germplasm panel by digesting the DNA samples from each individual with the ApeKI restriction enzyme. Sequencing of these targets on an Illumina HiSeq 2000/2005 generated about 237 million single-end reads (Table S2). The number of reads per germplasm ranged from 1.6 to 4.0 million (Table S3, Figure S1), with a mean sequencing depth of about 2.5 million reads. Three of the 95 samples (Eragrostis trichodes, Hamrawi-murri, and Jano) were excluded from further analysis due to very low read counts.
High-quality reads from each individual sample were collapsed into tags. Of these, 75% were mapped to physical positions on the indexed tef reference genome. Because of the mosaic of species included in our panel, we were interested to know the proportion of reads mapping to the reference. The reads from the tef germplasm were represented in the tef reference genome with mapping rates of >93% (Table S3). In contrast, reads from most of the wild Eragrostis species showed a mapping rate in the range of 56% (E. acutiglumis) to 94.5% (E. pilosa). Interestingly, six wild species, E. pilosa, E. aethiopica, E. lehmannania, E. ferruginea, E. lugens, and E. obtusa, demonstrated 91-95% mapping rates, close to the mapping rates of the tef germplasm. Next, we probed the mapping files with the TASSE-GBS "SNPDiscovery" pipeline and identified a total of 419,999 SNPs. For subsequent analysis, this SNP dataset was filtered to contain biallelic sites, with minor allele frequency (MAF) 0.05 and ranges of coverage (50%, 60%, 70%, 80%, 90%, and 100%) across the germplasm ( Figure S2).

Number of SNPs Correlates with Chromosome Length
To see how chromosome length affects the number of SNPs discovered, we carried out a correlation test. We examined the SNP data generated by mapping the reads to the tef pseudo-chromosome assembly and computed the Pearson's product-moment correlation analysis using the R software corrplot package (https://github.com/taiyun/corrplot). We found a trend of increasing number of SNPs as the length of the pseudo-chromosome (from here on called pseudomolecule) increased, with the number of SNPs positively and strongly correlated (R 2 = 0.98) with pseudomolecule length ( Figure S3).

SNP Distribution along the 10 Tef Pseudomolecules
In order to study the genomic distribution of SNPs in the tef genome, we mapped the reads to the pseudo-chromosome assembly. Table S4 shows that the longest tef pseudomolecule had almost twice the average number of SNPs identified on the entire pseudomolecule assembly, while the shortest pseudomolecule had roughly one-third of the total average. We plotted the SNPs from individual pseudomolecules against their respective physical positions, as shown in Figure 1, where the number of SNPs is shown for each position of the 10 tef pseudo-chromosomes. We found that the distribution of SNPs over a sliding window of 106 Mb represented by the spectrum of the vertical bar graph was nonuniform. The number of SNPs per Mb ranged from 169 in pseudomolecule 5 to 262 in pseudomolecule 4 (Table S4). However, SNP density variation between pseudomolecules appeared to be constant. In all pseudomolecules, we found regions with either elevated or very low SNP density. This pattern likely corresponds to the properties of the genomic region. Hence characterizing these regions as coding, noncoding, intron, or exon regions will increase our understanding of the patterns of SNP variation.  (Table S2). The number of reads per germplasm ranged from 1.6 to 4.0 million (Table S3, Figure S1), with a mean sequencing depth of about 2.5 million reads. Three of the 95 samples (Eragrostis trichodes, Hamrawi-murri, and Jano) were excluded from further analysis due to very low read counts.
High-quality reads from each individual sample were collapsed into tags. Of these, 75% were mapped to physical positions on the indexed tef reference genome. Because of the mosaic of species included in our panel, we were interested to know the proportion of reads mapping to the reference. The reads from the tef germplasm were represented in the tef reference genome with mapping rates of >93% (Table S3). In contrast, reads from most of the wild Eragrostis species showed a mapping rate in the range of 56% (E. acutiglumis) to 94.5% (E. pilosa). Interestingly, six wild species, E. pilosa, E. aethiopica, E. lehmannania, E. ferruginea, E. lugens, and E. obtusa, demonstrated 91-95% mapping rates, close to the mapping rates of the tef germplasm. Next, we probed the mapping files with the TASSE-GBS "SNPDiscovery" pipeline and identified a total of 419,999 SNPs. For subsequent analysis, this SNP dataset was filtered to contain biallelic sites, with minor allele frequency (MAF) 0.05 and ranges of coverage (50%, 60%, 70%, 80%, 90%, and 100%) across the germplasm ( Figure S2).

Number of SNPs Correlates with Chromosome Length
To see how chromosome length affects the number of SNPs discovered, we carried out a correlation test. We examined the SNP data generated by mapping the reads to the tef pseudochromosome assembly and computed the Pearson's product-moment correlation analysis using the R software corrplot package (https://github.com/taiyun/corrplot). We found a trend of increasing number of SNPs as the length of the pseudo-chromosome (from here on called pseudomolecule) increased, with the number of SNPs positively and strongly correlated (R 2 = 0.98) with pseudomolecule length ( Figure S3).

SNP Distribution along the 10 Tef Pseudomolecules
In order to study the genomic distribution of SNPs in the tef genome, we mapped the reads to the pseudo-chromosome assembly. Table S4 shows that the longest tef pseudomolecule had almost twice the average number of SNPs identified on the entire pseudomolecule assembly, while the shortest pseudomolecule had roughly one-third of the total average. We plotted the SNPs from individual pseudomolecules against their respective physical positions, as shown in Figure 1, where the number of SNPs is shown for each position of the 10 tef pseudo-chromosomes. We found that the distribution of SNPs over a sliding window of 106 Mb represented by the spectrum of the vertical bar graph was nonuniform. The number of SNPs per Mb ranged from 169 in pseudomolecule 5 to 262 in pseudomolecule 4 (Table S4). However, SNP density variation between pseudomolecules appeared to be constant. In all pseudomolecules, we found regions with either elevated or very low SNP density. This pattern likely corresponds to the properties of the genomic region. Hence characterizing these regions as coding, noncoding, intron, or exon regions will increase our understanding of the patterns of SNP variation.

Principal Component Analysis Captures the Genetic Differentiation between Tef and Wild Eragrostis Species
To get an idea of the number of ancestral populations (K) to use in our population structure analysis, we first used principal component analysis (PCA). PCA reduced the dimension in our data, with the first two principal components together explaining about 60% of the variation in the dataset ( Figure 2). We found one major cluster containing the tef cultivars, improved tef varieties, and the mutant line GA10 (circled in red). This is consistent with the germplasm class information (Table S1). This cluster also contains some wild Eragrostis species that were previously suggested to be close to tef. PCA failed to find a clear structure in the wild species subpopulation, with the germplasm showing a large dispersion. However, two subclusters are apparent in the top and bottom corners of the PCA plot.

Principal Component Analysis Captures the Genetic Differentiation between Tef and Wild Eragrostis Species
To get an idea of the number of ancestral populations (K) to use in our population structure analysis, we first used principal component analysis (PCA). PCA reduced the dimension in our data, with the first two principal components together explaining about 60% of the variation in the dataset ( Figure 2). We found one major cluster containing the tef cultivars, improved tef varieties, and the mutant line GA10 (circled in red). This is consistent with the germplasm class information (Table  S1). This cluster also contains some wild Eragrostis species that were previously suggested to be close to tef. PCA failed to find a clear structure in the wild species subpopulation, with the germplasm showing a large dispersion. However, two subclusters are apparent in the top and bottom corners of the PCA plot.

Principal Component Analysis Captures the Genetic Differentiation between Tef and Wild Eragrostis Species
To get an idea of the number of ancestral populations (K) to use in our population structure analysis, we first used principal component analysis (PCA). PCA reduced the dimension in our data, with the first two principal components together explaining about 60% of the variation in the dataset ( Figure 2). We found one major cluster containing the tef cultivars, improved tef varieties, and the mutant line GA10 (circled in red). This is consistent with the germplasm class information (Table S1). This cluster also contains some wild Eragrostis species that were previously suggested to be close to tef. PCA failed to find a clear structure in the wild species subpopulation, with the germplasm showing a large dispersion. However, two subclusters are apparent in the top and bottom corners of the PCA plot.

Population Structure in the Genus Eragrostis
To perform population structure analysis, we used the ADMIXTURE program, which estimates the structure of ancestral populations, and the fastStructure program. Both programs identified a similar structure in the panel that matched the results of the PCA, with the most likely assignment occurring at K = 2, meaning two ancestral populations. Each ancestral population is shown in a different color within each plot (Figure 3 for K = 2 and Figure S4 for other values of K). Our analysis did not detect population structure among the tef subpopulations. The first distinct subgroup in the first half of the structure plot is composed of all the tef cultivars, improved tef varieties, and the mutant line; the other half of the plot consists of the wild Eragrostis species. Our analysis indicated both long and short stretches of mixed ancestry for some of the wild Eragrostis species, while the genomic composition of the tef cultivars appears to be homogeneous. This suggests that at K = 2, the germplasm panel shows structure that summarizes the germplasm class information (Table S1).

Population Structure in the Genus Eragrostis
To perform population structure analysis, we used the ADMIXTURE program, which estimates the structure of ancestral populations, and the fastStructure program. Both programs identified a similar structure in the panel that matched the results of the PCA, with the most likely assignment occurring at K = 2, meaning two ancestral populations. Each ancestral population is shown in a different color within each plot (Figure 3 for K = 2 and Figure S4 for other values of K). Our analysis did not detect population structure among the tef subpopulations. The first distinct subgroup in the first half of the structure plot is composed of all the tef cultivars, improved tef varieties, and the mutant line; the other half of the plot consists of the wild Eragrostis species. Our analysis indicated both long and short stretches of mixed ancestry for some of the wild Eragrostis species, while the genomic composition of the tef cultivars appears to be homogeneous. This suggests that at K = 2, the germplasm panel shows structure that summarizes the germplasm class information (Table S1).

Molecular Phylogenetic Analysis Grouped Six Wild Species within the Tef Cultivars Clade
To infer the phylogenetic relationships among the Eragrostis species in the panel, molecular phylogenetic analysis was performed. Maximum likelihood estimation of the phylogenetic tree resulted in a clear separation between the tef cultivars and the wild Eragrostis species (Figure 4). However, out of the 40 wild Eragrostis species included in the study, six species (E. pilosa, E. aethiopica, E. lehmanniana, E. lugens, E. obtusa, and E. ferruginea) fell within the tef clade (red points) with strong bootstrap support. The resolution of our tree did not show the presence of intraspecific differences among the tef cultivars or even between the wild species that were grouped in this clade. The improved tef varieties that were direct selections from the tef collections and introgressions and the mutant line that was developed from an improved tef variety also fell within the tef cultivars clade. This clustering pattern partly reflects the high genetic similarity of the tef cultivars, and hence the Each vertical bar represents one germplasm, while each color represents the inferred ancestral population based on K clusters (in this case K = 2). For each germplasm, columns fully colored with only one color represent genetic homogeneity, while columns with mixed colors illustrate admixture.

Molecular Phylogenetic Analysis Grouped Six Wild Species within the Tef Cultivars Clade
To infer the phylogenetic relationships among the Eragrostis species in the panel, molecular phylogenetic analysis was performed. Maximum likelihood estimation of the phylogenetic tree resulted in a clear separation between the tef cultivars and the wild Eragrostis species (Figure 4). However, out of the 40 wild Eragrostis species included in the study, six species (E. pilosa, E. aethiopica, E. lehmanniana, E. lugens, E. obtusa, and E. ferruginea) fell within the tef clade (red points) with strong bootstrap support. The resolution of our tree did not show the presence of intraspecific differences among the tef cultivars or even between the wild species that were grouped in this clade. The improved tef varieties that were direct selections from the tef collections and introgressions and the mutant line that was developed from an improved tef variety also fell within the tef cultivars clade. This clustering pattern partly reflects the high genetic similarity of the tef cultivars, and hence the narrow genetic base of the tef improvement process. We subsampled the data for the wild Eragrostis species and inferred a separate tree to visualize their relationships ( Figure S5). narrow genetic base of the tef improvement process. We subsampled the data for the wild Eragrostis species and inferred a separate tree to visualize their relationships ( Figure S5).

Wild Species Show High Level of Genetic Differentiation Compared to the Tef Cultivars Subpopulation
To examine the genetic diversity and differentiation within and among the subpopulations in our germplasm panel, we estimated fixation index (FST) [43] and nucleotide diversity (π) [44]. The average nucleotide diversity was π = 0.0047 for the tef cultivars, nearly equal to that for the improved tef varieties, π = 0.0040. On the other hand, the nucleotide diversity was higher for the wild Eragrostis species, π = 0.3457, than for the tef cultivars. The low nucleotide diversity estimates of the tef cultivars closely matched those estimated using the RAD-seq data (not shown). When all the species were considered together for the nucleotide diversity estimation, the average estimate was larger (π = 0.2183) than the estimate for the tef cultivars alone. Pairwise genotypic differentiation measured as FST for tef cultivars and wild species as well as improved tef varieties and wild species pairs were FST = 0.468 and FST = 0.381, respectively. In contrast, the genotypic differentiation among the tef cultivars subpopulations was FST = 0.002.

Phylogeny Tree from the Waxy Gene
To investigate the relationships among the different subgenomes, waxy gene was cloned from E. aethiopica, E. ferruginea, E. lehmanniana, E. lugens, E. obtusa, E. pilosa, and improved tef variety Tsedey. A maximum-likelihood tree was constructed using PhyML ( Figure 5), and the clades labeled A-E were consistent with those of the neighbor-joining waxy tree reported previously [45], although the

Wild Species Show High Level of Genetic Differentiation Compared to the Tef Cultivars Subpopulation
To examine the genetic diversity and differentiation within and among the subpopulations in our germplasm panel, we estimated fixation index (F ST ) [43] and nucleotide diversity (π) [44]. The average nucleotide diversity was π = 0.0047 for the tef cultivars, nearly equal to that for the improved tef varieties, π = 0.0040. On the other hand, the nucleotide diversity was higher for the wild Eragrostis species, π = 0.3457, than for the tef cultivars. The low nucleotide diversity estimates of the tef cultivars closely matched those estimated using the RAD-seq data (not shown). When all the species were considered together for the nucleotide diversity estimation, the average estimate was larger (π = 0.2183) than the estimate for the tef cultivars alone. Pairwise genotypic differentiation measured as F ST for tef cultivars and wild species as well as improved tef varieties and wild species pairs were F ST = 0.468 and F ST = 0.381, respectively. In contrast, the genotypic differentiation among the tef cultivars subpopulations was F ST = 0.002.

Phylogeny Tree from the Waxy Gene
To investigate the relationships among the different subgenomes, waxy gene was cloned from E. aethiopica, E. ferruginea, E. lehmanniana, E. lugens, E. obtusa, E. pilosa, and improved tef variety Tsedey. A maximum-likelihood tree was constructed using PhyML ( Figure 5), and the clades labeled A-E were consistent with those of the neighbor-joining waxy tree reported previously [45], although the branching deep within the tree was not highly supported. New waxy sequences from E. lehmanniana, E. obtusa, E. pilosa, and Tsedey had copies in different subgenomes. The E. pilosa seeds were obtained from the US Department of Agriculture and appeared heterogeneous, which may explain why our new E. pilosa appeared in the D clade and not in A or B, as reported previously [45]. Most interesting is the placement of E. aethiopica in the A subgenome of tef along with the A subgenome of E. pilosa. This is strong evidence for E. aethiopica being a diploid progenitor of either tef or E. pilosa.  [45]. Most interesting is the placement of E. aethiopica in the A subgenome of tef along with the A subgenome of E. pilosa. This is strong evidence for E. aethiopica being a diploid progenitor of either tef or E. pilosa.

Genotyping by Sequencing Enabled Comprehensive Genomic Analysis of Eragrostis Species
GBS-generated SNPs provided useful genome-scale data to perform genomic variation, high-density linkage mapping, and phylogenetic and population genomic analysis for various crops [12,[15][16][17]22,24,46]. However, no such study exists on Eragrostis species.
In the present study, we surveyed the genomes of selected Eragrostis species panel using the GBS protocol in combination with the tef reference genome and pseudomolecule assembly. Thousands of SNPs were discovered from the panel, which is composed of the tef cultivars, improved tef varieties, a mutant line, and the wild Eragrostis species.

Genomic Distribution of GBS-SNPs in the Tef Genome
The genomic distribution of SNPs across tef's pseudomolecules was uneven, with moderate SNP density per Mb (Figure 1), and was largely in agreement with results reported for various crop species, including rice [47,48], wheat [15,49], common bean [27], soybean [50,51], barley [15], cabbage [52], chickpea [53], and cotton [46]. As part of an ongoing investigation, examining the relationship between the patterns of SNP distribution and/or density and the presumed functional consequences on genes in the different parts of the tef genome is suggested.

Sequence Divergence between Tef Cultivars and Putative Wild Progenitors
Using genome-scale GBS data, we argue that the high sequence similarity (>92%) between E. pilosa, E. aethiopica, E. lugens, E. ferrugenia, E. lehmanniana, and E. obtusa and tef show that these species could be close relatives of tef. We confirmed this similarity in our phylogenetic analysis by showing the grouping of these six species with the tef cultivars. As diploid species, it seems likely that E. aethiopica, E. lugens, and E. lehmanniana [54] can potentially be the diploid progenitors of tef. We propose that one could perform comparative genomic analysis to determine if these species are indeed the diploid progenitors. Such analysis could shed light on the obscured identity of the diploid subgenomes inside tef.

Low Nucleotide Diversity in the Tef Species
One of the measures of genetic variability is nucleotide diversity (π), which is defined as the number of differences per nucleotide site between any two randomly chosen sequences from a population. Nucleotide diversity in major cereal crops such as wheat, maize, and barley has been reduced by domestication [55] and can fall to around 40% of the diversity of wild relatives. Tef is a strictly self-pollinating chasmogamous crop with 0.1% to 1% outcrossing [56] and shows a low-molecular but wide range of phenotypic diversity, reflecting adaptations to different agro-ecologies [1,57]. In earlier studies, nucleotide diversity was shown to be low in cultivated tef. For instance, haplotype analysis in 31 tef accessions showed low nucleotide diversity in all loci of the rht1 (π = 0.003) and sd1 (π = 0.0008) dwarfing genes [58]. Our result is in agreement with this result in that the tef species in our panel show low average nucleotide diversity (π = 004) genome wide, with relatively small population differentiation between subpopulations, despite the germplasm in each subpopulation coming from contrasting agro-ecologies. In contrast, the wild species show higher nucleotide diversity (π = 0.021). Our result is also in agreement with the low nucleotide diversity estimates that have consistently been shown for cultivated species compared to their wild counterparts [59,60], thus supporting the idea that domestication reduces nucleotide diversity at the genomic level [61][62][63]. To expand our knowledge and quantify the nucleotide diversity of the tef cultivars in more detail, one could examine nucleotide diversity among agronomically useful candidate genes.
Population genetic studies provide insight into the evolutionary processes that influence the nature and distribution of sequence variants within and among wild populations [64], and fixation index (F ST ), first defined by [65], is among the most widely used measures of genetic differentiation within and among populations. In theory, F ST ranges from 0 (no differentiation between the overall population and its subpopulations) to 1 (complete population differentiation). The self-pollinating nature of tef plants coupled with the redundant use of same cultivars over a broad range of agro-ecologies suggests that population differentiation in tef could be poorly defined. Consistent with this hypothesis, estimates of Wright's F ST in this study show that tef landraces are poorly (F ST = 0.002) and slightly (F ST = 0.01) differentiated from the landrace subpopulations and improved tef varieties, respectively. Naturally, this result suggests that the genetic background of the improved tef varieties is mainly composed of the landraces, and that tef improvement through selection from the landraces might have affected only certain loci. The potential contribution of wild species to broaden genetic variability in the tef species was demonstrated [66], hence could be of further interest to tef breeders.

Phylogenetic Analyses of Eragrostis Species Using Genome-Scale Data Reasserts Previously Reported Single-Gene-Based Analyses
Phylogenetic studies based on single-gene sequence analysis have shown the close relationship of E. pilosa and tef [45]. However, the consistency of species phylogenies derived from comparisons of single genes is debated, due to the impact of horizontal gene transfer [67] and highly variable rates of evolution [68]. The availability of genome-scale data allows the construction of a phylogeny that is less sensitive to such inconsistencies and more representative of whole genomes than are single-gene trees [69]. Moreover, genome-scale data is more advantageous than single-gene-based phylogenetic analysis, as the latter does not capture enough variation among species, since conserved genes have few polymorphic loci [45,70].
Sequence-based evidence for a phylogenetic relationship between the wild Eragrostis species and tef comes from [45]. There are key differences between that study and ours. The authors used 10 wild Eragrostis species, which were suggested to be the progenitors of tef in previous studies [45,71,72]. However, our species sampling was not constrained a priori to using only the species included in the Ingram and Doyle (2003) study, but included more wild species, including E. aethiopica, which consistently showed close relationships with tef cultivars in five previous studies (Table 1). They used the nuclear gene waxy and the plastid gene rps16 for phylogenetic analysis. In contrast, we used SNPs discovered genome-wide. In their analysis, Ingram and Doyle showed that alleles from E. pilosa 4.2PI213255 and E. pilosa 4.7, PI221926 were grouped together with tef cultivars in a clade designated as A (reflecting the A subgenome). However, only one of the alleles (that of E. pilosa 4.2PI213255) grouped with tef cultivars in clade B. The allele from E. pilosa 4.7, PI221926 did not show up in clade B on the tree. Table 1. Studies on the evolution and phylogenetic relationships between tef and the wild Eragrostis species. E. pilosa (1) and E. aethiopica (2) stand out as the two most consistent species identified as close relatives of tef. Of the five studies listed, only the last two studies used molecular data. Our tree was not constructed with this level of resolution, because we were not able to distinguish between reads coming from the A and the B subgenomes. However, it not only corroborates the grouping inferred by Ingram and Doyle, but also identifies a mosaic of additional wild species that show grouping with tef. We were unable to achieve tree resolution as reported in the previous study. We made intraspecific comparisons using SNP data generated from orthologous sequences from individual germplasms. The dataset was informative enough to resolve the phylogenetic tree at least into a tef-specific clade that included most of the wild species suggested as close relatives with strong bootstrap support. Nonetheless, it was not variable enough to give us a better resolution to depict intraspecific differences.

Reference
We suggest that the next step for tef genomics research should be experimental validation of a subset of the SNPs and examination of the functional consequences of gene-specific variation on useful agronomic traits. It is important to note that our analysis is based on the SNP data generated from reads that were mapped to a single position in the tef reference genome. We also used only biallelic sites, despite the mosaic of ploidy levels within our germplasm panel. Another limitation is that, due to the limited knowledge we had of most of the wild Eragrostis species, it was difficult to make sense of their grouping patterns. However, the data generated here will serve as a starting point for further sequence-based analysis, possibly assisted by detailed phenotyping of the wild species.

Coupling the Potential of the Wild Eragrostis Species with Tef Breeding
Many investigators have elaborated on the importance and use of wild species for crop improvement in the face of increasing human population and climate change [74][75][76]. However, determining the likely value of wild species for crop improvement requires the collection and subsequent characterization of the phenotypic, phonologic, and genomic diversity within the species and understanding their genetics. The role of the wild Eragrostis species in tef research has been insignificant. Even after the timing of the floral openings in tef was discovered 40 years ago [77], the cross-compatibility of tef is still restricted to only one wild species out of the 350 Eragrostis species. The interspecific hybrids or recombinant inbred lines between E. tef and E. pilosa have demonstrated their worth and, indirectly, that of the wild Eragrostis species in general, by improving the resolution of the genetic linkage map of tef [78]. With the cross-compatibility rate maximum of 1%, the genetic diversity of tef will remain restricted to itself. In addition, almost all of the previous genetic diversity studies solely used the tef cultivars [57]. We argue that our work changed this scenario by generating genomic data from 40 wild Eragrostis species. We revealed greater genomic diversity in these species than in the tef species. To further exploit the potential of these wild Eragrostis species, including them in the tef crossing program could be of further interest, albeit without hybridization issues. In addition, the tef breeding resource base is being expanded with the addition of mutant lines developed through TILLING and may prove essential for the future of tef breeding.

Deciphering the Diploid Pieces of the Allotetraploid Tef Genome
Allotetraploid plant species originate when the genomes of diploid species are brought together in hybrids and then duplicated, and in such species the genomes of the diploid parents become homologous subgenomes [79]. As the tef genome is an allotetraploid species, knowledge of its composition and evolution is crucial for tef genomics research and has important practical applications for tef breeding. Both sequence-based genomic analysis and genetic methods are expected to improve this understanding. Tef has two diploid subgenomes, designated as A and B, which are estimated to have diverged 4.0 MYA [7] and 6.4 MYA [58]. No direct wild progenitor or diploid ancestors of tef have yet been identified. In addition, our knowledge about the evolutionary history of the tef species has not been well organized. About nine distinct studies have attempted to find out the wild progenitors and/or close relatives of tef (Table 1). Despite differences in the data and species used in these studies, two species, E. aethiopica (diploid) and E. pilosa (tetraploid), stand out as the most consistent candidates. Although we did not confirm the ploidy level of E. aethiopica and E. pilosa, these species have high sequence similarity to the tef genome. Assessing the genetic legacy of these species for the evolution of the cultivated tef genome may first require comparative genomics of the two subgenomes within tef and parallel comparative genomics of these putative diploid species and tef. For this purpose, we believe that separating the A and B subgenomes is crucial.
The separation of homologous subgenomes has been approached in different ways in polyploid plant species. A recently developed program for durum wheat separates original contigs obtained by RNAseq into two homologous sequences based on maximum likelihood optimization [80]. To separate the subgenomes of the octoploid progenitors of cultivated strawberry Fragaria virginiana and Fragaria chiloensis, dense linkage maps generated by targeted sequence capture were implemented [81]. With this approach, the subgenomes of the wild octoploid progenitors of cultivated strawberry could be disentangled. Alternatively, as linkage map-based methods are showing promise, their application to the case of tef is worthwhile. Once this is resolved, comparative genomic analysis of the subgenomes and the putative diploid progenitors identified in our study and elsewhere will likely shed light on the tef identity crisis.
In general, the above-mentioned features render tef a difficult taxon for genomic studies and could hamper modern tef breeding efforts. While challenges such as identifying the exact identity of the two diploid subgenomes remain to be addressed, a framework through which there is interplay of the possible species toward the allotetraploid tef genome should not be too far off. By structuring the genetic and phylogenetic information on this species into a framework, we present a pathway depicting the two likely routes by which the tetraploid tef genome has evolved. This pathway ( Figure 6) consists of the species suggested previously along with the species that were identified in the current study. This enables a more focused and framework-oriented approach, ultimately informing tef breeding and genomics research. genomics of the two subgenomes within tef and parallel comparative genomics of these putative diploid species and tef. For this purpose, we believe that separating the A and B subgenomes is crucial.
The separation of homologous subgenomes has been approached in different ways in polyploid plant species. A recently developed program for durum wheat separates original contigs obtained by RNAseq into two homologous sequences based on maximum likelihood optimization [80]. To separate the subgenomes of the octoploid progenitors of cultivated strawberry Fragaria virginiana and Fragaria chiloensis, dense linkage maps generated by targeted sequence capture were implemented [81]. With this approach, the subgenomes of the wild octoploid progenitors of cultivated strawberry could be disentangled. Alternatively, as linkage map-based methods are showing promise, their application to the case of tef is worthwhile. Once this is resolved, comparative genomic analysis of the subgenomes and the putative diploid progenitors identified in our study and elsewhere will likely shed light on the tef identity crisis.
In general, the above-mentioned features render tef a difficult taxon for genomic studies and could hamper modern tef breeding efforts. While challenges such as identifying the exact identity of the two diploid subgenomes remain to be addressed, a framework through which there is interplay of the possible species toward the allotetraploid tef genome should not be too far off. By structuring the genetic and phylogenetic information on this species into a framework, we present a pathway depicting the two likely routes by which the tetraploid tef genome has evolved. This pathway ( Figure  6) consists of the species suggested previously along with the species that were identified in the current study. This enables a more focused and framework-oriented approach, ultimately informing tef breeding and genomics research. In this hypothesis, tef is suggested to be a shattering domesticate of this intermediate progenitor, with several studies pointing to E. pilosa. Whether the evolution of the cultivated tef genome followed the A or B route in this pathway, knowing the identities of the two diploid progenitors could be central to the future of tef genomics research, and comparative genomics will be the key. The symbol "?" indicates what is not known or suggested so far. In this hypothesis, tef is suggested to be a shattering domesticate of this intermediate progenitor, with several studies pointing to E. pilosa. Whether the evolution of the cultivated tef genome followed the A or B route in this pathway, knowing the identities of the two diploid progenitors could be central to the future of tef genomics research, and comparative genomics will be the key. The symbol "?" indicates what is not known or suggested so far.

Phylogeny Tree from the Waxy Gene
Two different datasets were used to assess the relationships between tef and the wild species in this study. One is a study of the waxy gene tree, which has the advantage that the A and B sequences can be separated and show the subgenomes contributing to each species. The disadvantage is that gene-tree phylogeny does not necessarily reflect the species tree. However, it is usually a good estimate if no unusual hybridization or lineage sorting has occurred.
The GBS method has the advantage that it samples the entire genome. The disadvantage of GBS is that the SNPs coming from different subgenomes have not been separated, so only an average over all subgenomes is seen. The waxy gene tree and the GBS tree are consistent in that E. aethiopica and E. pilosa are within the tef clade.
The phylogenetic tree constructed from the waxy gene ( Figure 5) reproduces the basic topology of Ingram and includes new sequences. As in Ingram, the closest Eragrostis to the B genome is E. heteromera, which is outside the tef/pilosa clade but a very close diploid. A new addition to the A clade is E. aethiopica, which falls within the clade containing tef and E. pilosa. This is strong evidence for E. aethiopica being a very close diploid to the A genome and the best candidate as a diploid progenitor for the A genome.
The waxy sequences of E. obtusa, E. ferruginea, E. lugens and E. lehmaniana are not close to the clades containing tef in the waxy tree presented here. E. lehmaniana is not close to tef in the waxy tree of Ingram. E. lugens has a D genome in the Ingram tree. We have two E. lugens sequences, which are not near the lugens sequence of Ingram or either E. tef clade.

Conclusions
In this study, the SNP data generated using the GBS protocol provides a useful molecular resource to facilitate tef improvement. The wild Eragrostis species demonstrated high genetic diversity and could prove essential in enriching the tef gene pool. The putative wild progenitors of tef, including diploids, showing high sequence similarity to the tef genome are clustered with the tef cultivars in the phylogenetic tree. Given the limited funding available, this could help minimize the species included in further genomic studies. The data generated here represents the most taxonomically inclusive genomic resource developed from Eragrostis species to date and demonstrates the potential of GBS as an alternative genotyping platform for tef genomics research for crop researchers with limited resources for genome sequencing. It also provides genome-scale genomic resources and framework to inform and guide additional genomic studies of the species for tef breeding research. The phylogenetic tree using the waxy gene suggests that E. aethiopica and E. pilosa are the closest relatives to tef, with E. heteromera the closest known diploid to the tef B genome and E. aethiopica within the tef clade of the A genome. To fully exploit the GBS data, sequencing of all subgenomes of the Eragrostis clade is vital for future diversity studies.
Supplementary Materials: The following are available online at http://www.mdpi.com/1424-2818/10/2/17/s1: Table S1. Germplasm class information and source; Table S2. Summary of the ApeKI Eragrostis GBS; Table S3. Number of read counts and mapping rate to the tef genome; Table S4. Number of SNPs detected on individual pseudomolecules and SNP density; Figure S1. Mean sequencing depth of the ApeKI Eragrostis library generated by the GBS protocol; Figure S2. SNP coverage statistics; Figure S3. SNP number vs. pseudomolecule length; Figure  S4. Population structure at different clustering levels; Figure S5. Phylogenetic tree of the wild Eragrostis species.