New Transcriptome-Based SNP Markers for Noug (Guizotia abyssinica) and Their Conversion to KASP Markers for Population Genetics Analyses

The development and use of genomic resources are essential for understanding the population genetics of crops for their efficient conservation and enhancement. Noug (Guizotia abyssinica) is an economically important oilseed crop in Ethiopia and India. The present study sought to develop new DNA markers for this crop. Transcriptome sequencing was conducted on two genotypes and 628 transcript sequences containing 959 single nucleotide polymorphisms (SNPs) were developed. A competitive allele-specific PCR (KASP) assay was developed for the SNPs and used for genotyping of 24 accessions. A total of 554 loci were successfully genotyped across the accessions, and 202 polymorphic loci were used for population genetics analyses. Polymorphism information content (PIC) of the loci varied from 0.01 to 0.37 with a mean of 0.24, and about 49% of the loci showed significant deviation from the Hardy-Weinberg equilibrium. The mean expected heterozygosity was 0.27 suggesting moderately high genetic variation within accessions. Low but significant differentiation existed among accessions (FST = 0.045, p < 0.0001). Landrace populations from isolated areas may have useful mutations and should be conserved and used in breeding this crop. The genomic resources developed in this study were shown to be useful for population genetics research and can also be used in, e.g., association genetics.


Introduction
Noug (Guizotia abyssinica (L. f.) Cass.) is among the cultivated species of the family Asteraceae. Similar to other Guizotia species, noug is diploid (2n = 30) with relatively small chromosomes [1][2][3]. It has a larger genome size (1C = 3.8 pg) than its closely related congeners despite having the same chromosome number [4]. For example, the genome size of G. scabra ssp. schimperi, a suggested progenitor of noug, is only 55.3% of that of noug [4]. The smaller chromosomes of G. scabra ssp. schimperi when compared with that of noug [1,3] explains the larger genome size in the latter. An increase in the noug genome over its evolutionary and domestication period is likely due to genetic mechanisms such as gene duplication [5,6] and retrotransposition [7].
Noug is an underutilized but economically important minor oilseed crop mainly cultivated in Ethiopia and India. It is considered as a semi-domesticated crop with its center of origin and diversity in Ethiopia that was later introduced to India and other countries. Its cultivation outside Ethiopia and India covers countries such as Congo, Eritrea, Malawi, Sudan, Tanzania, Uganda, and Zimbabwe in Africa, and Bangladesh, Bhutan, and Nepal in Asia, as well as the Caribbean islands and the USA [8][9][10][11]. Noug has been cultivated mainly with low inputs, as it is recognized as a crop that can grow in poor and waterlogged soils vis-à-vis many other crops. Its ability to grow under poor agricultural management conditions makes it a good candidate for subsistence farming [10,11]. Within Ethiopia, the crop is cultivated at altitudes ranging from 1200 to 2700 m above sea level (masl) but the main cultivation areas are within the range of 1600 to 2200 masl [10,12]. The noug seed oil content can be as high as 50% and the oil is mainly composed of palmitic, stearic, oleic, and linoleic acids, with linoleic acid commonly accounting for more than 65% [12,13]. It has a significant dietary contribution as an important source of seed proteins, carbohydrates, minerals, vitamins, and fiber, in addition to its oil [10,14,15]. These nutrients are obtained through the consumption of whole seeds after being processed in various forms for their unique nutritional, cultural, and medicinal values [16]. These diverse local uses make it the most popular edible oilseed crop in Ethiopia.
Noug is the second-largest oil crop according to both harvest area and total production in Ethiopia, only surpassed by sesame [11]. However, its seed yield is lower than several other edible oil crops grown in the country. Traditional breeding efforts for improving seed yield and related traits have seen low successes, and only a few cultivars have been released [17]. In plant breeding programs, the selection of diverse germplasm possessing desirable characteristics and understanding differences between breeding materials are crucial steps towards cultivar development. Hence, the development and utilization of genome-wide markers are highly desirable to determine and manage genetic diversity within gene pools of crops. In line with this, expressed sequence tags (ESTs) and microsatellite (SSR) markers have been developed for noug [18], and a number of molecular marker-based studies have been conducted mainly to understand its population genetics [19][20][21][22]. However, molecular breeding has not been implemented in noug, mainly because available genomic resources and tools are highly limited. Publicly available genomic resources for noug that include expressed sequence tags (ESTs), complete chloroplast genome, and gene sequences used for phylogenetic studies can be found at the National Center for Biotechnology Information (NCBI) database (https: //www.ncbi.nlm.nih.gov/nuccore/?term=Guizotia%20abyssinica).
Single nucleotide polymorphism (SNP) markers are the most recent and popular DNA markers with a diverse use in the analyses of the genomes of crops. SNP markers are amenable to high throughput genotyping by sequencing technologies and have been extensively used for genotyping of various crops for different applications, including genetic diversity analyses, genome-wide association research, and genetic linkage mapping [23][24][25][26][27][28][29][30][31][32]. However, high-throughput SNP genotyping is less suitable and not cost-effective when the number of target SNPs is in the hundreds or less. In such cases, a relatively low-cost genotyping approach, such as the Kompetitive Allele-Specific PCR (KASP) assay is preferable. SNPs converted to KASP markers can be genotyped in small labs using fluorescent resonance energy transfer (FRET)-capable plate readers and qPCR machines. SNP markers have been successfully converted to KASP markers and used for various applications in different crops, including the faba bean [33], mung bean [34], pea [35], peanut [36], rice [37,38], rye [39], sorghum [40], and wheat [41][42][43].
The objectives of the present study were (1) developing new genomic resources for noug for various applications through transcriptome sequencing of noug genotypes, SNP discovery based on the transcript sequences and converting the SNPs to KASP markers, and (2) genotyping of noug accessions using the newly developed SNP/KASP markers for population genetics analyses.

Plant Material
Two self-compatible noug breeding lines developed from landrace populations C19 and K13 [44] were used for transcriptome sequencing. The two self-compatible lines are significantly different from one another in several characteristics, including earliness, seed shape and size, oil content, and fatty acid composition, and hence were considered as suitable germplasm for SNP discovery. For example, K13 is characterized by early maturity, larger and shinier seeds, lower oil content, and higher oleic acid content when compared with C19. Twenty-four noug accessions grown in Ethiopia (Table S1) comprising 21 landrace populations, two released cultivars, and one breeding population (developed through crossbreeding genotypes with a high oil content) were used for genotyping using newly developed SNP/KASP markers.

RNA Extraction, Transcriptome Sequencing and Assembly, and SNP Calling
The two self-compatible lines were planted in a greenhouse at the Swedish University of Agricultural Sciences (SLU), Alnarp, for RNA extraction. Total RNA was separately extracted from leaf tissue of two-week-old seedlings of the two self-compatible lines using a spectrum plant total RNA kit (Sigma-Aldrich, Stockholm, Sweden). The extracted RNA was treated by Qiagen RNase free DNase (Qiagen, Stockach, Germany) to get rid of any DNA. The RNA samples were then sent to the University of California (Davis) and library preparation, transcriptome sequencing, assembly, and SNP analysis were conducted at the Genome Sciences Center. A non-normalized cDNA library was prepared following the Illumina's guidelines and sequencing was done on the Illumina GAII, as described for the dahlia in Hodgins et al. [45]. Transcriptome assembly was done as described for Illumina sequences in Hodgins et al. [45]. A total of 4781 previously developed transcript sequences based on USDA-ARS accession PI 508077, which were functionally annotated using the Arabidopsis Information Resource (TAIR) [18] were used as a reference for SNP discovery.
After transcriptome assembly, SNP calling was made through aligning the transcript sequences to the reference sequences, and mapping the reads to the aligned sequences using BWA [46], SAMtools [47], and in-house Perl scripts. The reads of each genotype were mapped separately using the BWA aligner to generate two BAM (binary version of sequence alignment/map format) files. The SNPs between the two genotypes were determined using SAMtools. This was followed by generating a genotype table using custom-written Perl scripts in which the corresponding nucleotides of the reference transcripts are lined up with the genotype call for the two genotypes. High-quality SNPs were then obtained by filtering SNPs with at least 6× coverage. Among these, SNPs that are homozygous for both parents, having a mapped BLAST (Basic Local Alignment Search Tool) hit in the sunflower (Helianthus annuus), and only varying from the reference in one parent were further selected. SNPs that varied from the reference in only one parent were targeted in order to use them for genotyping of a mapping population developed using these genotypes as parents. These SNPs were further filtered by eliminating SNPs with varying sites within a 10-bp range on either side or SNPs with indels in either parent. All cases where more than one noug gene hits the same mapped Helianthus contig were also removed to avoid close paralogs. This filtering procedure resulted in 959 SNPs within 628 noug contigs. The accession numbers and full sequences of the 628 reference transcripts as well as the SNP sites, reference, and alternate alleles at each of 959 SNP loci are provided in Table S2.

Planting, Sampling, and DNA Extraction
The 24 noug accessions were planted in a greenhouse at SLU, Alnarp, for DNA extraction. A young leaf tissue was collected from two-week-old seedlings separately from individual plants for each accession using a LGC plant sample collection kit (KBS-9370-001) provided by LGC-Genomics (https://biosearch-cdn.azureedge.net/assetsv6/Plant-leaf-kit.pdf). Each accession was represented by 12 individuals except NG099 and NG108 (Table S1) that had 7 and 10 individuals, respectively. The 281 samples, representing the 24 accessions, collected into three deep-well plates (96-well), were then sent to LGC-Genomics (Hoddesdon, UK) where DNA extraction and genotyping were conducted. High-quality genomic DNA suitable for KASP genotyping was extracted using the sbeadex plant kit (https://www.biosearchtech.com/products/extraction-and-purification-reagents/dna-purificationkits/sbeadex-kits) at the LGC-Genomics facility.

Competitive Allele-Specific PCR (KASP) Assay Design and Genotyping
A file containing the 628 reference transcript sequences, the reference and alternate alleles of the 959 SNP loci, and their positions in the reference sequences were provided to LGC-Genomics. These data were then used for designing a competitive allele-specific PCR (KASP) assay for each SNP locus. Among the 959 SNP loci, a KASP assay was successfully designed for 931 loci, whereas 28 of them failed the KASP assay design. Hence, 931 SNP loci were used for the genotyping of the 281 samples. The genotypic data was received from LGC-genomics and analyzed using LGC's KlusterCaller software (https://www.biosearchtech.com/support/tools/genotyping-software/klustercaller) and the genotype clusters were viewed using LGC's SNP-viewer application (https://www.biosearchtech.com/support/ tools/genotyping-software/snpviewer) that displays the two homozygotes and the heterozygote classes as separate clusters plate by plate.

Statistical Analyses
Various population genetics parameters were estimated using different statistical software. Genetic diversity indices were estimated for each accession or locus using Popgen32 [48], GenAlEx version 6.5 software [49], Arlequin ver 3.5 [50], and MEGA7 [51]. Pogen32 was used to calculate percent polymorphic loci and gene flow for each accession across all loci as well as the observed and effective number of alleles and allele frequency for each locus across all accessions. GenAlEx was used to calculate the Shannon information index, observed heterozygosity, standard and unbiased expected heterozygosity, and fixation indices for each accession. Arlequin was used to calculate Theta under the stepwise mutation model [52]. Tajima-Nei model [53] based estimates of average evolutionary divergence for each population were calculated using MEGA7. Arlequin was also used for the analysis of molecular variance (AMOVA) and for the Hardy-Weinberg equilibrium (HWE) test. MEGA7 was used for neighbor-joining-based cluster analysis using evolutionary distances computed based on the Tajima-Nei method [53], whereas GenAlEx was used for principal coordinate analysis (PCoA).
The program STRUCTURE (v. 2.3.4) [54] was used for population structure analysis using all loci and individuals. In this Bayesian approach-based analysis, an admixture model with a length of burn-in period of 100,000 and number of Markov chain Monte Carlo (MCMC) replications of 100,000 was used. The structure analysis was run for K = 1 to K = 15 with 20 runs at each K. The output of STRUCTURE was then used as input data for the STRUCTURESELECTOR program [55] for (1) determination of the optimal number of genetic clusters (K) using the ∆K method of Evanno et al. [56], and (2) graphical representation of the population genetic structure of the 24 accessions for the optimum K value through the application of the integrated CLUMPAK program of Kopelman et al. [57].

The KASP Genotyping and the SNP Loci
Genotypic data across 202 SNP loci (Table S2) were used for population genetics analyses of the 24 accessions. Among the 931 SNPs targeted in the KASP genotyping, 554 were successfully genotyped across the 24 accessions, while 377 failed. Hence, the success rate of the KASP genotyping in this study is 59.5%. Of the 554 loci successfully genotyped, 286 (51.6%) were monomorphic across all individuals, whereas 268 (48.4%) were polymorphic. Among the 268 polymorphic loci, 66 of them (24.6%) had >10% missing data and were not used for further data analysis.
The effective number of alleles across the 202 loci varied from 1.43 to 1.48 with a mean of 1.45. Expected heterozygosity (He) varied from 0.01 to 0.50 with a mean of 0.29 whereas polymorphism information content (PIC) varied from 0.01 to 0.38 with a mean of 0.24 ( Figure 1; Table S3). A large variation was observed in fixation indices among the SNP loci. The minimum, maximum, and mean values for F IS were −0.90, 1.00, and 0.13, for F IT were −0.89, 1.00, and 0.21, and for F ST were 0.01, 0.24, and 0.10, respectively ( Figure 1A). The estimate of gene flow (Nm) for each locus showed wide variation, ranging from 0.80 to 36.65, with a mean of 3.17 ( Figure 1B; Table S3).
Genes 2020, 11, x FOR PEER REVIEW 5 of 23 The effective number of alleles across the 202 loci varied from 1.43 to 1.48 with a mean of 1.45. Expected heterozygosity (He) varied from 0.01 to 0.50 with a mean of 0.29 whereas polymorphism information content (PIC) varied from 0.01 to 0.38 with a mean of 0.24 ( Figure 1; Table S3). A large variation was observed in fixation indices among the SNP loci. The minimum, maximum, and mean values for FIS were −0.90, 1.00, and 0.13, for FIT were −0.89, 1.00, and 0.21, and for FST were 0.01, 0.24, and 0.10, respectively ( Figure 1A). The estimate of gene flow (Nm) for each locus showed wide variation, ranging from 0.80 to 36.65, with a mean of 3.17 ( Figure 1B; Table S3). The Hardy-Weinberg equilibrium (HWE) test revealed that 50.7% of the loci are at HWE whereas 49.3% of loci showed significant deviation from HWE ( Figure 2; Table S3). A total of 43.4% of the loci showed heterozygote deficiency with 33.5% and 9.9% showing highly significant (p < 0.01) and significant (0.01 < p < 0.05) deviation, respectively. On the other hand, 5.9% of the loci showed excess heterozygosity with 4.4% and 1.5% showing highly significant (p < 0.01) and significant (0.01 < p < 0.05) deviation, respectively. Examples of the SNP loci showing a highly significant deviation from HWE and the description of their corresponding sunflower homologs are provided in Table 1. Twelve and four of them represent loci with heterozygote excess and deficiency, respectively. Interestingly, 10 of the 12 loci that showed heterozygote excess lacked one of the three possible genotypes expected in a bi-allelic polymorphic locus under the assumption of HWE. Similarly, all four loci with heterozygote excess lacked one of the three possible genotypes. The minor allele frequency (MAF) of the four and 12 loci ranged from 0.185 to 0.470 and 0.120 to 0.470, respectively. Among these 16 SNP loci, the change in amino acid sequences of the corresponding genes was obtained in only one locus (locus 3143A). The other 15 SNPs are synonymous substitutions (Table 1). In the case of the nonsynonymous substitution, the SNP resulted in a Serine/Arginine exchange (Table 1). The Hardy-Weinberg equilibrium (HWE) test revealed that 50.7% of the loci are at HWE whereas 49.3% of loci showed significant deviation from HWE ( Figure 2; Table S3). A total of 43.4% of the loci showed heterozygote deficiency with 33.5% and 9.9% showing highly significant (p < 0.01) and significant (0.01 < p < 0.05) deviation, respectively. On the other hand, 5.9% of the loci showed excess heterozygosity with 4.4% and 1.5% showing highly significant (p < 0.01) and significant (0.01 < p < 0.05) deviation, respectively. Examples of the SNP loci showing a highly significant deviation from HWE and the description of their corresponding sunflower homologs are provided in Table 1. Twelve and four of them represent loci with heterozygote excess and deficiency, respectively. Interestingly, 10 of the 12 loci that showed heterozygote excess lacked one of the three possible genotypes expected in a bi-allelic polymorphic locus under the assumption of HWE. Similarly, all four loci with heterozygote excess lacked one of the three possible genotypes. The minor allele frequency (MAF) of the four and 12 loci ranged from 0.185 to 0.470 and 0.120 to 0.470, respectively. Among these 16 SNP loci, the change in amino acid sequences of the corresponding genes was obtained in only one locus (locus 3143A). The other 15 SNPs are synonymous substitutions (Table 1). In the case of the non-synonymous substitution, the SNP resulted in a Serine/Arginine exchange (Table 1).
The analysis of molecular variance (AMOVA) was conducted without grouping the accessions as well as by grouping them according to their geographic region of origin or altitudinal range of their collection sites ( Table 3). The analysis showed that 95.5% of the total variation accounted for variation within accessions whereas 4.5% accounted for the variation among them (F ST = 0.045, p < 0.0001). The vast majority of the within accession variation (93.9%) was attributed to the variation within individuals (heterozygosity). Hierarchical AMOVA was conducted by grouping 21 of the 24 accessions (excluding the two cultivars and the breeding population) into (1) three altitudinal groups, (2) six geographical regions (regions-I), and (3) two geographical regions (regions-II) ( Table 3). However, only 0.12% of the total variation accounted for the variation among the altitudinal groups, which is not significant (F CT = 0.001 and p = 0.19). Similarly, there was no significant differentiation between the six regions-I groups (F CT = −0.001 and p = 0.67). However, there was significant differentiation between the two regions-II groups (Oromia vs. Amhara-Tigray) (F CT = 0.002 and p = 0.047).   (Table S1).
AMOVA-based F ST was also computed for each pair of the 24 accessions, and 268 of the 276 pairs (97.1%) showed significant differentiation between them with F ST values ranging from 0.017 to 0.124 (Table 4). Only eight pairs failed to show significant differentiation (F ST ranging from 0.006 to 0.012). The lowest and the highest F ST values were recorded for NG111 vs. NG123 and NG096 vs. NG097, respectively ( Table 4). The mean F ST values that reflect the average differentiation of each accession from all other accessions ranged from 0.028 (Shambu) to 0.076 (NG097). Accessions NG096 and NG124 also showed higher differentiation from the other accessions having mean F ST values of 0.074 and 0.072, respectively. On the other hand, Fogera and NG123 are among the least differentiated accessions with mean F ST values of 0.031 and 0.033, respectively.  One hundred twenty-six individuals having genotypic data for all loci (no missing values) were selected across the 24 accessions and the evolutionary distance between each pair of individuals was computed for loci with MAF below 0.3 using Tajima-Nei method [53]. The neighbor-joining cluster analysis based on the evolutionary distances between the individuals resulted in three major clusters and several sub-clusters (Figure 3). Except in a few cases, individuals from the same accessions were placed in more than one cluster. For example, accession NG105 was represented by eight individuals, of which three, three, and two individuals were placed in cluster-I, II, and III, respectively. On the other hand, all six individuals of Fogera, a cultivar, were clustered in cluster-I and all four individuals of NG111 were placed in cluster-III. The Tajima-Nei method [53] based evolutionary distance was also used for neighbor-joining cluster analysis at the accession level using three sets of loci: all polymorphic loci, only loci with MAF of <0.3, and only loci with MAF of ≥0.3 ( Figure 4). As depicted in Figure 4A-C, the clustering patterns of the accessions are clearly different among the three data sets. The analysis revealed that the clustering patterns of the accessions according to their geographic regions or altitudinal range of origin were poorly defined.
One hundred twenty-six individuals having genotypic data for all loci (no missing values) were selected across the 24 accessions and the evolutionary distance between each pair of individuals was computed for loci with MAF below 0.3 using Tajima-Nei method [53]. The neighbor-joining cluster analysis based on the evolutionary distances between the individuals resulted in three major clusters and several sub-clusters (Figure 3). Except in a few cases, individuals from the same accessions were placed in more than one cluster. For example, accession NG105 was represented by eight individuals, of which three, three, and two individuals were placed in cluster-I, II, and III, respectively. On the other hand, all six individuals of Fogera, a cultivar, were clustered in cluster-I and all four individuals of NG111 were placed in cluster-III. The Tajima-Nei method [53] based evolutionary distance was also used for neighbor-joining cluster analysis at the accession level using three sets of loci: all polymorphic loci, only loci with MAF of <0.3, and only loci with MAF of ≥0.3 ( Figure 4). As depicted in Figure 4A-C, the clustering patterns of the accessions are clearly different among the three data sets. The analysis revealed that the clustering patterns of the accessions according to their geographic regions or altitudinal range of origin were poorly defined.  (Tajima and Nei 1984). The individual samples were coded in a way that the first two or three digits/letters represent their accessions and the last two-digit numbers represent the codes for the plant in that accession. The accession names are given without the two initial letters (NG), and Fog and Sha represent Fogera and Shambu, respectively. Individuals represented by the same shape and color belong to the same accession.  (Tajima and Nei 1984). The individual samples were coded in a way that the first two or three digits/letters represent their accessions and the last two-digit numbers represent the codes for the plant in that accession. The accession names are given without the two initial letters (NG), and Fog and Sha represent Fogera and Shambu, respectively. Individuals represented by the same shape and color belong to the same accession.  [53]. The accession names are given without the two initial letters (NG) for 22 of the 24 accessions, and "Fog" and "Sha" represent Fogera and Shambu, respectively. Accessions represented by the same shape and color belong to the same region, and accessions represented with the same color tree-line belong to the same altitudinal range.
Principal coordinate analysis (PCoA) was also conducted to determine the relationship between the accessions. In the two-dimensional plot generated, the first and the second coordinates explained 24% and 20% of the total variation among the accessions ( Figure 5A). This analysis revealed that most of the accessions were tightly clustered together suggesting low differentiation among them. However, accessions NG097, NG124, and NG096 were clearly separated from the other accessions in this two-dimensional plot.
The admixture model-based population genetic structure analysis conducted using STRUCTURE [54] and STRUCTURESELECTOR programs [55] revealed that the optimal number of genetic clusters (K) is three, as per the ΔK method of Evanno et al. [56] ( Figure 5B). Hence, the 281 individuals representing the 24 accessions were reduced to three genetic populations. The graphical representation of the population genetic structure of the 24 accessions at K = 3, clearly showed that all accessions have alleles that originated from the three clusters (genetic populations), thus suggesting low differentiation between the accessions. In line with the results of the PCoA, accessions NG096, NG097, and NG124 showed higher differentiation than the other accessions in this analysis. The alleles of NG096, NG097, and NG124 were predominantly originated from cluster-2 (orange), cluster-3 (purple), and cluster-1 (blue), respectively ( Figure 5C).  [53]. The accession names are given without the two initial letters (NG) for 22 of the 24 accessions, and "Fog" and "Sha" represent Fogera and Shambu, respectively. Accessions represented by the same shape and color belong to the same region, and accessions represented with the same color tree-line belong to the same altitudinal range.
Principal coordinate analysis (PCoA) was also conducted to determine the relationship between the accessions. In the two-dimensional plot generated, the first and the second coordinates explained 24% and 20% of the total variation among the accessions ( Figure 5A). This analysis revealed that most of the accessions were tightly clustered together suggesting low differentiation among them. However, accessions NG097, NG124, and NG096 were clearly separated from the other accessions in this two-dimensional plot.
The admixture model-based population genetic structure analysis conducted using STRUCTURE [54] and STRUCTURESELECTOR programs [55] revealed that the optimal number of genetic clusters (K) is three, as per the ∆K method of Evanno et al. [56] ( Figure 5B). Hence, the 281 individuals representing the 24 accessions were reduced to three genetic populations. The graphical representation of the population genetic structure of the 24 accessions at K = 3, clearly showed that all accessions have alleles that originated from the three clusters (genetic populations), thus suggesting low differentiation between the accessions. In line with the results of the PCoA, accessions NG096, NG097, and NG124 showed higher differentiation than the other accessions in this analysis. The alleles of NG096, NG097, and NG124 were predominantly originated from cluster-2 (orange), cluster-3 (purple), and cluster-1 (blue), respectively ( Figure 5C).

Discussion
The present study used transcriptome sequencing for the development of novel SNP markers for noug, which were used to design KASP assays and for genotyping 24 Ethiopian noug accessions. The fact that transcript sequences of only two genotypes were used for SNP discovery and the multi-step stringent procedures followed to identify and filter the SNP markers led to a relatively small number of SNPs (959). The use of a few genotypes as an SNP discovery panel can lead to an ascertainment bias [58,59]. This means that our SNP discovery approach may have left rare alleles in noug gene pool corresponding to the target transcripts undiscovered. However, the 202 polymorphic loci used in this study included loci with minor allele frequency (MAF) ranging from below 1% (rare alleles) to almost 50%, and hence the SNP discovery approach is not expected to affect the results of population genetics analyses. Although 97% of these SNPs passed the KASP assay design, only 59.5% of them were successfully used for genotyping the 24 accessions. In other words, 377 SNPs that passed the assay design (40.5%) failed at the genotyping stage. The failure could be because the SNPs do not exist in the germplasm targeted for genotyping or the primers failed to anneal to the target sequences due to sequence variation. In this study, only bi-allelic SNPs developed based on the two genotypes were used. If such SNPs are restricted to a small subset of the crop's gene pool, it is highly likely that they fail (including the case when a nucleotide at SNP locus is different from both alleles) when applied to a diverse germplasm. Further research through resequencing of the target regions using Sanger sequencing or targeted genotyping by sequencing methods, such as SeqSNP (LGC Genomics) will shed light on the factors behind the failure.
Genetic polymorphism research remains key for the understanding of variability in organisms' genetic makeup, and it is usually a prerequisite for the analysis of genetic variation for use in conservation and practical plant breeding. The development of genomic resources and tools for a crop is a crucial step both for the determination of genetic diversity and for the development of DNA markers associated with desirable traits for use in marker-aided breeding, and such resources and tools are often lacking for underutilized and minor crops. In the present study, useful and novel genetic information was gathered for noug despite the moderate success rate of the KASP genotyping.

The SNP/KASP Markers
SNP loci can be bi-, tri-, or tetra-allelic [60,61] but bi-allelic SNPs are the most commonly used because of their abundance and simplicity for use in genetic analyses. However, the level of polymorphism of bi-allelic SNPs may be lower, on average, when compared with other types of SNPs. Similar to other types of molecular markers, transcriptome-based SNPs are generally less polymorphic than non-genic SNPs. In this study, only 48.4% of the genotyped loci were polymorphic, which is not surprising as the SNPs are located in the exons of their corresponding genes and a limited number of accessions were analyzed. Polymorphism information content (PIC) and heterozygosity are common measures of polymorphism of a marker locus [62,63]. In the present study, the PIC of each SNP locus was calculated according to Hildebrand et al. [62]. In this approach, the maximum PIC value for bi-allelic SNPs is 0.375, and it is obtained when both alleles had a frequency of 0.5. The SNPs used in this study are bi-allelic, and their PIC values ranged from 0.01 to 0.37 with a mean of 0.24. Fifty percent of these SNP loci have a PIC value of more than 0.25 and hence are highly informative, and can be prioritized for various applications including for genetic diversity analysis of wider noug genetic resources and its weedy/wild close relatives. Two SNP-based studies in rice cultivars also reported a similar range in PIC values with a mean of 0.23 [64] and 0.28 [65].
In this study, 30% of the SNP loci have a minor allele frequency of less than 0.1, and as a result, the overall mean effective number of alleles was slightly below 1.5. The average observed (Ho) and expected (He) heterozygosities across the 202 loci were 0.25 and 0.29, respectively. In a study on a single noug population conducted using 43 EST-SSR markers, Dempewolf et al. [18] reported Ho and He values of 0.49 and 0.54, respectively. A separate study using a subset of these EST-SSR markers in 29 noug populations resulted in slightly lower values (Ho = 0.40 and He = 46) [22]. The average Nei's gene diversity (Hs), an equivalent of expected heterozygosity, estimated based on random amplified polymorphic DNA (RAPD) markers [19] and amplified fragment length polymorphism (AFLP) markers [20] were 0.18 and 0.21, respectively. Similar levels of variation were obtained in Guizotia scabra, a closely related wild/weedy species [66]. These values are a bit lower than values obtained in the present study although the total number of alleles in the RAPD and AFLP based studies were 376 and 966, respectively [19,20]. This is partly because both RAPD and AFLP are dominant markers that generally underestimate the polymorphism level of a marker locus. The results reflect the advantage of co-dominant markers, such as SNPs and SSRs over dominant markers as have been reported in various previous publications [67][68][69]. On the other hand, the higher genetic variation reported in an EST-SSR (multi-allelic)-based study in noug [22] when compared to the present study (bi-allelic SNPs) is likely because of a higher effective number of alleles per locus (2.2) and a larger sample size of the former.
The fixation indices (F IT , F IS , and F ST ), also referred to as F-statistics, are measures of inbreeding in terms of total population (T), sub-populations (S), and individuals (I) for each locus [70,71]. All three indices have a maximum value of one. Negative values of F IS and F IT indicate excess heterozygosity and a value of one for these indices indicates 100% homozygosity in each sub-population. For F ST , a measure of differentiation of sub-populations, the maximum value is attained when sub-populations are fixed for different alleles. In the present study, each accession is considered as a sub-population and the 24 accessions together form a total population. Noug is a strictly outcrossing species [11,44,72], and hence the overall observed heterozygosity (Ho) is expected to be higher than expected heterozygosity (He) if all other HWE assumptions are met. However, the mean Ho was less than the mean He for the polymorphic loci in the present study ( Table 2; Table S3). Similarly, Ho was less than He, on average, in an EST-SSR-based study in noug [18,22]. Large variation was observed in fixation indices among the SNP loci. For example, F IS values ranged from −0.90 to 1.00, indicating that some loci are in a state of heterozygote excess whereas some other loci lack heterozygotes. The mean values of F IT , F IS , and F ST revealed in the present study were 0.13, 0.21, and 0.10 in that order. The values of these indices were 0.19, 0.15, and 0.04 in an EST-SSR based study [22]. The results suggest the overall low but significant heterozygote deficiency and low population differentiation in noug. The HWE test in this study revealed that about half of the loci (49.3%) showed significant deviation from HWE ( Figure 2; Table S3). Interestingly, the vast majority of these loci (88%) showed heterozygote deficiency and only 12% showed heterozygote excess. The result clearly showed that different loci are under different kinds and levels of selection pressure, and other evolutionary forces, with overall heterozygote disadvantage in noug. Given that the study is based on genic SNPs, such a result is not unexpected.
Considering the 16 loci with a highly significant deviation from HWE (Table 1), 10 of the 12 loci that showed heterozygote excess lacked one of the two homozygous genotypes although their minor allele frequency (MAF) was as high as 0.47. This suggests that one of the two alleles in each locus makes the homozygous genotypes less fit, or the locus is in linkage disequilibrium (LD) with other locus or loci with a significant fitness value within its corresponding gene or other genes nearby. Among these loci, 3143A and 3143B are within the coding region of the 17.8 kDa class I heat shock protein-like gene. The SNP at the locus 3143A resulted in a Serine/Arginine non-synonymous substitution of the 99th amino acid of the protein coded by this gene. The 17.8 kDa class I heat shock protein is one of the small heat shock proteins in plants that are produced in response to high temperature stress [73]. Hence, further analysis may reveal variation in the response to heat stress among different genotypes of this locus. The SNPs that resulted in synonymous substitution are likely in LD with other locus or loci that affect the fitness of individual genotypes. Similarly, all four loci with heterozygote deficiency (Table 1) lacked heterozygous genotype despite having an MAF ranging from 0.19 to 0.47. However, all four resulted in synonymous amino acid substitution suggesting that these loci are tightly linked to a locus where strong heterozygote disadvantage is manifested. Detailed studies on the genes harboring these 16 SNPs using diverse noug germplasm may shed light on their functions.

Genetic Variation within Accessions
The 24 accessions used in the present study showed a relatively narrow range of variation in terms of percent polymorphic loci (PPL). The accessions with the lowest (NG108, 72%) and highest PPL (NG092, 85%) were from northeastern and eastern Ethiopia, respectively. Interestingly, both accessions are from areas less known for noug cultivation. In terms of genetic variation within accessions (I, He, and uHe), NG108 and NG106 are the lowest whereas NG092 and NG095 are the highest. These landrace populations are from different regions and hence the levels of genetic variation within landrace populations cannot be attributed to regions of cultivation. Similarly, NG092, NG095, and NG106 are landrace populations collected from middle-altitudes (Table S1), and the mean values of I, He, and uHe are similar for the three altitudinal groups, and hence altitude does not seem to have a significant effect on the extent of genetic variation. The estimates of genetic variation within accessions (I, He, and uHe) for the two cultivars (Fogera and Shambu), are close to the overall mean values of each parameter. Noug breeding in Ethiopia involves mass and recurrent selection [17]. Seemingly, the breeding methods used did not significantly affect the genetic variation of these cultivars. It could also be the case that the genetic variation in these cultivars has been increased (after their release) as a result of unintended gene flow from landraces grown in adjacent areas during seed multiplication or regeneration of the cultivars by the Ethiopian Institute of Agricultural Research at their field sites.
The 24 accessions showed several folds of variation in terms of fixation index (F) ranging from 0.013 (NG099) to 0.162 (NG092). Accession NG092 differs from most of the accessions not only in its higher genetic variation but also in having a larger deviation from HWE on average. On the other hand, the accessions are quite similar in terms of showing a low level of variation in diversity parameter Theta (H) that measures nucleotide diversity based on effective population size and mutation rate [52]. The similarly low Theta (H) values for the accessions are related to the case that they have a small variation in PPL, similar effective population size, and each accession has only two alleles at a polymorphic locus (bi-allelic SNPs).
Analysis of average evolutionary divergence over sequence pairs within populations using the Tajima-Nei model [53] produced interesting results. In order to evaluate the effect of allele frequencies in estimating evolutionary divergence between individuals within populations, this analysis was separately conducted for loci with MAF equal to or above 0.3 (EAED 2 ) and loci with MAF below 0.3 (EAED 3 ), in addition to their combined analysis (EAED 1 ). A slightly higher positive correlation (r = 0.79) for EAED 1 vs. EAED 2 than for EAED 1 vs. EAED 3 (r = 0.75) suggests that loci with MAF equal to or above 0.3 may have a higher effect on overall evolutionary divergence within populations than loci with a MAF below 0.3. The lack of correlation between EAED 2 and EAED 3 suggests that the selection of noug landrace populations for conservation and breeding should consider SNP loci with a MAF below 0.3 and equal to or above 0.3 separately, as different populations may be more valuable for conserving higher genetic diversity or in harboring desirable traits in the case of the two groups of loci. For example, accession NG111 is the highest in EAED 2 but the second-lowest in EAED 3 , revealing that it has lower frequencies for MAF below 0.3 when compared to almost all other accessions. This is different from the case in accession NG103, where all three EAEDs are the lowest.
Noug accessions from low altitudes areas (1400 to 1680 masl) had the lowest EAED 2 (0.447) and highest EAED 3 (0.205) compared to the other two altitudinal areas. The results suggest that low-frequency alleles are more common in the lowland than in the highland areas of noug cultivation. This provides a slightly higher weight for lowland areas as sources of new alleles than higher altitude areas. The comparison of the two cultivars and the breeding population on one hand and the landrace populations on the hand clearly indicate that the breeding process did not have negative effects on the overall genetic divergence between individuals with populations, including for loci with a MAF below 0.3. This is probably because the loci included in the present study do not have an effect on the traits targeted for breeding and hence the alleles were not differentially selected.

Genetic Variation among Accessions and Population Structure
Partitioning the total genetic variation into the within and among populations components is an important step in understanding the genetic structure of populations and their adaptation to local environmental conditions. Outcrossing species generally tend to have higher genetic variation within populations than among populations, which is partly attributable to higher gene flow between populations than in the case of self-pollinating species [74,75]. Both the present study and previous DNA marker-based studies in noug [19,20] revealed higher genetic variation within populations than among populations. In the RAPD [19] and AFLP [20] based studies, about 35% and 23%, respectively, of the total variation accounted for population differentiation. In the present study, however, only 4.5% of the total variation differentiated the populations. The significantly lower population differentiation in the present study is mainly because it is based on transcriptome derived bi-allelic SNP markers that are generally more conserved than RAPD and AFLP-based markers.
In an EST-SSR-based study, Dempewolf et al. [22] reported a significant population differentiation accounting for 6% of the total variation, which is quite similar to the result of the present study. An EST-SSR based study in the Ethiopian potato (Plectranthus edulis), an outcrossing species, and a clonally propagating crop, also revealed very low genetic differentiation (2.5%) among populations [76]. Generally, outcrossing species have lower variation among populations than predominately self-pollinating species, as revealed using different marker systems in crops such as sorghum (83% [77,78]; 70% [79,80]), common beans (95% [81]), field peas (41% [82]), and durum wheat (31% [25]). However, it should be noted that self-pollinating species can have a low population differentiation when factors other than the mating system affecting the population structure are strong. For example, variation among populations accounted for only 13% and 6% of the total variation in arabica coffee [83] and korarima [84], respectively.
In agreement with previous noug research [19,20], there was no significant differentiation between the three altitudinal groups of accessions in the present study, as revealed by hierarchical AMOVA. However, alleles specific to each altitudinal group were present (Table S4). Four, three, and one specific allele(s) were recorded for low (1400 to 1680 masl), middle (1820 to 1968 masl) and high (2045 to 2590 masl) altitudinal groups. It is interesting to note the presence of more specific alleles in low-altitude areas than in high-altitude areas despite the fact that noug cultivation is more prominent in high-altitude areas. The result suggests the importance of including low-altitude populations in the noug breeding program. Similarly, there was no significant differentiation between the six regional groups of accessions although the grouping was made under the assumption that higher access to gene flow exists within the groups than among the groups. The marginally significant differentiation between accessions from the Oromia region and Amhara-Tigray region may suggest a slightly higher level of germplasm exchange within the regions than the nation-wide average.
The cluster analysis at the individual genotype level produced three major clusters ( Figure 3). However, there was no clear clustering pattern of genotypes according to their accession, which is in line with the low population differentiation obtained through AMOVA. The lack of significant differentiation among altitudinal and regional groups revealed through AMOVA was also evident in the cluster analysis at the level of accessions (Figure 4). This is the case even when only loci with a MAF below 0.3 (loci close to fixation) was used, suggesting a wide distribution of low-frequency alleles. In the PCoA that explained 44% of the total variation in its first two principal axes, the accessions were tightly clustered together with the exception of NG096, NG097, and NG124. Accession NG097 has the highest mean pair-wise F ST (0.076) followed by NG096 (0.074; Table 4). Accession NG097 lacked alleles at two loci that exist in all other landrace accessions (loci 3926 and 12963A). It also lacked a number of alleles present in the vast majority of accessions. Similarly, NG096 lacked alleles common to most accessions at a number of loci. Unlike other landrace accessions, these two accessions were collected from areas where noug cultivation is rare, and their higher genetic differentiation could be the result of isolation by distance [85] by being kept at a household/local community-level for a long time. Accession NG124 is a breeding population developed through crossbreeding of selected genotypes from different landrace populations for improved oil and oleic acid content. Unlike all other accessions, it lacks an allele at a locus (locus 1117) within a gene that codes for tobamovirus multiplication protein 1-like protein. Tobamovirus multiplication protein 1 is required for efficient multiplication of a tobamovirus in Arabidopsis [86,87]. Further research on this locus is necessary to determine its function in noug.
Various approaches have been used to determine the source of individual genotypes in population genetics research [54,[88][89][90]. Pritchard et al. [54] developed a model-based method of population structure analysis for multi-locus genotypic data, in which populations that are characterized by a set of allele frequencies across loci are assumed. In this approach, individual members of predefined populations are probabilistically assigned to a single cluster (inferred population) or receive joint assignment to more than one cluster if the model determines that they are admixed. In this study, we used 24 predefined populations (accessions) to determine the population genetic structure in noug using this model. The analysis using the ∆K method of Evanno et al. [56] showed that the individuals within these predefined populations most likely originated from three genetic populations (K = 3). Interestingly, all individuals across the predefined populations are the results of admixture from the three genetic populations although it is to a different extent (ranging from 13 to 55% on average, data not shown). This and other points discussed above generally show weak population structure in noug due to population admixture caused by strong gene flow between populations via pollen and a step by step nation-wide germplasm exchange.

Conclusions
In the present study, transcriptome sequencing was conducted, and novel SNP markers were identified in noug. More than 50% of these SNPs were successfully converted to KASP markers and used for the genotyping of noug populations collected from wide geographic areas in Ethiopia. Polymorphic SNP/KASP markers were used for population genetics analyses. The loci with high PIC or showing highly significant deviation from HWE should be prioritized for further research in this crop. The study revealed low but significant differentiation between noug populations. Nonetheless, there was generally a lack of, or poor differentiation between noug at different altitudinal ranges or regions level. This indicates strong gene flow between populations grown at different altitudes and regions, particularly between major noug cultivation areas. However, this study also gave clear indications that landrace populations cultivated in isolated areas deviate in genetic diversity from those populations in common noug cultivation areas. Thus, such populations from isolated areas in Ethiopia may be sources of useful mutations, and hence should be considered for their conservation and use in breeding of this oil crop. Overall, the transcript sequences and the SNP and KASP markers developed in this study are highly useful resources for applications such as population genetics analyses and genome-wide association research. However, the markers are relatively small in number and were developed based on only two genotypes. Hence, methods such as RNAseq based on a larger number of diverse germplasm should be applied to develop DNA markers, in the thousands, that can have wide applications by avoiding potential drawbacks associated with ascertainment bias. This approach facilitates the identification of markers associated with traits of interest through genetic linkage and association mapping, for their use in marker-aided breeding.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/11/1373/s1, Table S1: Accession code and type, and altitude, region, and location of collecting sites, Table S2: GenBank accession numbers of the 628 ESTs containing 959 SNPs initially used in the development of the KASP markers, and grouping of the SNP loci according to the results of the KASP analysis, Table S3: Polymorphic information content (PIC), gene diversity (H), fixation indices, gene flow (Nm), observed (Ho) and expected (He) heterozygosity, and p-value for deviation of each locus from HWE, for the 202 polymorphic SNP loci, Table S4