A Simple Strategy for Development of Single Nucleotide Polymorphisms from Non-Model Species and Its Application in Panax

Single nucleotide polymorphisms (SNPs) are widely employed in studies of population genetics, molecular breeding and conservation genetics. In this study, we explored a simple route to develop SNPs from non-model species based on screening a library of single copy nuclear genes (SCNGs). By applying this strategy to Panax, we identified 160 and 171 SNPs from P. quinquefolium and P. ginseng, respectively. Our results demonstrated that both P. ginseng and P. quinquefolium possess a high level of nucleotide diversity. The number of haplotypes per locus ranged from 1 to 12 for P. ginseng and from 1 to 9 for P. quinquefolium. The nucleotide diversity of total sites (πT) varied between 0.000 and 0.023 for P. ginseng and between 0.000 and 0.035 for P. quinquefolium. These findings suggest that this approach is well suited for SNP discovery in non-model organisms and can be readily adopted in a standard genetics laboratory.


Introduction
Detecting and assessing the genetic variation of a given species is one of the fundamental issues in biology. Since Mendel first used phenotype-based genetic markers in his experiments, the identification and use of genetic markers have progressed greatly over the past decades [1]. In particular, a series of molecular markers has been developed thanks to advances in molecular technologies. For example, restriction fragment length polymorphism (RFLP) was the first DNA marker to provide an efficient molecular tool for evaluating the genetic variation of a species [2]. This hybridization-based technique is widely used to detect DNA polymorphisms because it is relatively highly polymorphic, co-dominantly inherited and highly reproducible. In addition, the development of polymerase chain reaction (PCR)-based molecular markers, such as random amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP) and microsatellites, has supplied an array of approaches that reveal a large amount of genetic variation in different organisms [3,4]. For instance, microsatellite markers are broadly employed as reliable DNA markers for multiple purposes across a wide range of species, including QTL tagging, population genetics, molecular breeding and phylogenetic analysis [5][6][7].
In recent years, however, the availability of abundant genetic resources for numerous organisms has been contributing to a transition to the use of single nucleotide polymorphisms (SNPs) [8]. In particular, recent improvements in the cost and accuracy of high-throughput sequencing technologies are revolutionizing the opportunities for producing genetic resources in different organisms [9]. For example, Geraldes et al. [10] identified 0.5 million putative SNPs in 26,595 genes of the model species black cottonwood (Populus trichocarpa) using high-throughput sequencing technology. Similarly, Howe et al. [11] characterized 278,979 unique SNPs from the non-model species Pseudotsuga menziesii by screening a reference transcriptome. Notably, although next-generation sequencing platforms have generated large numbers of SNPs in both model and non-model organisms, some of these DNA polymorphisms lie in duplicated regions of the genome (i.e., different members of the same gene family), which can give rise to paralogous sequence variants (PSVs) and eventually limit the utility of the SNPs. Therefore, there is an urgent need to develop reliable SNPs from single copy nuclear genes (SCNGs) that can be used for applications such as molecular phylogenetics and genetic mapping. To this end, we explored a simple and straightforward approach to characterize SNPs from Panax ginseng C.A. Meyer and P. quinquefolium L. by screening a constructed library of Arabidopsis SCNGs.
Panax L. (Araliaceae), commonly known as ginseng, is a medicinally important genus in the Orient and includes 18 species, 16 from eastern Asia and two from eastern North America [12,13]. P. ginseng is one of the most highly valued medicinal species within Panax. Although P. ginseng was widely distributed in Russia, Korea and China at the beginning of the 20th century, only a few individuals remain in natural environments owing to the over-exploitation of wild resources and the destruction of natural habitats [14,15]. P. ginseng is now listed as a rare and endangered plant in China [16]. P. quinquefolium L. (American ginseng) is likewise a medicinal plant; it is native to North America and widely cultivated in China [17]. Molecular phylogenetic analyses have revealed that P. ginseng and P. quinquefolium are the most closely related species within the genus [12,13]. To explore reliable SNPs from P. ginseng and P. quinquefolium, we developed 16 single copy nuclear genes (SCNGs) from the Panax dbEST of GenBank (http://www.ncbi.nlm.nih.gov/dbEST/index.html) (1 October 2012) [18]. These SCNGs may provide a series of useful molecular markers for future studies of conservation genetics.

Development of SNPs from SCNG Library
The predominant type of molecular genetic marker has changed substantially over the past decades [8]. SNP markers have come to prominence because of their abundance in genomes, low scoring error rates and relative ease of calibration among laboratories [9]. In particular, with the advances in DNA sequencing technologies, SNP markers have contributed greatly to the genetic studies of model organisms. Large numbers of SNPs have been retrieved from Arabidopsis, Oryza and Populus using high-throughput DNA sequencing platforms [19][20][21][22]. Nonetheless, the application of SNPs in non-model species has lagged behind because of the limitations of marker development and the existence of PSVs [23]. Although transcriptome sequencing and reduced representation genomic library sequencing have also generated a number of SNPs from different non-model organisms, using these SNP discovery approaches as standard tools in non-model species remains challenging [24][25][26][27]. The main stumbling block hindering wide adoption of SNPs in non-model organisms is that these next-generation sequencing-based approaches are too expensive for population-level analysis, particularly in studies with large sample sizes, and that it is sometimes impossible to assemble all the short reads without a reference genome. In addition, it is difficult to distinguish sequencing errors and PSVs from true SNPs. In maize, for example, it has been demonstrated that although millions of SNPs were identified, only a small portion of those polymorphisms could be used for the further development of robust and versatile assays [28]. To this end, we explored a simple strategy to develop SNPs from the non-model species P. ginseng by performing a BLAST homology search against the constructed library of Arabidopsis SCNGs.
Accordingly, a total of 22,824 Panax ESTs were analyzed, and 542 of them showed high similarity to the Arabidopsis SCNG references. Forty-five primer pairs were designed from the exon regions of Panax SCNGs, of which 16 produced clear amplicons of the expected size in P. ginseng (Table 1); ten of these were also successfully amplified in P. quinquefolium (Table 2). To verify that the SNPs were actually retrieved from orthologous genes, we analyzed the genetic divergence of all the clones obtained for each putative SCNG. As expected, only a small number of sites showed single nucleotide variation, and almost all of the retrieved SNPs were found at these sites. These attributes suggested that the 16 nuclear genes are likely single copy nuclear genes in P. ginseng and P. quinquefolium. By screening DNA polymorphisms of the 16 SCNGs, we successfully identified 160 and 177 SNPs from P. quinquefolium and P. ginseng, respectively. In addition, the obtained SCNG sequences produced alignments ranging from 278 to 1,339 base pairs (bp) in P. ginseng and from 286 to 853 bp in P. quinquefolium. All of these DNA sequences have been submitted to GenBank under the accession numbers KF529139-KF529528.

Nucleotide Diversity in P. ginseng and P. quinquefolium
These SNPs can be employed to investigate the molecular phylogenetics, population genetics and molecular breeding of Panax species. For example, although several previous studies have employed allozyme, random amplified polymorphic DNA (RAPD), inter simple sequence repeat (ISSR), amplified fragment length polymorphism (AFLP) and microsatellite techniques to investigate the genetic diversity of P. ginseng, these genetic markers are largely from unknown regions of the genome and cannot be readily compared among laboratories, which limits their practical value in further studies [14,15,29]. In this study, we applied these SNPs to evaluate the nucleotide diversity of P. ginseng and P. quinquefolium. Results from the polymorphic loci of P. ginseng revealed that nucleotide diversity ranged from 0.001 to 0.023 for total sites (πT) and from 0.000 to 0.017 for nonsynonymous sites (πNon) (Table 1). Similarly, nucleotide diversity of P. quinquefolium varied from 0.005 to 0.035 for total sites and from 0.000 to 0.079 for nonsynonymous sites (Table 2). Genetic diversity based on SNP markers has also been reported in some crop plants. For example, Haudry et al. [30] employed 21 nuclear genes to investigate the genetic diversity of Triticum turgidum ssp. dicoccum and revealed that this species possessed low genetic diversity (πT = 0.0008). Likewise, low genetic diversity was also found in Zea mays ssp. mays (πT = 0.0064) and Hordeum vulgare (πT = 0.0031) [31,32]. In comparison with these previous studies, our results showed that, although only a small number of individuals of P. ginseng and P. quinquefolium were investigated, both Panax species exhibited a relatively high level of nucleotide diversity at both total (πT = 0.007 and 0.017 for P. ginseng and P. quinquefolium, respectively) and nonsynonymous (πNon = 0.004 and 0.024 for P. ginseng and P. quinquefolium, respectively) sites.
Notably, we found that P. quinquefolium showed higher genetic diversity than P. ginseng at both total and nonsynonymous sites. This suggests that P. ginseng might have undergone a genetic bottleneck during the domestication process. In addition, Wang et al. [33] developed an amplification refractory mutation system (ARMS)-PCR method and successfully applied it to identify ginseng cultivars. Here, our results showed that no haplotypes were shared between P. ginseng and P. quinquefolium, which suggests that these molecular markers could be employed to distinguish the two Panax species.

Samples and DNA Extraction
SNP discovery was assessed in samples from 20 individuals of P. ginseng and ten individuals of P. quinquefolium. Detailed information on the specimens is listed in Table 3. The 20 P. ginseng samples were collected from ten localities, with two individuals per locality. The ten individuals of P. quinquefolium were obtained from two localities. Genomic DNA was extracted from the leaves of each individual using a Plant Genomic DNA kit (TianGen, Beijing, China) following the manufacturer's protocol.

SCNG Library Construction and Primer Design
To obtain SNPs from P. ginseng and P. quinquefolium, a library of SCNGs was constructed based on the database of putative SCNGs (see Supplementary File 1). In detail, available Arabidopsis reference sequences were retrieved from GenBank according to the accession numbers reported by Duarte et al. [34]. Then, all ESTs and genomic sequences of Panax were downloaded from GenBank and aligned against the constructed library of Arabidopsis SCNGs using the Basic Local Alignment Search Tool (BLAST). Aligned EST sequences with a minimum matched query length of 200 nucleotides and an identity of 80% were considered valid hits. To further identify the gene structures of SCNGs in P. ginseng, we searched these ESTs against GenBank using BLASTX (http://www.ncbi.nlm.nih.gov/dbEST/index.html) [35]. The exon-intron boundaries of the SCNGs were determined from available annotated references. Primers were designed from the identified Panax ESTs using the software Primer Premier 5.0 (Premier Biosoft International, Palo Alto, CA, USA).
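The hit-filtering rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it presumes the BLAST results were exported in the standard 12-column tabular format (query ID, subject ID, percent identity, alignment length, ...), it uses the alignment length as a stand-in for the "matched query length", and all function names and the example identifiers are hypothetical rather than taken from the original pipeline.

```python
from typing import Iterable, List, NamedTuple

# Thresholds stated in the text.
MIN_ALIGNED_LENGTH = 200  # nucleotides
MIN_IDENTITY = 80.0       # percent


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int


def parse_tabular_line(line: str) -> Hit:
    """Parse one line of tab-separated BLAST output (assumed column order:
    query, subject, percent identity, alignment length, ...)."""
    fields = line.rstrip("\n").split("\t")
    return Hit(fields[0], fields[1], float(fields[2]), int(fields[3]))


def is_valid_hit(hit: Hit) -> bool:
    """Apply the two thresholds; alignment length approximates matched query length."""
    return (hit.alignment_length >= MIN_ALIGNED_LENGTH
            and hit.percent_identity >= MIN_IDENTITY)


def filter_hits(lines: Iterable[str]) -> List[Hit]:
    """Keep only hits that pass both thresholds, skipping blank and comment lines."""
    hits = (parse_tabular_line(l) for l in lines
            if l.strip() and not l.startswith("#"))
    return [h for h in hits if is_valid_hit(h)]


# Hypothetical example records (identifiers are made up for illustration).
EXAMPLE_LINES = [
    "est_001\tscng_A\t85.2\t350\t52\t0\t1\t350\t10\t359\t1e-90\t320",
    "est_002\tscng_B\t79.9\t400\t80\t0\t1\t400\t1\t400\t1e-60\t210",
    "est_003\tscng_C\t92.0\t150\t12\t0\t1\t150\t5\t154\t1e-40\t180",
]

valid = filter_hits(EXAMPLE_LINES)
```

In this toy input only the first record passes: the second falls below 80% identity and the third aligns over fewer than 200 nucleotides.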

PCR, Sequencing and Gene Function Prediction
The designed primer pairs were then used to amplify the target fragments of P. ginseng and P. quinquefolium. PCRs were performed using an ABI 2720 Thermocycler (Applied Biosystems, Foster City, CA, USA) in a 30 μL total volume containing 20-50 ng template DNA, 1× PCR buffer (Mg2+ free), 2.5 mM Mg2+, 0.6 μM of each primer, 0.2 mM of each dNTP and 1 unit of rTaq polymerase (Takara, Dalian, Liaoning, China). The amplifications were performed under the following conditions: 94 °C for 5 min; 35 cycles of 30 s at 94 °C, 30 s at the annealing temperature of each specific primer pair (Tables 1 and 2) and 90 s at 72 °C; and a final extension at 72 °C for 8 min. All amplified products were separated by electrophoresis on 1.5% agarose gels, purified with the Gel DNA Recovery Kit (Takara) following the manufacturer's instructions and sequenced on an ABI3730 sequencer (Beijing Invitrogen Biotechnology Co., Ltd., Beijing, China). Previous studies have documented that P. ginseng and P. quinquefolium are tetraploid species [36][37][38]. To ensure that all the SNPs were retrieved from orthologous genes, we therefore sequenced more than 10 clones from the same individual for each putative SCNG and analyzed the genomic divergence of the obtained sequences. To further determine the function of the SCNGs, the obtained genomic sequences were searched against the GenBank non-redundant protein database of Arabidopsis thaliana using BLASTX [35] with an expectation value (E-value) < 10^-7. The putative functions of these SCNGs are listed in Table 1.
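One simple way to quantify the divergence among clones of a putative SCNG is the pairwise p-distance (the proportion of differing sites). The sketch below is an assumption about how such a check could be done, not the procedure actually used in this study: it presumes the clone sequences are already aligned to equal length, skips gapped positions, and reports the largest pairwise distance so that unexpectedly divergent clones (possible paralogs or homeologs) stand out. The function names and example sequences are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences,
    ignoring positions where either sequence has a gap ('-')."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return differences / compared if compared else 0.0


def max_pairwise_divergence(clones: Sequence[str]) -> float:
    """Largest p-distance over all pairs of clones (0.0 if fewer than two)."""
    return max((p_distance(a, b) for a, b in combinations(clones, 2)),
               default=0.0)


# Hypothetical aligned clone sequences from one individual.
example_clones = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC"]
example_max = max_pairwise_divergence(example_clones)
```

For a single-copy locus in these tetraploid species, one would expect the clones from an individual to differ at only a few sites; what counts as "too divergent" is a judgment call that the text does not specify, so no threshold is hard-coded here.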

SNP Genotyping and Data Analyses
The DNA sequences obtained from P. ginseng and P. quinquefolium were subsequently screened for SNPs. Initial sequence editing and assembly were performed using ContigExpress (Informax Inc., North Bethesda, MD, USA, 2000). DNA sequences were aligned in ClustalX 1.83 [39] and, if necessary, edited manually in BioEdit 7.0.1 [40]. To evaluate the nucleotide diversity of the two Panax species, nucleotide polymorphisms were analyzed using DnaSP version 5 [41], including the number of segregating sites (S), the number of haplotypes (h) and haplotype diversity (Hd). In addition, we surveyed nucleotide diversity π [42] for total and nonsynonymous sites, for each locus and for the combined dataset separately. Insertions/deletions (indels) were not included in these analyses.
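To make the summary statistics above concrete, the following Python sketch computes S, h, Hd and π for total sites from a small alignment. It is not a reimplementation of DnaSP: it assumes the standard textbook definitions (Hd = n/(n-1) × (1 - Σ p_i²), and π as the mean number of pairwise differences per site), removes every alignment column containing a gap to mirror the exclusion of indels, and does not separate nonsynonymous sites. The function names and the example alignment are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Sequence


def drop_gapped_columns(seqs: Sequence[str]) -> List[str]:
    """Remove every alignment column in which any sequence has a gap ('-')."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]


def diversity_summary(alignment: Sequence[str]) -> Dict[str, float]:
    """Return S, h, Hd and pi (per site) for an alignment of equal-length sequences."""
    if len(alignment) < 2:
        raise ValueError("at least two sequences are required")
    if len({len(s) for s in alignment}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    seqs = drop_gapped_columns([s.upper() for s in alignment])
    n = len(seqs)
    length = len(seqs[0])
    if length == 0:
        raise ValueError("no ungapped sites remain")

    # S: number of columns with more than one nucleotide state.
    segregating = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)

    # h and Hd: distinct sequences and their frequency-based diversity.
    counts = Counter(seqs)
    hap_div = n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))

    # pi: mean pairwise differences, divided by the number of sites.
    pairs = list(combinations(seqs, 2))
    total_diff = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    pi = total_diff / len(pairs) / length

    return {"S": segregating, "h": len(counts), "Hd": hap_div, "pi": pi}


# Hypothetical alignment of four sequences; the last column contains a gap.
example_alignment = ["ACGTACGTA", "ACGTACGTA", "ACGTACGA-", "ACCTACGAA"]
example_stats = diversity_summary(example_alignment)
```

In this toy alignment the gapped last column is discarded, leaving eight sites with two segregating sites and three haplotypes; the six pairwise comparisons differ at seven positions in total, so π is 7/6 differences per pair spread over eight sites.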

Conclusions
SNPs are increasingly being used as an ideal molecular marker in both model and non-model species. Here, we explored an approach for developing SNPs from the SCNGs of the non-model species P. ginseng and P. quinquefolium. Our results suggest that this strategy could also be applied to develop SNPs in other model or non-model species.