De Novo SNP Discovery and Genotyping of Iranian Pimpinella Species Using ddRAD Sequencing

: The species of Pimpinella , one of the largest genera of the family Apiaceae, are traditionally cultivated for medicinal purposes. In this study, high-throughput double digest restriction-site associated DNA sequencing technology (ddRAD-seq) was used to identify single nucleotide polymorphisms (SNPs) in eight Pimpinella species from Iran. After double-digestion with the enzymes HpyCH4IV and HinfI, a total of 334,702,966 paired-end reads were de novo assembled into 1,270,791 loci with an average of 28.8 reads per locus. After stringent ﬁltering, 2440 high-quality SNPs were identiﬁed for downstream analysis. Analysis of genetic relationships and population structure, based on these retained SNPs, indicated the presence of three major groups. Gene ontology and pathway analysis were determined by using comparison SNP-associated ﬂanking sequences with a public non-redundant database. Due to the lack of genomic resources in this genus, our present study is the ﬁrst report to provide high-quality SNPs in Pimpinella based on a de novo analysis pipeline using ddRAD-seq. This data will enhance the molecular knowledge of the genus Pimpinella and will provide an important source of information for breeders and the research community to enhance breeding programs and support the management of Pimpinella genomic resources.


Introduction
The genus Pimpinella belongs to the family Apiaceae, together with 250 other genera [1]. Over 46 species of this genus are distributed in the Eastern Mediterranean Region, West Asia, the Middle East, Mexico, Iraq, Turkey, Iran, India, Egypt, Spain and many other warm regions of the world [2]. They are annual or perennial semi-bushy aromatic plants with bisexual flowers with five stamens and two carpels. Species of the genus Pimpinella grow under different climatic conditions and types of soil. The Pimpinella species is known as "Jafari koohi" in Persian, and is commonly used as an aromatic plant and traditionally as a food flavouring, for relief of gastrointestinal spasms and as a carminative digestive [3]. Different species of Pimpinella are famous for their antispasmodic, antioxidant, antimicrobial, expectorant, estrogenic, acariside, insecticidal, anticonvulsant and antifungal properties [4,5]. The valuable therapeutic aspects of Pimpinella are mostly correlated with the existence of sesquiterpenes, phenolic compounds, flavonoids, coumarins, phenylpropanoids and essential oils [6]. The major constituents of the essential oils are kaempherol, quercetin and anethole [7].

ddRAD Library Preparation
ddRAD libraries were prepared according to the method described by Severn-Ellis et al. [28] using 200 ng of extracted DNA. The DNA from each sample was restrictiondigested in a total volume of 18 µL, containing 5 units each of HpyCH4IV and HinfI, as well as NEB 10 × CutSmart Buffer (New England Biolabs, Ipswich, MA. USA). The reaction was incubated at 37 • C for 4 h. Barcoded and common adapters were designed as described by Peterson et al. [26] to complement the restriction overhangs created by HpyCH4IV and HinfI respectively. Each restriction-digested sample was then ligated to a unique 5 barcoded adapter and a common 3 adapter. Ligation reactions were carried out in 40 µL containing the restriction-digested sample, 0.23 µM of the common adapter and 0.5 µM barcoded adapter, respectively, 1U T4 DNA ligase (Invitrogen, Carlsbad, CA, USA), 8 µL T4 ligation buffer (Invitrogen, Carlsbad, CA, USA) and 7 µL of nuclease-free water. Ligation reactions were in-cubated at 22 • C for 2 h and then heat inactivated at 65 • C for 20 min.
Purification and double size selections of the ligated samples was carried out to remove un-ligated adapters and simultaneously select fragments be-tween 250 and 800 bp in size. To remove DNA fragments >800 bp the sample volume was increased to 100 µL by adding 60 µL of nuclease water and then transferred to a 96-well PCR plate containing 50 µL of a 1:4 (0.5×) mixture of AMPure XP Beads (Beckman Coulter, Brea, CA, USA) to PEG buffer (20% PEG w/v, 2.5 M NaCl). After incubation, the beads were collected on a magnetic stand (Invitrogen, Carlsbad, CA, USA). The supernatant was transferred to 20 µL of a 1:1 AMPure XP Beads to PEG buffer (0.7×) mix for the second bead bind to remove fragments <250 bp. The supernatant was removed and the beads con-taining the size selected sample DNA were washed using 80% ethanol. The DNA was eluted in 30 µL nuclease free water.
To enrich the ligated and size selected DNA, PCR amplification was per-formed using 10 µL of size selected DNA, 25 µL of Phusion Hot-Start High-Fidelity Master Mix (Thermo Fisher Scientific, Waltham, MA, USA), 0.5 µM of the PCR1 and 0.5 µM indexed PCR2 primers, respectively. Nucle-ase-free water was added to bring the final volume to 50 µL. The PCR primers used were specific to each adapter and comprised of an Illumina index se-quence and flow cell annealing complimentary sequences [26]. Amplification was carried out at 98 • C for 2 min, 12-18 cycles of 98 • C for 15 s, 62 • C for 30 s, 72 • C for 30 s, final extension for 5 min at 72 • C and held at 4 • C on an Applied Biosystems Veriti Thermal Cycler (Thermo Fisher Scientific, Waltham, MA, USA).
PCR products were purified to remove residual primers and primer di-mers in a 1.5X Ampure XP Bead cleanup step. The DNA concentration of each sample was determined using the Qubit High Sensitivity (HS) assay (Invitro-gen; Waltham, MA, USA). The quality of individual libraries and median frag-ment size was assessed on the LabChip GX Touch (PerkinElmer, Waltham, MA, USA) using the HT DNA HiSens Dual Protocol Reagents (PerkinElmer, Waltham, MA, USA). Equimolar amounts (20-30 nM) of the prepared libraries were pooled. The pooled library underwent a final 0.8X Ampure XP bead cleanup to remove any remaining residual fragments shorter than 200 bp. The concentration of the final bead-cleaned library was determined in preparation for sequencing. Sequencing was carried out at the Garvan Institute of Medical research (Darlinghurst, NSW, Australia), on the Illumina HiSeq XTen (Illu-mina, San Diego, CA, USA) sequencing platform.

Sequence Quality Analysis and Filtering
Sequence reads were de-multiplexed by using the outer dual index bar-code information using STACKS v.2.1 pipeline (Institute of Ecology and Evo-lution, University of Oregon, Eugene, OR, USA) [29] and assigned to sequenced spe-cies. Average read quality and unpaired reads, presence of repetitive sequenc-es and adapter read-through and GCcontent were checked using FastQC v.0.11.4 (Babraham Hall, Babraham Research Campus Cambridge, UK) [30] and multiQC v.1.7 (Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden) [31]. Reads containing the correct restriction sites in R1 and R2 were obtained by searching restriction site sequences in the raw reads respectively. Adapter and sequence trimming were performed us-ing Trimmomatic v.0.36 [32] with default settings and reads were truncated to a uniform 146 bp (R1) and 151 bp (R2), with all shorter reads being discarded.

De Novo Assembly, Read Alignment and SNP Identification
De novo mapping and SNP calling was performed using STACKS. De novo assembly was carried out with a minimum stack size of three reads (m = 3) and a maximum distance between stacks of three (M = 3). Sam-ple-specific stacks were assembled into homologous stacks if they had a maximum distance of three nucleotides (n = 3) between samples. To limit false SNP identification and increase accuracy of downstream analyses, SNPs with a minor allele frequency (MAF) <0.05 and more than two missing genotype calls were discarded using VCFtools 0.1.15 (Wellcome Trust Sanger Institute, Cambridge, UK) [33].

Phylogenetic Tree Construction of Eight Pimpinella Species and Principal Component Analysis
A maximum likelihood (ML) phylogeny was inferred based on the fil-tered SNPs using RAXML v. 8.2.11 (Department of Informatics, Institute of Theoretical Informatics, Germany) [34]. Maximum likelihood searches were conducted in RAXML using a model with ascertainment bias correction (ASC_GTRGAMMA) for sequence data, and a rapid bootstrapping analysis with 100 bootstraps was conducted. A random SNP was retained for each ddRAD locus in order to remove physically linked SNPs for Principal component analysis (PCA). PCA of filtered SNP data was conducted using the R package SNPRrelate (Department of Biostatistics, University of Washington, Seattle, WA, USA) [35]. After converting VCF to GDS format, linked SNPs were re-moved based on co-location using the snpgdsLDpruning function.

Functional Analysis of SNP-Associated Contig
For functional annotation, SNP-associated scaffold sequences that might putatively encode proteins were searched against the non-redundant protein database at the National Center for Biotechnology Information (NCBI) (Be-thesda, Bethesda, MD, USA) with minimum E-value of < 1.0 × e −6 as the threshold. The most comparable sequence matches for each SNP-associated contigs was se-lected and used to find Gene Ontology (GO) terms using EMBL eggnogg mapper (Structural and Computational Biology Unit, European Molecular Bi-ology Laboratory, Heidelberg, Germany) [36]. The three major GO terms, cel-lular component (CC), biological process (BP) and molecular function (MF) were determined with e-value hit filter <1.0 × e −6 . In a final step, details of pathway annotated SNP-associated contigs was performed using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.genome.ad.jp/ accessed on 1 January 2000) [37].

Genotyping-by-Sequencing Library Construction and Sequencing
A total of 334,702,966 raw paired-end reads were generated. After cleaning the raw data, we obtained 311,040,452 reads, which were subjected to further trimming and demultiplexing. A total of 1,279,850 contigs with an effective per-sample mean coverage of 27.7×, were de novo assembled. The minimum and maximum lengths of the contigs were 146 and 545 bp, respectively, with an average of 247 bp. The GC content was in the range of 39.5-40.5%. The highest number of SNPs were obtained using a maximum distance between RAD stacks of 6 and a maximum mismatch of 6 between sample loci in the catalog (Table S2).

SNP Calling and Filtering
We selected 1,279,850 contigs from the de novo assembly with a length of 146-151 bp for SNP calling based on Illumina maximum read length. In total, 1,270,791 SNPs were predicted from 625,507 assembled contigs. We then filtered 2440 high-quality SNPs from 625,507 contigs based on a minimum stack size of three reads (m = 3) and a maximum distance between stacks of three (M = 3) and a maximum distance of three nucleotides (n = 3) between samples. After the quality and depth filtering, a mean of 7.8% of SNPs were removed. A statistical summary of data collected about raw reads, cleaned reads are summarized.
A total of 334,702,966 raw paired-end reads were generated. After cleaning the raw data, we obtained 311,040,452 reads, which were subjected to further trimming and demultiplexing. A total of 1,279,850 contigs with an effective per-sample mean coverage of 27.7×, were de novo assembled. The minimum and maximum lengths of the contigs were 146 and 545 bp, respectively, with an average of 247 bp. The GC content was in the range of 39.5-40.5%. The highest number of SNPs were obtained using a maximum distance between RAD stacks of 6 and a maximum mismatch of 6 between sample loci in the catalog (Table S2).

SNP Calling and Filtering
We selected 1,279,850 contigs from the de novo assembly with a length of 146-151 bp for SNP calling based on Illumina maximum read length. In total, 1,270,791 SNPs were predicted from 625,507 assembled contigs. We then filtered 2440 high-quality SNPs from 625,507 contigs based on a minimum stack size of three reads (m = 3) and a maximum distance between stacks of three (M = 3) and a maximum distance of three nucleotides (n = 3) between samples. After the quality and depth filtering, a mean of 7.8% of SNPs were removed. A statistical summary of data collected about raw reads, cleaned reads are summarized.

Population Structure and Genetic Relationship Analysis
The heterozygosity and percentage of polymorphic loci showed that the observed heterozygosity (H0) was higher than the expected heterozygosity (He). The mean H0 and He of the Pimpinella among the eight species were calculated as 0.67 and 0.49, respectively (Table S3). In addition, the mean percentage of polymorphic loci observed among the eight species were found to be 20.05.
PCA was used to assess the diversity in the Pimpinella species using information from filtered SNP. From 2440 high-quality SNPs detected in Pimpinella species, 774 unlinked

Population Structure and Genetic Relationship Analysis
The heterozygosity and percentage of polymorphic loci showed that the observed heterozygosity (H 0 ) was higher than the expected heterozygosity (H e ). The mean H 0 and H e of the Pimpinella among the eight species were calculated as 0.67 and 0.49, respectively (Table S3). In addition, the mean percentage of polymorphic loci observed among the eight species were found to be 20.05.
PCA was used to assess the diversity in the Pimpinella species using information from filtered SNP. From 2440 high-quality SNPs detected in Pimpinella species, 774 unlinked SNPs were used for PCA analysis. The first two principal components (PC1 and PC2) explained 27.9% and 24.5% of the total variance and they were projected in a two-dimensional graphic ( Figure 2). PC1 separated the P. barbata from other species while PC2 separated P. tragium and P. eriocarpa.
SNPs were used for PCA analysis. The first two principal components (PC1 and PC2) explained 27.9% and 24.5% of the total variance and they were projected in a two-dimensional graphic (Figure 2). PC1 separated the P. barbata from other species while PC2 separated P. tragium and P. eriocarpa.

Functional Analysis of SNP-Associated Scaffolds
A total of 642 contigs were searched against the nr (BLASTX) (NCBI non-redundant protein sequences) via BLAST 2.5.0 (NCBI, Bethesda, MD, USA). with a minimum E-value of ˂ 1.0 × e − 6 as a similarity threshold (Table S4). Similarity results were obtained from 98 of the 642 contigs corresponding to known protein sequences. The remaining 544 contigs did not match with any known protein sequences ( Figure 3A,B).
The 98 BLAST hits mainly matched with Daucus carota genes. Seventeen unknown carrot genes with aligned RAD contigs were character-ized using eggNOG 5.0. (Structural and Computational Biology Unit, Euro-pean Molecular Biology Laboratory, Heidelberg, Germany). In total, from 642 RAD contigs 66.4% (426) and 61.5% (395) were intersected genes and intersected exons, respectively. In addition, functional annotations resulted in 496 GO terms (Table S5). These 496 GO terms were further classified into three functional categories such as cellular component (CC, 198 GO terms), molecular function (MF, 88 GO terms) and biological process (BP, 210 GO terms) [38]. Some contigs matched with more than one GO term, whereas a few matched only one GO term. Cellular component annotations were further sub-classified into five main levels of predominant GO subcategories: (1) cell (category I; GO: 0005623) and cell parts (category II; GO: 0044464) were associated with 25 contigs; organelle (category III; GO: 0043226) was associated with 15 contigs and the organelle part category (category IV; GO: 0044422) was associated with 8 contigs; the membrane category (category V; GO: 0016020) was associated with 6 contigs. Most contigs in the MF category were associated with three main GO subcategories: structural molecule (category I, GO: 0005198) with 6 contigs, binding (category II, GO: 0005488) with 25 contigs and catalytic activity (category III, GO: 0003824) with 14 contigs. Biological processes were also categorized into five subcategories: metabolic process (category I; GO: 0008152) with 22 contigs; cellular process (category II, GO: 0009987) with 18 contigs; response to stimulus (category III, GO: 0050896) with 8 contigs; biological regulation (category IV, GO: 0065007) with 5 contigs and regulation of biological process (category VI, GO: 0050789)

Functional Analysis of SNP-Associated Scaffolds
A total of 642 contigs were searched against the nr (BLASTX) (NCBI non-redundant protein sequences) via BLAST 2.5.0 (NCBI, Bethesda, MD, USA). with a minimum E-value of < 1.0 × e −6 as a similarity threshold (Table S4). Similarity results were obtained from 98 of the 642 contigs corresponding to known protein sequences. The remaining 544 contigs did not match with any known protein sequences ( Figure 3A,B).  Figure 4A). The level of 3 GO terms with 32 functional groups are plotted in Figure 4B. More than half of the genes were not annotated in this study, likely due to the sequence lengths and depth SNP or scaffold coverage mean, as is common in studies performing de novo analysis [39,40]. In addition, some of these genes might be unique to Pimpinella species. Analysis of pathway details from annotation results shows that seven contigs are involved in seven different pathways (Table 1). Based on the greatest number of contigs identified in each functional category, the categories detected most often were energy metabolism.   The 98 BLAST hits mainly matched with Daucus carota genes. Seventeen unknown carrot genes with aligned RAD contigs were character-ized using eggNOG 5.0. (Structural and Computational Biology Unit, Euro-pean Molecular Biology Laboratory, Heidelberg, Germany). In total, from 642 RAD contigs 66.4% (426) and 61.5% (395) were intersected genes and intersected exons, respectively. In addition, functional annotations resulted in 496 GO terms (Table S5). These 496 GO terms were further classified into three functional categories such as cellular component (CC, 198 GO terms), molecular function (MF, 88 GO terms) and biological process (BP, 210 GO terms) [38]. Some contigs matched with more than one GO term, whereas a few matched only one GO term. Cellular component annotations were further sub-classified into five main levels of predominant GO subcategories: (1) cell (category I; GO: 0005623) and cell parts (category II; GO: 0044464) were associated Agronomy 2021, 11, 1342 7 of 14 with 25 contigs; organelle (category III; GO: 0043226) was associated with 15 contigs and the organelle part category (category IV; GO: 0044422) was associated with 8 contigs; the membrane category (category V; GO: 0016020) was associated with 6 contigs. Most contigs in the MF category were associated with three main GO subcategories: structural molecule (category I, GO: 0005198) with 6 contigs, binding (category II, GO: 0005488) with 25 contigs and catalytic activity (category III, GO: 0003824) with 14 contigs. Biological processes were also categorized into five subcategories: metabolic process (category I; GO: 0008152) with 22 contigs; cellular process (category II, GO: 0009987) with 18 contigs; response to stimulus (category III, GO: 0050896) with 8 contigs; biological regulation (category IV, GO: 0065007) with 5 contigs and regulation of biological process (category VI, GO: 0050789) with 4 contigs ( Figure 4A). The level of 3 GO terms with 32 functional groups are plotted in Figure 4B. More than half of the genes were not annotated in this study, likely due to the sequence lengths and depth SNP or scaffold coverage mean, as is common in studies performing de novo analysis [39,40]. In addition, some of these genes might be unique to Pimpinella species.   Analysis of pathway details from annotation results shows that seven contigs are involved in seven different pathways (Table 1). Based on the greatest number of contigs identified in each functional category, the categories detected most often were energy metabolism. A phylogenetic tree using maximum likelihood method was constructed on the basis of identified SNPs. The result revealed that eight Pimpinella species are clearly separated into three clusters. The first cluster is comprised of P. tragium, the second cluster of P. eriocarpa and the third cluster of P. barbata, P. tragioides, P. kotschyana, P. tragium, P. aurea and P. anisum ( Figure 5).

Discussion
Detailed characterization of the genetic structure of species is one of the most important prerequisites in the application of breeding programs and efficient protection and use of plant genetic resources [41,42]. Thus, ddRAD is considered as one of most reliable and powerful approaches to provide more effective SNP genotyping. To date, relatively few studies have examined the genetic structure and relationships between Pimpinella species using a PCR-based approach [4,12,13,43]. Furthermore, only two studies, by Wang et al. [15] and Fereidounfar et al. [44], examined the phylogenetic relationships and these used nrDNA ITS and cpDNA intron sequence data. In this study, we set out to use ddRAD-seq to discover genome-wide SNPs in Iranian endemic Pimpinella species in a cost-and time-efficient manner. To achieve this purpose, we selected the HinfI and HpyCH4IV enzyme combination which led to a sufficient read depth to perform SNP calling across different species in the absence of a reference genome. Only 39.0% of the trimmed contigs were aligned using the reads, possibly due to the stringent parameters used by the STACKS de novo to minimize multiple mapping. A similar limitation was also reported for genome-wide SNP discovery in Capsicum annuum germplasm [45].
In the present study, a total of 334,702,966 raw paired-end reads were produced. This high number of raw sequencing reads among the eight Pimpinella species reflected reduced levels of contamination and unexpectedly low sequence repeats. We applied STACKS de novo for SNP calling, which filled the specifications required for SNP discovery from the trimmed reads in the absence of a reference genome sequence [46]. Using this tool, we eliminated low-quality and contaminated reads using efficient filtering criteria. Regardless of the complexities included, 2440 SNPs were identified in this investigation after the initial quality check. The average SNP frequency, 1.5 SNP per 100 bp of filtered ddRAD loci, was higher than the SNP frequency in Apiaceae family [24]. Identification of SNPs obtained from this study maximizes the probability of finding efficient molecular markers in Pimpinella. Such moderate frequency of SNPs retained after the initial quality check in Pimpinella species indicates the importance of generating genome-wide SNPs, which, critically, include markers from transcribed regions and regulatory regions [17].
A major challenge encountered by all genotyping methods has been the difficulty of aligning true alleles of each single locus in plants for which whole-genome sequences and reference genome are not available, such as the Pimpinella genus. In addition, information on the levels of heterozygosity within a selected species would be of value for elucidating

Discussion
Detailed characterization of the genetic structure of species is one of the most important prerequisites in the application of breeding programs and efficient protection and use of plant genetic resources [41,42]. Thus, ddRAD is considered as one of most reliable and powerful approaches to provide more effective SNP genotyping. To date, relatively few studies have examined the genetic structure and relationships between Pimpinella species using a PCR-based approach [4,12,13,43]. Furthermore, only two studies, by Wang et al. [15] and Fereidounfar et al. [44], examined the phylogenetic relationships and these used nrDNA ITS and cpDNA intron sequence data. In this study, we set out to use ddRAD-seq to discover genome-wide SNPs in Iranian endemic Pimpinella species in a cost-and time-efficient manner. To achieve this purpose, we selected the HinfI and HpyCH4IV enzyme combination which led to a sufficient read depth to perform SNP calling across different species in the absence of a reference genome. Only 39.0% of the trimmed contigs were aligned using the reads, possibly due to the stringent parameters used by the STACKS de novo to minimize multiple mapping. A similar limitation was also reported for genome-wide SNP discovery in Capsicum annuum germplasm [45].
In the present study, a total of 334,702,966 raw paired-end reads were produced. This high number of raw sequencing reads among the eight Pimpinella species reflected reduced levels of contamination and unexpectedly low sequence repeats. We applied STACKS de novo for SNP calling, which filled the specifications required for SNP discovery from the trimmed reads in the absence of a reference genome sequence [46]. Using this tool, we eliminated low-quality and contaminated reads using efficient filtering criteria. Regardless of the complexities included, 2440 SNPs were identified in this investigation after the initial quality check. The average SNP frequency, 1.5 SNP per 100 bp of filtered ddRAD loci, was higher than the SNP frequency in Apiaceae family [24]. Identification of SNPs obtained from this study maximizes the probability of finding efficient molecular markers in Pimpinella. Such moderate frequency of SNPs retained after the initial quality check in Pimpinella species indicates the importance of generating genome-wide SNPs, which, critically, include markers from transcribed regions and regulatory regions [17].
A major challenge encountered by all genotyping methods has been the difficulty of aligning true alleles of each single locus in plants for which whole-genome sequences and reference genome are not available, such as the Pimpinella genus. In addition, information on the levels of heterozygosity within a selected species would be of value for elucidating the underlying population structure, for calculating minimum population sizes required for maintaining genetic diversity and also for future estimations of genetic gain [17,47]. Our method and selected tools have effectively calculated the heterozygosity and excluded only 175 SNPs, or 7.2% of the quality filtered SNPs. This value is lower and higher than has been previously reported by Duangjit et al. [48] and Jo et al. [17] which consisted of 12.7% and 5.9% heterozygous SNPs. The lower frequency of heterozygous calls could be attributed to the bias arising from our relatively small sample size [49]. The number of SNPs identified in this study was restricted by the capture of a reduced portion of the genome following the combination of the HinfI and HpyCH4IV enzymes and stringent SNP calling by stacks. Although, additional SNPs could have been identified by increasing the value of M and n in species with higher level of polymorphism [50,51].
The C/T allele (734, 30%) occurred most frequently between SNP alleles. Similar results were also observed in other species including Allium cepa [18,39], Brassica napus [52], Cucumis melo [53] and oil palm [54]. The transition/transversion ratio in this study was 1.42, which is lower than has been previously reported in Triticum aestivum L. (1.75) [20], Oryza sativa (2.3) [55], Arachis hypogaea (3.2) [54] and Allium cepa (2.53) [18]. This is the first study in Pimpinella to develop genome-wide SNPs using the GBS method without a reference genome. However, de novo assembly was successfully used to design a SNP array, construct linkage maps, high density genetic and transcriptome analyses in Allium cepa [39], Cicer arietinum L. [22] and Hordeum vulgare [56]. Thus, the identified SNPs with associated flanking sequences can be usable for high-throughput validation assays in Pimpinella breeding programs.
The relatively higher frequency of observed heterozygosity than expected for this study suggests that this species may have recently experienced a genetic bottleneck [57]. In Eucalyptus populations, a similar effect was reported with H 0 > H e [58]. The low percentage of polymorphic loci among species indicates significant heterogeneity of genetic architecture among species as well as gene-by-environment effects [59,60].
Maximum likelihood phylogenetic construction based on 2440 filtered SNPs generally grouped the eight species of Pimpinella into three main clusters. Cluster 3 had the most species including P. barbata, P. kotschyana, P. tragioides, P. affinis, P. anisum and P. aurea. The species of P. tragium and P. eriocarpa created a separate cluster. The results of this study suggest that the species within one cluster have the most homology in SNP loci. Iranian Pimpinella has been rarely studied from molecular viewpoint. Consistent with previous data [44], a close correlation was observed between geography and the phylogenetic tree in our analysis. P. tragium is south of the East European native and constituted an early diverging cluster and formed a sister cluster to Southwest Asia Pimpinella. It has been suggested that P. tragium is is the origin of the predominantly herbaceous Apiaceae subfamily Apioideae, and chromosome base number of x = 8. However, all species of Pimpinella from Southwest Asia origin fall within a cluster. Based on nuclear region IITS and cpDNA within the context of the genus Pimpinella, Fereidounfar et al. [44] placed some Iranian species in one tribe. Recently, molecular studies grouped P. affinis, P. aurea, P. tragioides, P. barbata and P. kotschyana based on nuclear region IITS and cpDNA [44]. Our phylogenetic trees were consistent with those previously reported, such that P. tragium and P. eriocarpa are considered a separate group within Pimpinella [44]. One of the greatest advantages of using ddRAD-seq for phylogenetic reconstruction, as opposed to the traditional methods of using one to several genes, is that the ddRAD approach samples data from many loci among the entire genome. This suggests that ddRAD-seq data could be profitably applied to methods for multi-locus species tree estimation. RAD-seq are usually considered applicable for phylogenetic reconstruction in species in which sufficient numbers of orthologous restriction sites are retained among species [61].
Estimation of genetic distance between species is one of useful tools for species registration and protection and parental selection in Pimpinella hybridization programs. Cluster and PCA analysis are appropriate methods in genetic diversity identification, tracing the pathway of the evolution of species, parental selection and center of origin and diversity [62]. In this study, we performed PCA on the SNPs data using different species.
PCA result illustrated three main groups of Pimpinella species. This technique has been used with great success in a number of recent population genetic studies [63][64][65]. The result of ML and PCA may have relative differences from each other due to the use of the only first two components in the PCA. When the two first principal components account for high variation percentage, clustering according to these two components can be a useful method to find the clusters. However, cluster analysis based on PCA is a more explicit indicator of differences among species than cluster analysis without PCA based [66].
Within the detected SNPs, the 58.7% of transition (A/G, C/T) type were found in Pimpinella species. Transversions (A/T, A/C, C/G and G/T) SNP ratios in Pimpinella species can be used to measure the genetic distance between the species. Transitions polymorphism occurred more frequently than transversions, consistent with the nature of these changes [67].
In total, BLASTX searches against the non-redundant protein database identified 98 of the 642 contigs corresponding to known protein sequences, whereas 544 contigs did not match with any known protein sequence, suggesting that our Illumina paired-end sequencing were unique to Iranian Pimpinella species.
Analysis of metabolic pathway from annotation results showed that seven different pathways, including amino acids metabolism, DNA replication, RNA replication and energy metabolism, were identified. Overall, these pathway details from ddRAD-sequencing will provide valuable information for understanding more about Iranian Pimpinella species.
Plant breeders have always been interested in selecting plant materials based on their germplasm collections, long-term consistent assistance and relatedness limitations to support breeding programs. Relatedness analysis helps plant breeders to understand the backgrounds of their plant materials [18]. In studies on Vigna unguiculata [68] and Allium sativum [69], this model was used in genomic selection and association-related studies.
In this study, the data collection and analysis process provided a novel step forward in the use of ddRAD data to address questions in Pimpinella species genomic structure. We show that the reduced representation genotyping approach is an alternative method to whole-genome resequencing with using restriction site associated DNA sequencing (RAD-Seq).

Conclusions
In this study, we identified highly valuable SNP resources from Iranian Pimpinella species using ddRAD-seq analysis. In our literature review, this is the first report of de novo analysis pipeline being used for the discovery of SNPs in Iranian Pimpinella species. Our investigation provides high-quality SNPs, with details of their genetic structure and their annotated functions, and will be useful for deepening our understanding of Pimpinella genomic resource for genetic diversity and relatedness among species, marker-assisted selection programs, trait dissection, breeding and high-density map development.