Development and Transferability of EST-SSR Markers for Pinus koraiensis from Cold-Stressed Transcriptome through Illumina Sequencing

Pinus koraiensis has significant economic and ecological value in Northeast China. However, due to the lack of suitable molecular markers, only a few available microsatellite markers were developed for further population genetics studies. In this study, for the first time we developed expressed sequence tag–simple sequence repeat (EST-SSR) markers from the cold-stressed transcriptome of P. koraiensis using Illumina Sequencing. We identified a total of 7,235 EST-SSRs from 97,376 sequences, and we tested their transferability among seven related Pinus species. The results showed that trinucleotides were the most abundant type of repeat (1287, 18.74%) excluding mononucleotides, followed by dinucleotides (1284, 18.7%) and tetranucleotides (72, 1.05%). The most dominant dinucleotides and trinucleotide repeat motifs were AT/AT (535, 7.79%) and AAT/ATT (103, 1.5%). The observed heterozygosity (Ho) and expected heterozygosity (He) ranged from 0.002 to 0.986 and 0.017 to 0.743, respectively, and the polymorphism information content (PIC) values and number of alleles (Na) varied from 0.029 to 0.794 and 2 to 23, respectively. A total of 8 natural P. koraiensis populations were divided into two main genetic clusters. Furthermore, nine of twenty polymorphic primer pairs were successfully amplified in seven Pinus species, and at least 80% of the successful P. koraiensis EST-SSR primers could be amplified in more than four species (16, 80%). Combined results for the development of EST-SSR markers in P. koraiensis and transferability among related species would contribute to improved studies on the genetic diversity and population structure in P. koraiensis and phylogenetic relationships among Pinus species. They would also provide a significant source for quantitative trait locus analysis.


Transcriptome Data Capture
The cold-stressed transcriptome data used in this study originated from the study in our laboratory by Wang et al. [23], and the relative transcriptome data was uploaded to the Sequence Read Archive (SRA) public database (No. PRJNA510863) at the National Center for Biotechnology Information (NCBI).  Figure 1). To verify and screen the polymorphic primers, ten representative adult trees in each populations were randomly selected, and all sampled individuals from eight populations were spaced at least 100 m apart. In total, 80 individuals from eight natural P. koraiensis population were collected in this study, including Zhanhe (ZH), Liangshui (LS), Tieli (TL), Liangzihe (LZH), Hongshi (HS), Sanchazi (SCZ), Lushuihe (LSH) and Helong (HL) ( Table S1). The sampled individuals were distributed in the main natural geographical distribution region of P. koraiensis. The collected samples were immediately frozen in liquid nitrogen and then stored at −80 • C. Total DNA was extracted from the fresh needles of 80 individuals, according to the hexadecyl trimethyl ammonium bromide (CTAB) method. The DNA integrity and concentration were further determined by 1% agarose gel electrophoresis and using a K5500Plus micro-spectrophotometer (KAIAO technology development Co. Ltd., Beijing, China), respectively. Finally, the total DNA was diluted to the desired working concentration (25 ng/µL) and stored at −20 • C for PCR analysis, to validate the EST-SSR reliability.

Transcriptome Data Capture
The cold-stressed transcriptome data used in this study originated from the study in our laboratory by Wang et al. [23], and the relative transcriptome data was uploaded to the Sequence Read Archive (SRA) public database (No. PRJNA510863) at the National Center for Biotechnology Information (NCBI).

EST-SSR Locus Detection and Primer Design
The detection and localization of potential SSRs were searched and implemented from 97,376 unigenes (75,061,632 bp) using the MISA tool (http://pgrc.ipk-gatersleben.de/misa/misa.html) [24]. The identification standards of minimum repeats of the SSR motif contained ten mononucleotide, six di-nucleotide, and five tri-, tetra-, penta-, and hexanucleotide motif repeats. Particularly, SSR motifs containing one mononucleotide were considered to be associated with a high risk of error during synthesis of the PCR product and were ultimately removed from the study. All primers for the SSR locus were designed using Primer3 software (http://primer3.sourceforge.net/releases.php). The main stated selection criteria for screening the primers were as follows-GC content from 45-55% with 55% as the optimum; primer length from 18-24 bp with 23 bp as the optimum length; annealing temperature between 55 and 65 • C, with 60 • C as the optimum temperature; and PCR product size ranging from 100 to 300 bp.

PCR Amplification and SSR Validation
To validate the quality of primer pairs, a total of 300 primer pair sequences were randomly selected and synthesized by Beijing Genomics Institute Tech Solutions (Bejing Liuhe) Co., Ltd. (Beijing, China) and used for PCR amplification in a 20-µL volume that included 10 µL 2 × Super PCR Mix (with green dye) (Beijing Genomics Institute Tech Solutions (Bejing Liuhe) Co., Ltd., Beijing, China), 4 µL M13 universal primer (1 µM, 5-TGTAAAACGACGGCCAGT-3), 5 end addition of four fluorescent dyes to all forward primers (ROX, HEX, FAM and TAMRA), 2 µL genomic DNA (25 ng/µL), 0.8 µL forward primer (1 µM), and 3.2 µL reverse primer (1 µM). The PCR amplification conditions were as follows-94 • C for 5 min, followed by 30 cycles at 94 • C for 30 s, 57 • C for 30 s, and 72 • C for 30 s, followed by 8 cycles at 94 • C for 30 s, 55 • C for 30 s, and 72 • C for 30 s, and then a final extension at 72 • C for 10 min, using a TC-96/G/H(b)B Thermal Cycler (Hangzhou Bioer Technology Co., Ltd., Hangzhou, China). Finally, all primer pairs were used for amplification from eighty samples representing different populations for the development and assessment of EST-SSR polymorphisms.

Statistical Analyses
GeneMapper (version 4.1) was used to analyze the original data obtained using high-performance capillary electrophoresis (HPCE). GeneAlEx (version 6.5) [25] was used to calculate the population genetic parameters of each primer pair, including the number of alleles per locus (Na), observed heterozygosity (Ho), expected heterozygosity (He), and effective number of alleles (Ne). Information index and polymorphism information content (PIC) values were calculated using the Cervus software (version 3.0) [26]. In addition, the principal component analysis (PCA) was also performed in GeneALEX (version 6.5). To determine the genetic variation of populations including within populations, among populations within groups and among groups, the analyses of molecular variance (AMOVA) function in GeneAlEx version 6.5 was applied. In total, 999 random permutations were implemented for each sample in the GenAlEx software. The population genetic structure of eight P. koraiensis natural populations was analyzed by the STRUCTURE software (version 2.3.4) [27], based on Bayesian analysis. Ten independent runs were performed for each K value. In brief, the best K value was determined by the method described by [28]. The burn-in period iterations and Markov chain Monte Carlo repetitions for each run (K ranged from 1 to 8) were 100,000 each. The results obtained from STRUCTURE were further analyzed with the software of STRUCTURE HARVEST [29].

EST-SSR Primer Transferability in Related Species
EST-SSR primer pairs developed from the EST sequences had higher transferability in related species than SSRs in genomic DNA. In the present study, a total of seven Pinus species were used to detect and verify the transferability of 20 EST-SSR primer pairs in related species, including two-, three-, and five-needle pines. The species used in this study were as follows-Pinus tabuliformis and Pinus sylvestris  46 12 E). In this study, 32 samples were used to test the transferability of EST-SSR primers in Pinus species, and the samples collected from Pinus sibirica, Pinus pumila, Pinus wallichiana, Pinus parviflora, Pinus bungeana, Pinus tabuliformis, and Pinus sylvestris was 5, 5, 5, 2, 5, 5, and 5, respectively. The DNA extraction and PCR amplification methods for all Pinus species were performed as described above, with appropriate adjustments of the annealing temperature, template DNA concentration, and cycle times.

Results and Discussion
In total, 97,376 sequences containing 7235 SSRs were detected and identified from the transcriptome data, with 509 EST sequences containing more than one SSR and 354 (4.9%) SSRs present in the compound formation. Furthermore, the EST-SSR frequency was 6.84%, and the distribution density was 75.06 per Mb. The tri-nucleotide (1287, 18.74%) was the most abundant type of repeat excluding the mononucleotide, followed by the di-nucleotide (1284, 18.7%), tetra-nucleotide (72, 1.05%), hexa-nucleotide (53, 0.77%), and penta-nucleotide (25, 0.36%) ( Table 1). In total, the repeat types were mainly distributed from mono-to tri-nucleotides, which accounted for 97.82% of the SSRs. This result was similar to the findings reported by Zhang et al. [30]. The frequencies of EST-SSRs for different tandem repeats are shown in Table 2. The lowest repeat motif for each SSR was five, and the largest number of tandem repeats was ten (1864, 27.14%), followed by 11 (937, 13.64%), 5 (929, 13.53%), 6 (709, 10.32%), 12 (530, 7.72%), 7 (346, 5.04%), and 13 (345, 5.02%); the other tandem repeats were less than 20%. The main tandem repeats were 10-13, reaching 53.52% of the SSRs ( Table 2). The main motif type was 40 in 7235 SSRs. Generally, the EST-SSRs in many plants are mainly dinucleotides and trinucleotides, with the exception of single nucleotides. Remarkably, A/T (4090, 59.55%) was the dominant motif, followed by AT/AT (535, 7.79%) and TA/TA (394, 5.74%), which accounted for 73.08% of the total number of SSRs (Table 3). These results were similar to those found by Jia, which further support the conclusion that A/T and AT/AT are the main unit types (accounting for >50%) in P. koraiensis [31]. Among the SSRs, 535 and 103 contained the most abundant di-and tri-motif types, respectively. In contrast, the numbers of CG/CG and ACT/AGT were very few with approximately equal frequencies, which might indicate that CG/CG and ACT/AGT are rare in gymnosperm plants [32].    To determine the amplification efficiency of EST-SSR, we previously found that because of the influence of primer dimers, annealing temperature and number of introns, some EST-SSRs usually failed to amplify the expected product. In the present study, three hundred 1,884 primer pairs were randomly selected to detect and evaluate their polymorphisms across 4 individuals from different geographical populations in five regions. Furthermore, 96 (32%) of 300 amplified single and bright DNA bands ( Figure S1) were further used to detect the polymorphism level across 16 P. koraiensis individuals sampled from five main distribution regions in Northeast China and 119 (39.67%) amplified PCR products that were larger or smaller than expected. The amplified PCR products were not found in 85 (28.33%) primer markers at various annealing temperatures. Using the same DNA templates, the results showed that 20 (6.7%) of the 96 amplified single and bright DNA bands successfully amplified the expected product size and could be used to conduct the population and conservation genetics studies; the remaining primer pairs were identified as monomorphic. Of the 20 amplified EST-SSRs, the di-(10, 50%) and tri-nucleotides (10, 50%) had the same number of repeat type and polymorphism frequencies; the other types of repeat units were not found in the present study.
The polymorphism parameters and sequence information for the amplified ESR-SSRs among 80 natural P. koraiensis individuals are listed in Table 4 .032 for PIC, respectively. These results showed a high degree of genetic variation of each locus within the population, which is of great practical significance for the evaluation and selection of P. koraiensis germplasm resources. In general, the genetic diversity level of a species is affected by a number of factors, including the biological characteristics, population size, number of genetic markers, and geographical distribution range [33][34][35]. The twenty polymorphic EST-SSR primer pairs were developed based on the cold-stressed transcriptome of P. koraiensis, and a relatively moderate level of genetic diversity was detected using twenty EST-SSR markers in eighty P. koraiensis natural individuals, with mean Na, He, and PIC values of 6.45, 0.311, and 0.404, respectively. These results were similar to those found in fifty-three P. koraiensis germplasm resources of four seed orchards (I = 0.654, PIC = 0.325), as determined by six pairs of highly polymorphic EST-SSR primers [36]. However, in this study, the genetic diversity of P. koraiensis was lower than sixty open-pollinated families (I = 0.868, PIC = 0.450), determined using sixteen pairs of EST-SSR markers [31], and 204 samples of seven natural populations determined by nine polymorphic primers (Na = 6.7, He = 0.610) [12]. The main reason for this finding was that only 10 individuals for each natural populations were used to detect the genetic diversity level. Another reason might be differences in population types. Generally, the genetic diversity within cultured populations might maintain a relatively lower level compared with natural populations. The genetic diversity of P. koraiensis in this study was lower than naturally distributed Pinus species, including Pinus tabulaeformis [37] (mean Na Here, the reasons for the differences in levels of genetic diversity among different species included the following-(1) the mating mode, e.g., pollination methods used for the breeding population would affect the level of genetic diversity, where generally, with the increase in the self-crossing rate, the population genetic diversity tended to decrease; (2) genetic drift caused a loss or fixation of alleles, which would lead to a decline in the population genetic diversity; and (3) artificial and natural selection. In addition, the genetic diversity of species in narrow natural distribution areas, such as local varieties, endangered species, and endemic species, was generally lower than species with wide natural distribution areas. In our study, P. koraiensis was only naturally distributed in Northeast China, but it had a relatively moderate level of genetic diversity. The main reason for this phenomenon is that more than half of P. koraiensis worldwide is naturally distributed in Northeast China, with abundant genetic resources.
Recently, EST-SSR markers developed using expressed sequence tags have been widely used in the fields of genetic diversity, phylogenetic development, and molecular marker-assisted breeding. At present, the development of universal EST-SSR primers is a cheap and simple approach for species without whole genome and developed microsatellite loci [40]. However, for some species with complex genetic background, such as non-model species, genome-free species, and endangered species, traditional methods for SSR marker development are complex and expensive. Previous studies have found that because of the conservative lateral sequence of microsatellite DNA, EST-SSR primers have good transferability among close and even distantly related species [41,42]. With these crucial characteristics, they have been used for comparative genomics and phylogenetic analysis in several species, such as Theobroma cacao [43], Chrysanthemum morifolium [44], Elymus sibiricus [45] and Taxus mairei [46]. In the present study, 20 successful EST-SSR primers were further tested in Pinus relatives to test the transferability of EST-SSR primers (Table 5). Different genetic relationship showed different levels of transferability for P. koraiensis EST-SSRs. The majority of P. koraiensis EST-SSR primers generated a relatively high amplification efficiency ranging from 0% to 100% in Pinus species. There were 9 (NEPK-40, NEPK-32, NEPK-53, NEPK-72, NEPK-43, NEPK-38, NEPK-150, NEPK-179 and NEPK-184) and 1 (NEPK-34) primer pairs that generated the highest and lowest amplification efficiency, respectively; the other 10 primer pairs showed a moderate level of amplification efficiency. At least 80% of the successful P. koraiensis EST-SSR primers could be amplified in more than four of the Pinus species (16, 80%). Interestingly, at least 70% (5, 71%) of the tested Pinus species had a relatively high level of EST-SSR transferability (more than 70%). Compared to three-and two-needle pine, three five-needle pines showed dramatical levels of transferability (greater than 80%). The main reasons for these cases were as follows-(1) the transferability of EST-SSR primers is related to the genetic relationship between two species; (2) Pinus sibirica, Pinus pumila, and Pinus wallichiana belong to the five-needle group and are widely assumed to have a closer genetic relationship than two-and three-needle species. These results revealed high genome homology among seven Pinus species, and nine of 20 newly developed EST-SSRs from cold-stressed transcriptome of P. koraiensis could be used as universal markers in the Pinus species for further population genetics studies in related species and the identification of functional genes with significant economic traits.  The "+" and "-" indicate successful and failed PCR amplification, respectively.
The population genetic structure of 80 individuals from 8 P. koraiensis natural populations was analyzed by the STRUCTURE 2.3.4 software. In this study, the number of cluster was set from 1 to 8 with 10 repetitions for each run. According to the STRUCTURE assignments, the optimal K value was observed at K = 2 with maximum Delta k value ( Figure 2a); all collected individuals were further divided into two clusters (C1 and C2) (Figure 2b). C1 contained three populations (LS, ZH, and TL), and C2 contained five populations (LZH, LSH, HL, HS, and SCZ). Furthermore, the PCA analysis based on 20 EST-SSR markers was used to further evaluate the population genetic structure and demonstrated that the sampled P. koraiensis individuals were well-grouped into two groups based on the first two principal coordinates (Figure 3). In which, coordinates 1 explained 33.12% of the total variation, and coordinate 2 explained 13.2% of the total variation. The PCA results were consistent with the results of the structure analysis using the same EST-SSR markers and individuals. Notably, previous study reported that the seven natural P. koraiensis populations in Northeast China were also divided into two main groups, including two northern populations and five southern populations [12]. However, in our study, although the LZH P. koraiensis population were closer to the Xiaoxinganling Mountains populations (the northern population), it was eventually divided into the Changbaishan Mountains populations (the southern populations), based on the STRUCTURE and PCA analysis. The main reasons for this were: (1) The LZH population was located on the border between the Xiaoxinganling Mountain and Sanjiang Plain geographically, which might allow it to have a low level of gene flow with the populations of the Xiaoxinganling Mountains. (2) The marginal distribution might change the allele frequency among populations, and further affect the patterns of population structure. (3) Due to the changed climate and frequent human activity, the allele frequency are indirectly influenced by these external environment factors. However, we might be able to draw a basic preliminarily conclusion that geographic populations with similar habitats might be more likely to form the same genetic cluster. Furthermore, according to current study, we might conclude that the natural P. koraiensis populations in Northeast China can be divided into two groups based on the geographic distribution of this species.  The AMOVA analysis among and within populations are shown in Table S2, and the results revealed that 55.38% of the total genetic variance appeared among populations, while only 44.62% genetic variance occurred within the populations. These results indicated that most molecular The AMOVA analysis among and within populations are shown in Table S2, and the results revealed that 55.38% of the total genetic variance appeared among populations, while only 44.62% genetic variance occurred within the populations. These results indicated that most molecular variance was within populations, which was similar to the results for AMOVA analysis in the P. koraiensis (97.65% within populations), based on the nine EST-SSR markers [12]. Gene flow is an important factor affecting the genetic structure of plant population structure. In this study, the lowest Nm value was 0.321 between was shown between LS and HL, while the highest was observed between HL and LZH (Table S3)  The AMOVA analysis among and within populations are shown in Table S2, and the results revealed that 55.38% of the total genetic variance appeared among populations, while only 44.62% genetic variance occurred within the populations. These results indicated that most molecular

Conclusions
In this study, we characterized and evaluated a number of EST-SSR markers derived from the transcriptome of P. koraiensis, and a total of twenty EST-SSR primer markers were verified with abundant polymorphisms that provide new insights into population genetics research of P. koraiensis. We tested the population genetic structure of natural P. koraiensis and found two genetic clusters. Furthermore, nearly half of these EST-SSR primer pairs were transferable across other species in the Pinus species, which would significantly reduce the cost of microsatellite marker development for P. koraiensis and its related species. These SSR primer pairs would provide novel genetic information for molecular breeding and could be used to accelerate the genetic improvement and breeding applications for these Pinus species.

Conflicts of Interest:
The authors declare no conflict of interest.