Genetic Evaluation of Natural Populations of the Endangered Conifer Thuja koraiensis Using Microsatellite Markers by Restriction-Associated DNA Sequencing

Thuja koraiensis Nakai is an endangered conifer of high economic and ecological value in Jilin Province, China. However, studies on its population structure and conservation genetics have been limited by the lack of genomic data. Here, 37,761 microsatellites (simple sequence repeat, SSR) were detected based on 875,792 de novo-assembled contigs using a restriction-associated DNA (RAD) approach. Among these SSRs, 300 were randomly selected to test for polymorphisms and 96 obtained loci were able to amplify a fragment of expected size. Twelve polymorphic SSR markers were developed to analyze the genetic diversity and population structure of three natural populations. High genetic diversity (mean NA = 5.481, HE = 0.548) and moderate population differentiation (pairwise Fst = 0.048–0.078, Nm = 2.940–4.958) were found in this species. Molecular variance analysis suggested that most of the variation (83%) existed within populations. Combining the results of STRUCTURE, principal coordinate, and neighbor-joining analysis, the 232 individuals were divided into three genetic clusters that generally correlated with their geographical distributions. Finally, appropriate conservation strategies were proposed to protect this species. This study provides genetic information for the natural resource conservation and utilization of T. koraiensis and will facilitate further studies of the evolution and phylogeography of the species.


Introduction
Thuja koraiensis Nakai, an evergreen coniferous member of Cupressaceae, is mainly present in the Korean Peninsula and Changbai Mountain area of China [1]. A high seed abortion rate, low seed reproductive efficiency, and overexploitation have reduced the geographic distributions of natural T. koraiensis populations, and their natural regeneration is usually poor [2]. In 1992, the species was

Plant Material and DNA Extraction
A field survey was conducted on the extant T. koraiensis populations from May to June 2016. Five extant natural forestry centers were investigated in Jilin Province, but T. koraiensis samples were found and collected in only three populations. In Lenggouzi (LGZ; 126 • 28 E, 41 • 37 N), more than 1000 trees were identified, and 155 samples were collected. Fifty individuals were found in Sandaogou (SDG; 126 • 28 E, 41 • 51 N), and 26 samples were collected. In Dajinggou (DJG; 126 • 44 E, 41 • 52 N), approximately 600 trees were identified, and 51 samples were collected ( Figure 1, Table 1). To avoid sampling clones, the distances between sampled trees was~50 m. Fresh leaves were dried in silica gel and stored at −80 • C until needed for DNA isolation. The genomic DNA for RAD sequencing was extracted from an adult tree using a Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) following the manufacturer's procedures. Total genomic DNA of all samples was extracted by using the method of cetyltrimethylammonium bromide (CTAB) that was reported by Jin et al. [36]. The DNA was analyzed by agarose gel electrophoresis, and the concentration and purity were checked by a NanoDrop ® ND-1000 (Thermo Scientific, Wilmington, DE, USA) and Qubit 3.0 Fluorometer (Invitrogen, Carlsbad, CA, USA).  LGZ: Lenggouzi; SDG: Sandaogou; DJG: Dajinggou.

RAD Library Preparation, Sequencing, and Assembly
The DNA samples were processed into RAD libraries in a manner similar to that reported by Baird et al. [27]. Briefly, genomic DNA (1 µg) was digested for 60 min at 37 • C in a 20-µL reaction with 20 units (U) of EcoRI; then, the samples were heat-inactivated at 65 • C for 25 min. The P1 adapter was ligated to the products of the restriction reaction after digestion. One microliter of 10-µM P1 adapter was added to the sample, along with 1.5 µL T4 DNA Ligase and 2.5 µL 10× T4 DNA Ligase Buffer, and the reaction was incubated for 60 min at 4 • C, 15 min at 20 • C, and then held at 4 • C. Immediately after, DNA fragmentation was conducted; the VAHTS Fragmentase (Vazyme, Nanjing, China) was mixed thoroughly by gently vortexing for 3 s, and then the tube was transferred to ice. The fragmentation mix contained 25 µL DNA, 3 µL 10× VAHTS Fragmentase Reaction Buffer, and 2 µL VAHTS Fragmentase; and was incubated for 25 min at 37 • C and held at 4 • C. Samples were then separated by electrophoresis through a 2% agarose gel. The DNA fragments from 350 to 450 bp were isolated using a QIAquick Gel Extraction Kit (Qiagen, Hilden, Germany) and plused poly(A) tail. The following components were added to 40 µL of reaction liquid obtained from the previous step: 2 µL nuclease-free water, 5 µL End Repair Reaction Buffer (10×), and 3 µL End Prep Enzyme Mix, followed by incubation at 25 • C for 20 min and 72 • C for 20 min, and was then held at 4 • C. One microliter of 10 µM P2 adapter was ligated, as described above for P1. The ligated product was purified using 1.0× Agencourt AMPure XP Beads (Beckman Coulter Genomics, Danvers, MA, USA).
Polymerase chain reaction (PCR) enrichment of the library was performed in PCR reactions of 50-µL volume that contained 25 µL 2× Kapa HiFi PCR Mix (KAPA Biosystem, Wilmington, MA, USA), 1.25 µL Universal PCR Primer/i5 Prime, 1.25 µL Index (X) Primer, and 2.5 µL nuclease-free water. The PCR conditions were: initial denaturation at 98 • C for 45 s, then 15 cycles of 98 • C for 10 s, 60 • C for 30 s, and 72 • C for 30 s, and a final extension at 72 • C for 7 min. PCR products were purified using 1.0× Agencourt AMPure XP Beads following the manufacturer's protocol, and then pooled together using the Qubit 3.0 Fluorometer (Invitrogen, Carlsbad, CA, USA) and Qsep100 Capillary electrophoresis apparatus (Bioptic, Taiwan, China). Sequencing was performed by Illumina Hiseq 2500 platform (Illumina, San Diego, CA, USA) that generated paired-end reads of 150 nucleotides. The preparation of DNA sequencing libraries and deep sequencing and analysis were performed by Beijing Ori-Gene Science and Technology Corp., Ltd. (Beijing, China). The raw data have been deposited in the NCBI Sequence Read Archive under the accession number SRR5338071.

SSR Identification and Marker Design
Sequences that contained SSR motifs were identified using the MISA search tool (http://pgrc. ipk-gatersleben.de/misa/). The parameters were set as follows: the minimum repeat unit was defined as 10 repeats for mononucleotide motifs; six repeats for dinucleotide motifs; and five repeats for tri-, tetra-, penta-, and hexanucleotide motifs. SSR primers were designed using Primer3 (https://sourceforge.net/projects/primer3/). Primers were designed according to the following criteria: amplified regions that ranged from 100-350 bp, primer annealing temperatures that ranged from 55-60 • C, optimal primer length that ranged from 18-24 bp, and GC content that ranged from 40-60%.
Furthermore, 300 SSR loci were randomly selected and screened in 15 individuals. Primers were synthesized by Sangon Biotech (Shanghai, China). These primers pairs were validated by PCR using the M13-tail technique, where an M13-tagged sequence (5 -TGTAAAACGACGGCCAGT-3 ) was added at the 5 end of all forward primers to allow fluorescent labeling (FAM, HEX, TAMRA, ROX). PCR amplifications, product separation, and scoring were performed as described [38].

Data Analysis
The genetic diversity indices, including the number of alleles (N A ), effective number of alleles (N E ), observed heterozygosity (H O ), expected heterozygosity (H E ), Shannon's information index (I), and fixation index (F), were assessed at each locus and population levels using GeneAlEx version 6.5 [39]. The Hardy-Weinberg equilibrium (HWE) at each locus for each population was determined using GENEPOP version 4.2 [40,41]. The polymorphism information content (PIC) values were calculated using the program PIC_CALC version 0.6 [42]. To estimate the genetic variation among and within populations, the analysis of molecular variance (AMOVA) function of GeneAlEx version 6.5 [39] was performed. The average genetic differentiation index (F st ) and gene flow (N m ) were also calculated using GeneAlEx version 6.5 [39]. Gene flow estimation among the three populations was constructed using Migrate-n version 3.6.11 [43].
To examine the number of differentiated populations, STRUCTURE version 2.3.4 [44] based on a Bayesian analysis was run with K set values of 1 to 8. For each K, 10 independent runs were performed with a burn-in period of 100,000 iterations and a Markov chain Monte Carlo of 100,000 iterations. The most likely K was determined by combining two different approaches proposed by [44,45]. The results from STRUCTURE were processed with the software STRUCTURE HARVEST [46]. The principal coordinate analysis (PCoA) function of GeneAlEx version 6.5 [39] was used for further analysis of genetic structure. The genetic distances (GDs) as described by Nei et al. [47] between the individuals and populations were calculated using PowerMarker version 3.51 [48] and GeneAlex version 6.5 [39], respectively. Based on the resulting genetic distance matrix, a neighbor-joining (NJ) phylogenetic tree was constructed using MEGA version 6.0 [49] with 1000 bootstrap replicates.

Sequencing, Contigs Assembly and Functional Annotation
Adult T. koraiensis tree from LGZ belonging to an existing natural population were used to construct a RAD library for sequencing ( Figure 1). In total, 39.270 M raw reads were obtained from the RAD library. After quality filtering, 36.476 M (92.9%) clean reads were obtained. The Q20 and Q30 was greater than 96.4% and 91.9%, respectively, which indicates that data quality was very high (Table S1). The average GC content was 41.38% and the reads were assembled into 875,792 contigs with an average length of 262 bp (N50 = 274 bp) ( Table 2). Functional annotation of the contigs in T. koraiensis was searched against seven public databases. The BLASTX search against the Nr protein database revealed that 61.5% of contigs were present in unknown species, and the second-top hit species was Vitis vinifera, which accounted for 11.3% of the identified contigs ( Figure S1). Among the T. koraiensis contigs, 50,298 contigs were assigned to one or more GO terms. There were 96,568 contigs categorized under biological processes; 69,335 under cellular component and 66,749 under molecular function ( Figure 2). A total of 26,674 (3.0%) contigs were assigned to 25 KOG classifications. The cluster for general function prediction only represented the largest cluster (12,138 contigs), which is only related to basic metabolic and physiological functions, and only a few contigs were assigned to other clusters (Table S2). In the KEGG pathway-based analysis, 6805 contigs had significant matches and were assigned to 353 biological pathways. Of these, the spliceosome (ko03040, 233 contigs, 1.73%) pathway contained the largest number of contigs, followed by carbon metabolism (ko01200, 229 contigs, 1.70%) and protein processing in endoplasmic reticulum (ko04141, 210 contigs, 1.56%) (Table S3).

Development, Validation and Polymorphism of SSR Markers
SSR loci with mononucleotide motifs were excluded from primer design, because they easily lead to high mismatch ratios during DNA amplification [50]. A total of 899 primer pairs were designed by Primer3. In this study, we randomly selected and tested 300 primer pairs, of which 96 yielded fragments of expected sizes and showed stable amplification (Table S6), and the remaining primer pairs failed to yield any products. As assessed by capillary electrophoresis, 84 (28%) of the SSR markers were monomorphic or had very low levels of polymorphism that exhibited clear, single peaks for each allele, and 12 (4%) were polymorphic for multiple alleles. The allelic PIC values were 0.124-0.798, with six of 12 possessing highly informative scores (PIC > 0.50), five having moderately informative scores (0.50 > PIC > 0. 25), and one possessing a weakly informative score (0. 25 > PIC > 0). The average PIC value was 0.492. Markers BFTK-273 and BFTK-194 had the highest and lowest genetic diversity, respectively, with 0.820 and 0.133 for H E values, respectively, and 0.798 and 0.124 for PIC values, respectively. Ten of the 12 polymorphic markers showed HWE, while the others showed significant deviations from HWE after a Bonferroni correction, which was probably because of small sample size and heterozygote deficiency ( Table 4).

Population Genetic Diversity and Differentiation
In the three natural populations of T. koraiensis, the genetic diversity parameter N A per population varied  (Table 5). These parameters suggested a relatively high genetic diversity within and among T. koraiensis populations. DJG had the lowest values for H E , I, and PIC, suggesting that individuals from DJG had lower genetic diversity than those in the LGZ and SDG populations. The AMOVA analysis revealed that 17% of the total variance occurred among the populations and that 83% was due to variance within three populations ( Table 6), indicating that a higher variation level resided within populations than among populations of T. koraiensis. The F ST and N m values were almost congruent with the AMOVA results. The highest F ST value observed was only 0.078 between SDG and DJG, while the lowest, between LGZ and DJG, was 0.048 ( Table 7). The lowest N m value was 2.940 and was observed between SDG and DJG, while the highest was 4.958 and observed between LGZ and DJG ( Figure 3).

Population Genetic Structure
Analysis with STRUCTURE HARVESTER software showed that ∆K is largest at K = 3, and second largest at K = 2 (Figure 4a). Furthermore, Bayesian methods implemented in STRUCTURE were used to analyze the genetic structure of 232 T. koraiensis individuals (Figure 4b). The individual with the probability higher than a score of 0.80 was considered as a pure one, and lower than 0.80 as an admixture one. The clusters labeled in red, green, and blue included 55 (  To further evaluate the population structure, the PCoA was conducted. The first two principal coordinates explained 21.76% and 11.04% respectively, and explained 32.8% of the total variation (Figure 5a). An NJ phylogenetic tree was constructed based on a matrix of GDs among individuals (Figure 5b). The PCoA and NJ tree separated the sampled individual trees into three clusters corresponding to their geographical origins, although slight admixed features were observed in each cluster, which was consistent with the STRUCTURE analysis. The GDs of the populations were also calculated and ranged from 0.135 (between LGZ and DJG populations) to 0.233 (between SDG and DJG populations) (Table S7).

Discussion
To establish conservation strategies, a good understanding of the genetic diversity and population structure of endangered species is the necessary first step towards a long-term goal of sustainability. However, research into T. koraiensis is still in its early stages, and genetic studies are very limited. For these reasons, using a RAD approach, we performed genome assembly of T. koraiensis for the development of SSR markers that were also employed to determine the genetic diversity in extant natural populations.
In this study, the sequence assembly generated 875,792 contigs with an average length of 262 bp (N50 = 274 bp). The number of contigs for T. koraiensis was higher than other plants in assembly analysis of RAD sequencing studies such as Cynara cardunculus [51], Helianthus annuus [52], Arundinaria faberi [53], and L. tibetica [32], suggesting that there were abundant EcoRI restriction sites in the T. koraiensis genome. The sequencing depth, assembly method, and large genome may have contributed to larger numbers of assembled contigs [30]. The N50 and average length of assembled contigs are longer than in other species such as A. faberi (N50 = 246 bp, average contig length = 240 bp), Yushania brevipaniculata (N50 = 249 bp, average contig length = 240 bp) [53], and Epimedium sagittatum (average contig length = 224.9 bp) [54]. In the T. koraiensis assembly, the total number, N50 value and average length of contigs are all superior to other RAD-sequenced species, which indicates better quality contig assembly [55]. The GC dinucleotide content for T. koraiensis assembled contigs was 41.38%, which is consistent with results from RAD sequencing studies in A. faberi (44%) [53] and S. austriacum (41.5%) [29]; somewhat higher than that in Arabidopsis thaliana (36%, TAIR 10 genome database), V. vinifera [56], C. cardunculus (37.4%) [51], and H. annuus (36.2%) [52], although the enzyme EcoRI (GAATTC) with rich AT may bias the dataset towards low GC content. These genome sequences will provide important information for developing molecular markers and conducting genome research on T. koraiensis.
The functional annotation of T. koraiensis contigs was searched against seven public databases. In the GO analysis, biological process was the most significantly enriched GO term, comprising 50,298 annotations, and metabolic process (30,758) was prominent, indicating that these are involved in important metabolic activities. The conifer species, including T. koraiensis, are adversity resistant, especially in cold resistance, but the mechanisms underlying this adaptation are unknown. In the KEGG pathway annotation, 1133 contigs were identified in signal transduction, such as ATP-binding cassette transporters (ABC transporters) and cyclic adenosine monophosphate (cAMP) and adenosine 5 -monophosphate (AMP)-activated protein kinase (AMPK) signaling pathways, which are relevant to stress responses. Among the environmental adaptation pathways, the circadian clock plays important roles in activating the adaptation of a species to the local environment by regulating physiological activities [57]. In total, 214 contigs were assigned to environmental adaptation, and 24 contigs were involved in circadian rhythm. These functional annotation results are in agreement with other studies in conifers, such as Sabina chinensis [55] and Platycladus orientalis [58]. They will make a valuable contribution to further studies of biological pathways, functions, structures, and interactions of specific genes in T. koraiensis.
Based on these contigs, we successfully developed 37,761 SSR loci. The frequency of RAD-derived SSR occurrence in T. koraiensis is higher than those observed in S. austriacum [29], G. hirsutum [30], and A. hypogaea (2.9%) [31], which indicates that the RAD technique is an effective method for discovering genome-wide SSRs in T. koraiensis and closely related species. In previous studies, di-or trinucleotide motifs were identified as most prevalent for most organisms [59], but there are exceptions [60][61][62]. The distribution of SSRs in T. koraiensis revealed that mononucleotide motifs were the most abundant (44.50%), followed by dinucleotide (39.18%) and trinucleotide (13.56%) motifs. These results agree with other observations in Prunus armeniaca [63], A. thaliana, and Populus trichocarpa [64]. The most abundant dinucleotide motif is consistent with G. hirsutum and S. melongena (AT/AT), but the most abundant trinucleotide motif is different from these species (ATC/ATG T. koraiensis, for AAT/ATT for G. hirsutum) [28,30]). It is interesting that there were no CG/GC repeat motifs and very few CCG/CGG repeats in T. koraiensis, which strongly supports previous studies which indicated that CG/GC and CCG/CGG repeats are very infrequent in a large number of dicotyledonous plants, but the most predominant in monocots [54,65,66]. Among the 16,802 mononucleotide SSRs detected, 14,301 were A/T; these loci have been suggested to fill gaps in linkage maps constructed with higher order SSRs [65]. The frequency and distribution patterns vary depending on many factors including mining tool used, size of sequence dataset, and SSR identification criteria applied, whereas SSR abundance greatly depends on plant species [31,67].
A relatively high genetic diversity was found using the SSR marker analysis in 232 T. koraiensis individuals from three natural populations, with mean N A and H E values of 5.481 and 0.548, respectively. The genetic diversity of T. koraiensis was slightly lower than that of conifer species that have widespread geographic ranges, including Thuja occidentalis (mean N A = 8.83; H E = 0.64 in the core populations; mean N A = 6.64; H E = 0.60 in the peripheral populations) [68], Pinus densiflora (mean N A =14.6; H E = 0.873 within 1883 individuals from 62 natural populations) [69], and P. orientalis (mean N A = 8.945; H E = 0.832 from 21 populations) [70]. Compared with relatively abundant species, especially their widespread congeners, rare species, endemic plants and endangered species usually exhibit lower genetic diversity [71,72]. However, the values determined here represented higher SSR-derived genetic diversity than in other endangered endemic perennial species, such as Paeonia jishanensis (mean N A = 2.376; H E = 0.340 within 236 individuals from 10 extant populations) [73], Taxus wallichiana (mean N A = 4.154; H E = 0.538 within 130 individuals from 13 geographically separate populations) [74], and Pinus dabeshanensis (mean N A = 3.700; H E = 0.360 within 148 samples from four extant populations) [75]. The average value of H O (0.644) was slightly greater than that of H E (0.548), and the value of F was negative at the population level, indicating that excess heterozygosity was existed within the entire natural distribution range of the species.
Moderate population differentiation (pairwise F st = 0.048-0.078) and weak population structure were found among the three natural populations, which may be subject to local habitat conditions. The AMOVA analysis revealed that most (83%) of the total molecular variance was within-population, which are similar to results obtained for other Cupressaceae species including Taxodium distichum (81.47%) [76], T. occidentalis (88%) [68], and Juniperus thurifera (90.50%) [77], based on SSR markers. However, the extent of the population differentiation within T. koraiensis is much greater than that within P. orientalis, in which the genetic variation was only 1.25% among populations, with greatest variation among individuals within populations and within individuals (98.75%) [36]. This variation difference may be explained by the larger population size and wider distribution of P. orientalis compared with those of T. koraiensis. The genetic differentiation into distinct populations is strongly influenced by genetic drift, gene flow, long-term evolution, mating systems, selection, and mutations [78,79]. In this study, the N m value among T. koraiensis populations ranged from 2.940 to 4.958, indicating the frequent flow of genes and continuous distribution of populations. The population structure analysis divided the three populations into three clusters that were significantly related to their geographical origins, and some trees from the populations had a mixed ancestry, especially in the LGZ and SDG populations. According to botanical characteristics, T. koraiensis is wind-pollinated and its pollen can spread over long distances, which may explain the high level of genetic diversity. The studied trees of T. koraiensis were sampled within a 220-km radius of Jilin Province. The relatively small geographic range and anemophilous pollination could produce more frequent gene exchange events and result in little differentiation between populations.
The main goal of conservation is to develop a suitable strategy for maintaining current genetic diversity and ensuring the long-term evolution of an endangered species [80,81]. In the field survey, the numbers of adult trees in the three studied populations were all small, with only 35 adult individuals observed in the LGZ population. Many seedlings of different ages and a number of immature individuals were present in three extant natural forestry centers. In many areas, including the Sanchazi and Xiatianping natural forestry centers, naturally occurring individuals were reportedly present in 2015 [82]. However, no additional individuals were found during this survey (personal observation). We hypothesize that the considerable reduction in population size and the current narrow distribution may be the result of habitat loss and excessive deforestation. The SDG and DJG populations, which exhibited small population sizes and lower genetic diversity, are susceptible to anthropogenic interference. Thus, a conservation strategy should be prioritized for in situ conservation, including protecting as many populations and individuals as possible to avoid further losses of genetic variation and the use of cuttage seedlings to increase the population sizes. In addition, the combination of in situ and ex situ conservation approaches can be critical for safeguarding valuable genetic resources [83]. For the LGZ population, which has the most individuals and the highest genetic diversity, a germplasm resource repository could be established by transplantation and artificial breeding. The three extant natural populations of T. koraiensis of LGZ, SDG, and DJG should serve as a valuable baseline for future monitoring of the effectiveness of a conservation strategy to maintain and restore genetic diversity in this conifer.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2073-4425/9/4/218/s1, Figure S1: Distribution of the top BLASTX hits for the contigs in the NCBI nonredundant protein (Nr) database. Table S1: Summary data of read using FastQC after quality control. Table S2: The Cluster of Orthologous Groups for eukaryotic complete genomes (KOG) annotation. Table S3: The Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation. Table S4: Frequencies of repeat type with repeat numbers in SSRs from T. koraiensis. Table S5: Frequencies of different repeat motifs in SSRs. Table S6: The list of 96 SSRs with stable amplification. Table S7: Nei's genetic distance (below diagonal) and genetic identity (above diagonal) among the three populations.