Next Generation Sequencing-Based Molecular Marker Development: A Case Study in Betula alnoides

Betula alnoides is a fast-growing, valuable indigenous tree species with multiple uses in the tropical and warm subtropical regions of South-East Asia and southern China. It has been shown to be tetraploid in most parts of its distribution in China. In the present study, next generation sequencing (NGS) technology was applied to develop numerous SSR markers for B. alnoides. Of 106,452 clean reads, 64,376 contig sequences contained SSR loci, yielding 164,357 candidate SSR loci. Among the derived SSR repeats, mono-nucleotide was the main type (77.05%), followed by di- (10.18%), tetra- (6.12%), tri- (3.56%), penta- (2.14%) and hexa-nucleotide (0.95%). Short repeats (mono- to tri-nucleotide) together accounted for 90.79%. Among the 291 repeat motifs, AG/CT (46.33%) and AT/AT (44.15%) were the most common di-nucleotide repeats, while AAT/ATT (48.98%) was the most common tri-nucleotide repeat. A total of 2549 primer sets were designed from the identified putative SSR regions, of which 900 were randomly selected to evaluate amplification success and, for those that amplified, to detect polymorphism. Three hundred and ten polymorphic markers were obtained by testing 24 individuals from a natural B. alnoides forest in Jingxi County, Guangxi, China. The number of alleles (NA) of each marker ranged from 2 to 19 with a mean of 5.14. The observed (HO) and expected (HE) heterozygosities varied from 0.04 to 1.00 and from 0.04 to 0.92, with means of 0.64 and 0.57, respectively. The Shannon-Wiener diversity index (I) ranged from 0.10 to 2.68 with a mean of 1.12. Cross-species transferability was further examined for 96 randomly selected pairs of SSR primers, and 48.96–84.38% of the primer pairs amplified successfully in each of six related Betula species. The SSR markers obtained can be used in future studies of population genetics and molecular marker-assisted breeding of these species, particularly genome-wide association studies.


Introduction
Betula alnoides Buch.-Ham. ex D. Don (Betulaceae) is a fast-growing, valuable indigenous tree species with multiple uses in the tropical and warm subtropical regions of South-East Asia and southern China. In China, its main natural distribution is in Yunnan, Guangxi, and Guizhou provinces [1]. The wood has a beautiful texture, moderate density, and low cracking and deformation rates, and is easy to process, making it suitable for floor timber and high-grade furniture making, high-class interior

Distribution of SSR in the Genome
Of the 106,452 clean reads, 64,376 contig sequences (about 60.47%) contained SSR loci. Among these, 25,431 sequences contained only one SSR locus, and the other 38,945 sequences contained at least two SSR loci. In total, 164,357 candidate SSR loci were obtained (Table 1). The total size of the examined sequences was 106,901,537 bp, and the SSR repeat lengths ranged from 10 bp to 75 bp with a mean length of 16 bp. Mono-nucleotide repeats accounted for the largest percentage (77.05%) of all derived loci, followed by di-nucleotide (10.18%) and tetra-nucleotide (6.12%) repeats (Table 1). Tri-nucleotide, penta-nucleotide and hexa-nucleotide repeats were less frequent, at 3.56%, 2.14% and 0.95%, respectively. Our study found thousands of times more microsatellite sequences than were obtained with the DNA library and probe hybridization enrichment methods [31]. For example, only 58 microsatellite-containing fragments were obtained by screening genomic libraries with probes for B. alnoides [26], while 38 microsatellite sequences were obtained for B. pendula [32]. Among the abundant mono- to hexa-nucleotide repeats from the present study, mono-, di- and tetra-nucleotides were the major genomic microsatellite types, while tri-, penta- and hexa-nucleotides made up relatively low proportions. Similar results have been observed in other woody plants based on the same sequencing technology, such as Prunus persica, Carica papaya and Citrus sinensis [33].

Characteristics of SSR Sequences
A total of 106,526 mono-nucleotide repeats (84.11% of the total) fell within the 9-14 repeat number category, and only four had more than 33 repeats (Table 1). All the other nucleotide repeats were mainly within the 3-8 repeat number category: 82.37% for di-, 96.91% for tri-, and nearly 100% for tetra-, penta- and hexa-nucleotide repeats.
There were 291 types of repeat motifs in total. AG/CT (46.33%) and AT/AT (44.15%) were the most common of the four types of di-nucleotide repeat motif, whereas CG/CG accounted for only 0.27% (Figure 2a). These results are in accordance with previous studies reported by Gupta et al. [34] and Powell et al. [35]. In addition, the lower frequency of the CG/CG motif may be associated with cytosine methylation, since methylated cytosine can be converted into thymine by deamination [36]. Among the ten types of tri-nucleotide repeat motif found in the present study (Figure 1b), the most frequent was AAT/ATT (48.98%), and the least frequent was CCG/CGG (0.72%, Figure 2b). The main tri-nucleotide repeat motifs differed from those of other plant species such as maize (CCG/GGC), rice (AGG/TCC) and barley (CCG/GGC) [37]. The repeat motifs AAAT/ATTT (60.85%), AAAAT/ATTTT (32.29%) and AAAAAT/ATTTTT (25.91%) were the most frequent among tetra-, penta- and hexa-nucleotide repeats, respectively (Figure 2c-e).
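The motif labels above (AG/CT, AAT/ATT and so on) group each repeat unit with its rotations and its reverse complement, which is why only four di-nucleotide and ten tri-nucleotide types are possible. The following is a minimal Python sketch of that grouping; the function names are illustrative and are not taken from MISA or from the authors' pipeline.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def canonical(motif):
    """Smallest string among all rotations of the motif and of its reverse complement."""
    variants = [
        s[i:] + s[:i]
        for s in (motif, reverse_complement(motif))
        for i in range(len(s))
    ]
    return min(variants)


def motif_classes(unit_length):
    """Map each canonical motif of the given length to the set of equivalent units."""
    classes = {}
    for letters in product("ACGT", repeat=unit_length):
        motif = "".join(letters)
        if is_primitive(motif):
            classes.setdefault(canonical(motif), set()).add(motif)
    return classes


if __name__ == "__main__":
    for k in (2, 3):
        print(k, sorted(motif_classes(k)))
```

Under this convention CT is reported as AG and ATT as AAT, so counts for both strands and all reading frames of a tract are pooled into one motif class.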
Molecules 2018, 23, x FOR PEER REVIEW

Primer Screening and SSR Markers Polymorphism Detecting
In total, 2549 pairs of SSR primers were designed from all candidate SSR loci, of which 138 pairs were excluded because the repeat regions of the corresponding SSR sequences were compound. The majority of the nucleotide repeats were di-nucleotide (75.53%), followed by tri-nucleotide (20.24%). A total of 900 pairs of primers were randomly selected to evaluate amplification success. Among them, di-, tri-, tetra-, penta- and hexa-nucleotide repeats accounted for 74.11%, 21.00%, 3.33%, 0.67%, and 0.89%, respectively, and 580 pairs of primers (64.44%) amplified successfully with clear and distinguishable microsatellite bands. The optimum annealing temperature of these SSR primers was 60 °C, as determined by a temperature trial with three individuals. Eventually, 310 primers were shown to detect polymorphisms in 24 samples from a natural B. alnoides forest in Jingxi County, Guangxi, China, and are therefore designated polymorphic SSR loci. Characteristics of the 310 polymorphic SSR loci are presented in Table S1. The remaining 270 pairs generated monomorphic markers, showed fixed heterozygosity in the 24 samples, or gave weak fluorescence signals in the amplified products of most samples (Table S2). The fragment length of the amplified products ranged from 105 bp to 306 bp, which was within the expected range (100-400 bp). None of the 310 SSR markers developed in the present study was the same as the 19 pairs developed in a previous study [26]. When we tested the 19 pairs developed by Guo et al. [26] on our 24 B. alnoides samples, all of them were identified as polymorphic loci (Table S3). Phylogenetic relationships among the 24 individuals of B. alnoides were constructed based on the previously reported SSR markers (Figure 3a) and on the SSR markers developed in the present study (Figure 3b). The larger number of markers resulted in very different phylogenetic relationships among the 24 samples.
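The screening funnel reported above can be checked with a few lines of arithmetic. The Python sketch below uses only the counts and percentages stated in the text; the per-type primer counts are values implied by the rounded percentages, not figures reported by the authors.

```python
TESTED = 900        # primer pairs randomly selected for evaluation
AMPLIFIED = 580     # pairs giving clear, distinguishable bands
POLYMORPHIC = 310   # pairs polymorphic in the 24 individuals

# Share of each repeat type among the 900 tested pairs, as reported (%).
TYPE_SHARE_PCT = {"di": 74.11, "tri": 21.00, "tetra": 3.33, "penta": 0.67, "hexa": 0.89}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


def counts_from_shares(shares_pct, total):
    """Nearest integer counts implied by rounded percentage shares."""
    return {name: round(share * total / 100) for name, share in shares_pct.items()}


if __name__ == "__main__":
    print("amplified / tested (%):", pct(AMPLIFIED, TESTED))
    print("polymorphic / tested (%):", pct(POLYMORPHIC, TESTED))
    print("polymorphic / amplified (%):", pct(POLYMORPHIC, AMPLIFIED))
    print("not polymorphic or unscorable:", AMPLIFIED - POLYMORPHIC)
    print("implied primer counts by type:", counts_from_shares(TYPE_SHARE_PCT, TESTED))
```

The amplification rate reproduces the reported 64.44%, the difference between amplified and polymorphic pairs matches the 270 pairs listed in Table S2, and the implied per-type counts sum to the 900 pairs tested.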

The genetic variation revealed by all polymorphic SSR loci for the 24 individuals is also shown in Table S1. The number of alleles (NA) of each locus ranged from 2 to 19 with a mean of 5.14, and 86.45% of loci had 2 to 8 alleles. The observed (HO) and expected (HE) heterozygosities of each locus varied from 0.04 to 1.00 (mean 0.64) and from 0.04 to 0.92 (mean 0.57), respectively. The Shannon-Wiener diversity index (I) ranged from 0.10 to 2.68 with a mean of 1.12. The polymorphism parameters decreased as the repeat unit increased from di- to tetra-nucleotides (Figure 4). However, this trend was not seen for penta- and hexa-nucleotides, perhaps because these types were represented by fewer loci. Overall, the percentage of polymorphic loci with di-nucleotide repeats was larger than that with other nucleotide repeats. This indicates that selecting di-nucleotide repeats might be more effective for screening polymorphic microsatellite markers, which is similar to the results of a previous study on B. platyphylla [38].

Cross-Species Transferability
Ninety-six pairs of SSR primers were randomly selected to examine cross-species transferability among six related species in the genus Betula. The results showed that 48.96% of these SSR primers amplified in B. platyphylla, 73.96% in B. austro-sinensis, 82.29% in B. cylindrostachya, 78.13% in B. fujianensis, 79.17% in B. hainanensis, and 84.38% in B. luminifera (Table S4). These results indicate that the SSR primers had a high rate of transferability among the six species. A high rate of transferability between species in the same genus was also reported by Wang et al. [39], who found that 86.10% of SSR primer pairs developed from other Prunus species were transferable to P. virginiana. This is because the transferability of SSR primers across species is determined by the conservation of the flanking sequences of microsatellites and their evolutionary stability [40]. The genomic differences between related species are relatively small, making SSR primers highly transferable across species. Thus, a large number of available primer pairs can be used in genetic studies and molecular marker-assisted breeding of species in the genus Betula.
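Because every percentage above is a share of the same 96 primer pairs, the number of pairs that amplified in each species can be recovered by rounding. The Python sketch below does this; the counts it returns are implied by the reported percentages and are not copied from Table S4.

```python
PRIMER_PAIRS_TESTED = 96

# Percentage of the 96 primer pairs that amplified in each species, as reported.
AMPLIFIED_PCT = {
    "B. platyphylla": 48.96,
    "B. austro-sinensis": 73.96,
    "B. cylindrostachya": 82.29,
    "B. fujianensis": 78.13,
    "B. hainanensis": 79.17,
    "B. luminifera": 84.38,
}


def implied_count(percentage, total=PRIMER_PAIRS_TESTED):
    """Nearest whole number of primer pairs corresponding to a rounded percentage."""
    return round(percentage * total / 100)


def implied_counts(percentages=AMPLIFIED_PCT):
    """Implied number of amplifying primer pairs for every species."""
    return {species: implied_count(p) for species, p in percentages.items()}


if __name__ == "__main__":
    for species, count in implied_counts().items():
        print(f"{species}: {count} of {PRIMER_PAIRS_TESTED}")
```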

Plant Material and DNA Isolation
In total, 24 individuals of B. alnoides were sampled from a natural forest in Jingxi County, Guangxi, China. Among them, one sample was randomly selected for de novo genome sequencing, three samples were used to screen for successfully amplifying SSR primers and the optimum annealing temperature, and all 24 individuals were used to evaluate the polymorphism of the developed SSR loci. Six species in the genus Betula (Table 2) were used to assess cross-species transferability of these SSR loci. Fresh leaves were collected from each individual and dried separately with silica gel. Total genomic DNA was then extracted using a modified CTAB method [41] and stored at −20 °C. The concentration and purity of the DNA were determined with a NanoDrop 2000 (Thermo Fisher Scientific Inc., Waltham, MA, USA).

Genome Sequencing, SSR Finding and Survey
De novo genome sequencing of B. alnoides was conducted using 1/6 of a run on the Roche 454 GS FLX+ platform (454 Life Sciences, Roche Company, Branford, CT, USA) according to the process of Guo et al. [42]. The raw data were subjected to quality control with the software Qiime (version 1.17, http://qiime.org/), in which: (1) the sequencing adapters were trimmed; (2) low-quality bases were trimmed, with the percentage of Q20 bases required to exceed 90%; and (3) ambiguous bases were deleted.
The MISA software was used to search for SSRs under the following conditions: (1) the repeat number of mono-nucleotides was at least ten; (2) the repeat number of di-nucleotides was at least five; (3) the repeat number of tri-nucleotides was at least four; and (4) the repeat numbers of tetra-, penta-, and hexa-nucleotides were at least three. The characteristics of the SSRs were analyzed at the whole genome level.
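The search conditions above can be illustrated with a small stand-alone scanner. The Python sketch below is not MISA and does not reproduce its handling of compound or interrupted SSRs; it only applies the four minimum-repeat thresholds to perfect repeats, and all names in it are illustrative.

```python
from collections import namedtuple

# Minimum repeat number per unit length, as stated in the search conditions.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}

SSR = namedtuple("SSR", ["start", "motif", "repeats"])  # start is 0-based


def is_primitive(unit):
    """True if the unit is not a repeat of a shorter unit (e.g. 'AA' or 'ATAT')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(sequence):
    """Return perfect SSR tracts in `sequence` that meet the minimum repeat numbers."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        i = 0
        while i + unit_len * min_rep <= len(seq):
            unit = seq[i:i + unit_len]
            if set(unit) <= set("ACGT") and is_primitive(unit):
                reps = 1
                # Count how many times the unit repeats back to back.
                while seq.startswith(unit, i + reps * unit_len):
                    reps += 1
                if reps >= min_rep:
                    hits.append(SSR(i, unit, reps))
                    i += reps * unit_len  # jump past the tract just recorded
                    continue
            i += 1
    return sorted(hits, key=lambda hit: hit.start)


if __name__ == "__main__":
    demo = "CG" + "A" * 10 + "CGT" + "AG" * 5 + "CC" + "AAT" * 4 + "GG" + "AAAT" * 3 + "CC"
    for hit in find_ssrs(demo):
        print(hit)
```

Skipping non-primitive units keeps a run such as A×10 from also being reported as an AA di-nucleotide repeat, so each tract is assigned to its shortest repeat unit.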
The optimum annealing temperature of the SSR primers was selected from 56 °C, 58 °C and 60 °C. The PCR reaction mixture (10 µL) contained 50 ng of DNA template, 150 µM dNTPs, 2.0 µM MgCl2, 0.5 µM forward and reverse primers, 1× PCR buffer (Tiangen Biotech Ltd., Beijing, China), and 0.04 U/µL of Taq DNA polymerase (Tiangen Biotech Ltd., Beijing, China). PCR was performed on an Applied Biosystems Veriti PCR system with the following program: initial denaturation at 94 °C for 4 min; 31 cycles of denaturation at 94 °C for 30 s, annealing at the annealing temperature for 30 s and extension at 72 °C for 30 s; and a final extension at 72 °C for 10 min. The amplified products were stored at 4 °C and detected by 1% agarose gel electrophoresis. Successful amplification was judged by clear and distinguishable microsatellite bands.
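As a quick check on the protocol above, the programmed cycling time and the per-reaction amounts follow directly from the stated concentrations and the 10 µL volume. The Python sketch below performs only that arithmetic; instrument ramping time is not included, and the helper names are illustrative.

```python
REACTION_VOLUME_UL = 10
CYCLES = 31


def programmed_seconds():
    """Total programmed hold time of the PCR program, excluding ramping."""
    initial_denaturation = 4 * 60
    per_cycle = 30 + 30 + 30  # denaturation + annealing + extension
    final_extension = 10 * 60
    return initial_denaturation + CYCLES * per_cycle + final_extension


def amount_pmol(concentration_um, volume_ul=REACTION_VOLUME_UL):
    """Amount in pmol for a final concentration in µM (µM × µL = pmol)."""
    return concentration_um * volume_ul


def enzyme_units(units_per_ul, volume_ul=REACTION_VOLUME_UL):
    """Total enzyme units in the reaction for a final concentration in U/µL."""
    return units_per_ul * volume_ul


if __name__ == "__main__":
    print("programmed time (min):", programmed_seconds() / 60)
    print("each primer (pmol):", amount_pmol(0.5))
    print("dNTPs (pmol):", amount_pmol(150))
    print("Taq polymerase (U):", enzyme_units(0.04))
```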

SSR Markers Polymorphism Detection
Polymorphism of the SSR loci was detected with the M13-tailed primer method [43]. SSR forward primers were synthesized with the M13 sequence (5′-CACGACGTTGTAAAACGAC-3′) added at the 5′ end, and the reverse primers were not tailed. The M13 primer was labeled with one of the fluorescent dyes FAM, NED, VIC, or ROX. The PCR was performed in a reaction mixture (10 µL) containing 50 ng of DNA template, 150 µM dNTPs, 2.0 µM MgCl2, 0.5 µM fluorescent M13-labeled forward primers and reverse primers, 1× PCR buffer (Tiangen Biotech Ltd., Beijing, China), and 0.04 U/µL of Taq DNA polymerase (Tiangen Biotech Ltd., Beijing, China). The PCR program was as described above.
The PCR products were detected using a 3730 XL automatic sequencer (ABI Co., Foster City, CA, USA). Genotyping of the products was performed using GeneMarker V2.2.0 [44]. The number of alleles (NA) and observed heterozygosity (HO) were calculated from the genotyping results in Excel 2010 (Microsoft Corporation, Redmond, WA, USA), and the expected heterozygosity (HE) and Shannon-Wiener diversity index (I) were generated using ATETRA [45], which is designed specifically for the analysis of tetraploids. Nei's genetic distance between the 24 samples was computed with NTSYS 2.1 [46]. The same software was used to obtain a clustering graph of the 24 samples based on Nei's genetic distance by the unweighted pair group method with arithmetic mean (UPGMA).
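The tailing step described above amounts to prefixing each locus-specific forward primer with the M13 sequence. A minimal Python sketch follows; the function name and the example primer are hypothetical and are not taken from Table S1.

```python
M13_TAIL = "CACGACGTTGTAAAACGAC"  # M13 sequence given in the text, written 5' to 3'


def add_m13_tail(forward_primer):
    """Return the forward primer with the M13 tail added at its 5' end."""
    primer = forward_primer.strip().upper()
    if not primer or set(primer) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G and T")
    return M13_TAIL + primer


if __name__ == "__main__":
    # Hypothetical locus-specific forward primer, for illustration only.
    example_forward = "GCTAGGTTCACCATTGAGCA"
    print(add_m13_tail(example_forward))
```

The reverse primer is left unchanged, and the separate dye-labeled M13 primer anneals to the tail, so one labeled oligonucleotide can serve every locus.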
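The per-locus statistics named above can be sketched in Python as follows. This is not ATETRA: it assumes that the full allele dosage of each tetraploid genotype is known, which is generally not the case for SSR data and is the reason dedicated software was used. It also uses the plain formulas HE = 1 − Σp² and I = −Σp ln p without any sample-size correction, and the example genotypes are invented for illustration.

```python
import math
from collections import Counter


def allele_frequencies(genotypes):
    """Allele frequencies from genotypes given as tuples of allele sizes (dosage assumed known)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def number_of_alleles(genotypes):
    """NA: number of distinct alleles observed at the locus."""
    return len(allele_frequencies(genotypes))


def observed_heterozygosity(genotypes):
    """HO: fraction of individuals carrying more than one distinct allele."""
    return sum(len(set(g)) > 1 for g in genotypes) / len(genotypes)


def expected_heterozygosity(genotypes):
    """HE: one minus the sum of squared allele frequencies (uncorrected)."""
    return 1 - sum(p * p for p in allele_frequencies(genotypes).values())


def shannon_index(genotypes):
    """I: Shannon-Wiener index of the allele frequencies, using natural logarithms."""
    return -sum(p * math.log(p) for p in allele_frequencies(genotypes).values())


if __name__ == "__main__":
    # Invented tetraploid genotypes (fragment sizes in bp) at one hypothetical locus.
    locus = [
        (150, 150, 154, 154),
        (150, 154, 158, 158),
        (150, 150, 150, 150),
        (154, 154, 154, 158),
    ]
    print(number_of_alleles(locus), observed_heterozygosity(locus))
    print(expected_heterozygosity(locus), shannon_index(locus))
```

With natural logarithms, I for a locus with k alleles cannot exceed ln k, which is consistent with the reported maximum of 2.68 for a locus with up to 19 alleles.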

Conflicts of Interest:
The authors declare no conflict of interest.