De Novo Transcriptomic Analysis and Development of EST – SSRs for Styrax japonicus

Styrax japonicus sieb. et Zucc. is widely distributed in China with ornamental and medicinal values. However, the transcriptome of S. japonicus has not yet been reported. In this study, we carried out the first transcriptome analysis of S. japonicus and developed a set of expressed sequence tag–simple sequence repeats (EST–SSRs). We obtained 338,570,222 clean reads in total, of which the mean GC content was 41.58%. In total, 136,071 unigenes were obtained having an average length of 611 bp and 71,226 unigenes were favorably annotated in the database. In total, we identified 55,977 potential EST–SSRs from 38,611 unigenes, of which there was 1 SSR per 6.73 kb. The di-nucleotide repeats (40.40%) were the most identified SSRs. One set of 60 primer pairs was randomly selected, and the amplified products in S. japonicus were validated; 28 primer pairs successfully produced clear amplicons. A total of 21 (35%) polymorphic genic SSR markers were identified between two populations. In total, 15 alleles were detected and the average number was 6. The average of observed heterozygosity and expected heterozygosity was 0.614 and 0.552, respectively. The polymorphism information content (PIC) value fluctuated between 0.074 and 0.855, with a mean value of 0.504, which was also the middle level. This study provides useful information for diversity studies and resource assessments of S. japonicus.


Introduction
Styrax japonicus Sieb.et Zucc.(Japanese snowbell) is a deciduous tree species with white bell-shaped flowers that is widely distributed in eastern Asia and China [1], with high ornamental and medicinal values.Owing to increasing urban greening, more attention is being devoted to S. japonicus due to its ornamental value.Despite its importance, limited basic bioinformatics data on S. japonicus are accessible.Previous studies on this species mainly focused on seed germination [2,3] and tissue culture [4].However, little has been reported about the bioinformatics information.
A variety of molecular markers have been developed and largely applied for genetic diversity analysis, DNA fingerprinting, cultivar identification, molecular-assisted breeding, genetic linkage mapping, and quantitative trait loci mapping [5,6].Simple sequence repeats (SSRs) are usually significantly polymorphic and informative, and abundantly spread across the genome in various plants [7].SSRs possess abundant information and have high reproducibility, locus specificity, ease of scoring, high intraspecific polymorphism, and multiallelic and good genome coverage [8].SSR markers have been broadly used in genetic structure and genetic diversity evaluations [9], and based on their source, such markers are split into two categories: EST-SSRs (expressed-sequence-tag SSRs), determined from transcribed RNA sequences, and g-SSRs (genomic SSRs), determined from arbitrary genomic sequences [10].In contrast to genomic SSRs, EST-SSRs have more interspecific transferability, Forests 2018, 9, 748 2 of 14 are easy to identify, and they are more evolutionarily conserved, controlling a certain phenotype [11].EST-SSRs markers have been developed and applied to various plants [12,13].
Next-generation sequencing (NGS) platforms have been used widely for obtaining high-throughput genomic data, of which de novo transcriptome sequencing has become an especially rapid and sophisticated technology [8,14].High-throughput NGS is an accurate technology that is low cost, efficient, and reliable, and it is a powerful tool.Consequently, it can be applied to discover and develop molecular markers for non-model species [15,16].Furthermore, transcriptome sequencing can be applied for the detection of functional genes, analyzing differential gene expression analysis, and transcriptome profiling [17,18].NGS technologies make great contributions to the fields of conservation genetics and evolution by generating an enormous amount of large-scale data at whole-genome and whole-transcriptome levels.
Before NGS, SSR technology was difficult to use due to it being time consuming, having a high cost, and being labor intensive.Furthermore, it is economically costly to identify SSR motifs [19].There is only a limited number of available molecular markers for S. japonicus [1], and polymorphic SSR markers in S. japonicus development have not been reported.These issues have been a constraint for S. japonicus genetic studies.With the ongoing creation of next-generation sequencing platforms, RNA-Seq has slowly superseded gene chip platforms [20], and RNA-Seq has been used to develop SSR markers for Sorbus pohuashanensis (Hance) Hedl.[21], Trifolium pretense L. [22], Psophocarpus tetragonolobus Linn.[23], Cyamopsis tetragonoloba Linn.[24], Rhododendron rex Levl.[25], and Quercus austrocochinchinensis Hjelmq.[26].Thus, the development of several SSR markers based on RNA-Seq for breeding strategies is urgently needed.In this study, we first develop and validate a set of genic SSR markers for S. japonicus through an Illumina sequencing platform, which can provide more information for assessing the genetics and extending the capacity of plant breeding approaches in this particular species.

Plant Materials
For transcriptomic analysis, nine S. japonicus plants with vigorous growth were collected from the National Forest Genetic Resources Platform (NFGR) of Qingdao Agriculture University.Fresh and healthy flowers were gathered from 9 plants and quickly frozen in liquid nitrogen.

DNA and RNA Isolation
Total genomic DNAs were extracted according to Wiland-Szyma ńska [27].The total RNA was isolated following Ghawana [28].To check DNA quality and quantity, 1% agarose gels and an Eppendorf BioSpectrometer (Eppendorf, Saxony, Germany) were used separately.The quantity and quality of extracted RNA and DNA were assessed through agarose gel electrophoresis.For cDNA library preparation, the amount of RNA used exceeded 25 ng.

cDNA Library Construction and Sequencing
To construct the cDNA library, we used the methodology from the Biomarker Technologies Corporation (Beijing, China) according to the company's directions.The following steps were conducted according to the Illumina workflow protocol: First-and second-strand cDNA synthesis, cDNA purification, end repair, adaptor ligation, size selection library enrichment, and purification.The cDNA quality assessment and quantification were performed using a Qubit Fluorometer (Thermo Fisher Scientific, Waltham, MA, America) and a Qseq100 DNA Analyzer (Bioptic Inc., Taiwan, China).cDNA libraries sequencing was performed on the Illumina HiSeq™ 2000 platform (Shanghai Biotechnology Corp., Shanghai, China).

Sequence Assembly and Annotation
The raw data were filtered by eliminating the adaptors, reads with low overall quality (quality score less than 20), and reads with an uncertain identity base.Unigenes were then de novo assembled using trinity [29].Various public databases were used to align the unigenes for annotation, including NCBI Nr, Swissport, KOG, KEGG, and Pfam (E-value ≤ 1.0 × 10 −5 ) to obtain the highest sequence similarity annotation and classification information [25].According to the molecular function, biological process, and cellular component ontologies, Blast2GO was used to obtain GO annotation categories [30].Furthermore, the KOG database was used to categorize and analyze all of the unigenes for predicting their possible functions.Finally, we used the KEGG Automatic Annotation Server to annotate ESTs into KEGG pathways [31].

Identification of EST-SSR Loci and Primer Design
Analysis of potential SSRs was conducted using MISA software (1.0 Version) [21].For the SSRs, the following was selected, ten for the mono-nucleotides, five for di-nucleotides, four for tri-nucleotides, three for tetra-, penta-, and hexa-nucleotides.SSR primers were designed using Batch Primer 3 software (http://probes.pw.usda.gov/cgi-bin/batchprimer3/batchprimer3.cgi).The primer lengths were typically 20 bp (18 to 22 bp), and the annealing temperature was between 55 • C and 60 • C. The Polymerase Chain Reaction (PCR) product size was 100 to 300 bp.The optimum content of GC was approximately 50%, ranging from 40% to 60%.Based on the sequences containing SSRs and the metabolic pathway genes, one set of 60 primer pairs was selected.Primer synthesis was conducted at the Sangon Biotech Company (Shanghai, China).

Validation and Application of SSR Markers
To verify the polymorphism of EST-SSRs, a total of 40 individuals from two populations (Qingdao 20 and Weihai 20) were selected.We extracted DNA from 40 leaf samples according to the CTAB method [32].PCR reactions were performed in the next steps: 5 min of 94 • C pre-denaturation, 45 s of 35 cycles of 94 • C, 45 s of annealing at an optimal temperature ranging from 55 • C to 60 • C, a 45-s extension at 72 • C, then a last 10-min extension at 72 • C. Each of the PCR products were confirmed using 6% polyacrylamide gel electrophoresis [33].The amount of alleles and Wright's F statistics parameters (Fst), observed heterozygosity (Ho), and expected heterozygosity (He) were quantified by POPGENE 1.32 [34], polymorphism information content (PIC) values were obtained by PIC_CALC software [35].

Sequencing and De Novo Assembly
In total, 364,794,740 paired-end raw reads were produced by Illumina next-generation sequencing for S. japonicus.After data filtering, we obtained 338,570,222 clean reads, of which the GC content was 41.58%.The total nucleotide number of the clean reads was 49,540,935,462 nt, with a mean Q20 and Q30 higher than 98.01% and 94.67%, respectively (Table 1).The total number of unigenes was 136,071, and their total length was 83,181,839 bp.The mean length was 611 bp, and the value of N50 was 846 bp (Table 1).Among the 136,071 unigenes, 91,282 (67.08%) ranged in length from 200 to 499 bp, 24,231 (17.81%) from 500 to 999 bp, 8949 (6.58%) from 1000 to 1499 bp, 5302 (3.90%) from 1500 to 1999 bp, and 6307 (4.63%) had a length greater than or equal to 2000 bp (Figure 1).from 1000 to 1499 bp, 5302 (3.90%) from 1500 to 1999 bp, and 6307 (4.63%) had a length greater than or equal to 2000 bp (Figure 1).

Functional Annotation of Unigenes
Multiple databases, including Nr, SwissPort, GO, KOG, KEGG, TrEMBL and Pfam, were used to obtain the corresponding annotation information from the final unigene set (Table S1).In total 71,226 unigenes were favorably annotated in the database.A total of 64,752 (47.6%) unigenes aligned Forests 2018, 9, 748 5 of 14 to the Nr database, and nearly 35,860 unigenes (55.38%) of the sequences revealing greater than 80% similarity (Figure S1).

Functional Annotation of Unigenes
Multiple databases, including Nr, SwissPort, GO, KOG, KEGG, TrEMBL and Pfam, were used to obtain the corresponding annotation information from the final unigene set (Table S1).In total 71,226 unigenes were favorably annotated in the database.A total of 64,752 (47.6%) unigenes aligned to the Nr database, and nearly 35,860 unigenes (55.38%) of the sequences revealing greater than 80% similarity (Figure S1).

Development and Validation of Novel EST-SSRs
Based on the sequences containing SSRs and the metabolic pathway genes, one set of 60 primer pairs was randomly selected, and the amplified products in S. japonicus were validated.Among the selected 60 primer pairs, 28 primer pairs successfully produced clear amplicons.Using four S. japonicus individuals from two populations as PCR models, the outcomes reveal that 21 of the 28 primer pairs were polymorphic, which containing six di-, six tri-, four tetra-, three penta-, one hexa and one compound motifs, while the other nine primer pairs were monomorphic.
We conducted an extra validation for 40 individuals from two populations to confirm the 21 polymorphic genic SSRs developed.Fifteen alleles were noted, the amount of alleles ranged from 2 to 15, with an average six.The PIC value fluctuated between 0.074 and 0.855, and the mean value was 0.504.Observed heterozygosity varied from 0.025 to 1.000, and the average was 0.614.The anticipated heterozygosity was from 0.074 to 0.882, with an average of 0.550; the Fst values were from 0.004 to 0.173, with a mean of 0.046.(Table 3).

Characterization of the S. japonicus Transcriptome
Recently, NGS has been widely applied in many non-model plants [26].In the evaluation of genetic diversity, cultivar DNA fingerprinting, genome organization, and marker-assisted breeding, and in genome-wide association studies [36,37], SSR markers have been widely used.Transcriptome sequencing is successful in developing SSRs in many organisms [10].S. japonicus is widely distributed in China, with ornamental and medicinal value.However, as a valuable species, the transcriptome of S. japonicus has not been reported.Therefore, the lack of available genetic information has limited the exploitation of this plant.In the present evaluation, the transcriptome of S. japonicus was initially documented by utilizing Illumina sequencing technology.A total of 364,794,740 paired-end raw reads were obtained, and 338,570,222 clean reads were generated with a 98.01%Q20 level and the "N" base was 0.01%, guaranteeing the quality of sequencing.Previous studies on Elymus sibiricus Linn.[8] and R. rex [25] with comparable Q20 level and the "N" base have confirmed the high quality of the sequencing.The unigenes numbers assembled were less than those of R. rex (164,242) [25] but more than those of Juglans cathayensis Dode.(116,814) [38], Jatropha curcas L. (115,611) [39], and Salix psammophila C. Wang et Ch.Y. Yang.(71,458) [40].The average unigene length was shorter than that of Quercus kerrii Hu. (719 bp) [26], Pinus bungeana Zucc.ex Endl.(922 bp) [41], and R. latoucheae Franch.(709 bp) [42] but longer than that of Myrica rubra Sieb.(531 bp) [43] and Camellia sinensis O. Ktze.(380 bp) [44].Overall, the result of the present evaluation will contribute to the genetic improvement of S. japonicus.
The final unigene set was used for comparison with the NCBI database.The annotated sequence information can be effective for investigating genetic differentiation of S. japonicus in the future.In our study, a total of 71,226 unigenes were successfully annotated in the database.The species-based annotated unigenes showed that S. japonicus is closely related to V. vinifera, C. canephora, and N. nucifera, which obviously does not get supported by taxonomic studies.This contradiction is most likely be caused by the fact that there is no report on the transcriptome of the Styracaceae family, and 34,211 (52.83%) genes were annotated to other species.Therefore, it is most likely that in the future the most related species of Styrax japonicus will change, with more species of Styracaceae being annotated in the NCBI.There is less research on the bioinformatics information of the Styracaceae family, and 34,211 (52.83%) unigenes were annotated to other species.Of the three GO classifications, the greatest were cellular processes and cell and binding activity.This result is consistent with that for Hevea brasiliensis Muell.and Litchi chinensis Sonn.[45,46].With respect to the KOG category, the general function prediction was the largest category, which is consistent with a previous study [47].According to the KEGG pathway database, 18,355 (13.50%) unigenes significantly matched and were put into six categories, such as 128 KEGG pathways.The annotation information provides sufficient transcriptomic data and also increases our understanding of the active metabolic processes in S. japonicus.

SSR Markers in the Transcriptome of S. japonicus
It is common knowledge that SSR markers have been broadly utilized for genetic structure and analysis [48,49].Nevertheless, their use depends on the creation of SSR marker primers [50].EST-SSRs can be utilized to acquire data on gene expression directly since they are tightly linked to functional genes.Numerous EST sequences have been determined from transcriptome information [14].Transcriptome sequences can be used to create EST-SSR markers to facilitate research into genetic diversity and marker-assisted selection.A total of 55,977 possible EST-SSRs were determined (including 14,447 mono-nucleotides) in the present study, with di-nucleotide repeats the most noted SSRs, similar to previous studies [43,51,52].The frequency of AG/CT was the highest in the di-nucleotide repeats and the CG/CG motif was the least abundant, which might be related to the methylation of cytosine [42].The frequency of AAG/CTT was the greatest in the tri-nucleotide repeats.This finding is consistent with previous studies, which also indicated that the AAG motif might be more important in dicotyledonous plants [53,54].The average of one SSR site was found per 6.73 kb, which is similar to that observed for E. sibiricus (1/6.59 kb) [8], lower than Neolitsea sericea Koidz.(1/3.8 kb) and R. latoucheae [13,42], and higher than Paeonia suffruticosa Andr.(1/9.24 kb) [55].The frequency of SSRs mainly depends on the mining tool used, database size, the genome structure, and the species differences [56].
We detected 15 alleles among 40 S. japonicus individuals based on the 21 EST-SSR markers.Allele numbers varied from 2 to 15, and the average was 6.The average of observed heterozygosity and expected heterozygosity was 0.614 and 0.552, respectively.The PIC value is usually used to evaluate the level of genetic information [55].Our results showed that the PIC value fluctuates from 0.074 and 0.855, and the mean value was 0.504, which is also the middle level.The created EST-SSR markers are therefore an efficient instrument for performing genetic analysis within the broad range of S. japonicus populations.

Figure 1 .
Figure 1.Length distribution of unigenes.Horizontal and vertical axes show the size and number of unigens, respectively.What is counted is the number of the unigenes with the length of over 200bp.The statistics of unigenes equal to or more than 2000 bp are classified as a group.The number of unigenes with different length ranges is indicated by columns with different colors.

Figure 1 .
Figure 1.Length distribution of unigenes.Horizontal and vertical axes show the size and number of unigens, respectively.What is counted is the number of the unigenes with the length of over 200 bp.The statistics of unigenes equal to or more than 2000 bp are classified as a group.The number of unigenes with different length ranges is indicated by columns with different colors.

Figure 2 .
Figure 2. Gene Ontology (GO) annotations of Styrax japonicus unigenes.The left y-axis indicates the percentage of a specific category of genes in each main category.The right y-axis represents the amount of unigenes in a category.

Figure 2 .
Figure 2. Gene Ontology (GO) annotations of Styrax japonicus unigenes.The left y-axis indicates the percentage of a specific category of genes in each main category.The right y-axis represents the amount of unigenes in a category.

ForestsFigure 3 .
Figure 3. KOG functional classification of Styrax japonicus unigenes.The y-axis indicates the amount of unigenes in a specific functional cluster.The x-axis shows function class.

Figure 3 .
Figure 3. KOG functional classification of Styrax japonicus unigenes.The y-axis indicates the amount of unigenes in a specific functional cluster.The x-axis shows function class.

ForestsForests 15 Figure 4 .
Figure 4. KEGG metabolic pathway of Styrax japonicus.The y-axis indicates the KEGG pathway, and the x-axis is the ratio of the number of unigenes.

Figure 4 .
Figure 4. KEGG metabolic pathway of Styrax japonicus.The y-axis indicates the KEGG pathway, and the x-axis is the ratio of the number of unigenes.

Table 1 .
Summary of transcriptome sequencing and assembly in Styrax japonicus.

Table 1 .
Summary of transcriptome sequencing and assembly in Styrax japonicus.

Table 2 .
Distribution of the different expressed sequence tag-simple sequence repeats (EST-SSRs) of Styrax japonicus.

Table 2 .
Distribution of the different expressed sequence tag-simple sequence repeats (EST-SSRs) of Styrax japonicus.

Table 3 .
Characteristics of the 21 novel EST-SSR markers in Styrax japonicus.