The Complete Chloroplast Genome Sequences of the Medicinal Plant Forsythia suspensa (Oleaceae)

Forsythia suspensa is an important medicinal plant and traditionally applied for the treatment of inflammation, pyrexia, gonorrhea, diabetes, and so on. However, there is limited sequence and genomic information available for F. suspensa. Here, we produced the complete chloroplast genomes of F. suspensa using Illumina sequencing technology. F. suspensa is the first sequenced member within the genus Forsythia (Oleaceae). The gene order and organization of the chloroplast genome of F. suspensa are similar to other Oleaceae chloroplast genomes. The F. suspensa chloroplast genome is 156,404 bp in length, exhibits a conserved quadripartite structure with a large single-copy (LSC; 87,159 bp) region, and a small single-copy (SSC; 17,811 bp) region interspersed between inverted repeat (IRa/b; 25,717 bp) regions. A total of 114 unique genes were annotated, including 80 protein-coding genes, 30 tRNA, and four rRNA. The low GC content (37.8%) and codon usage bias for A- or T-ending codons may largely affect gene codon usage. Sequence analysis identified a total of 26 forward repeats, 23 palindrome repeats with lengths >30 bp (identity > 90%), and 54 simple sequence repeats (SSRs) with an average rate of 0.35 SSRs/kb. We predicted 52 RNA editing sites in the chloroplast of F. suspensa, all for C-to-U transitions. IR expansion or contraction and the divergent regions were analyzed among several species including the reported F. suspensa in this study. Phylogenetic analysis based on whole-plastome revealed that F. suspensa, as a member of the Oleaceae family, diverged relatively early from Lamiales. This study will contribute to strengthening medicinal resource conservation, molecular phylogenetic, and genetic engineering research investigations of this species.


Introduction
Forsythia suspensa (Thunb.) Vahl, known as "Lianqiao" in Chinese, is a well-known traditional Asian medicine that is widely distributed in many Asian and European countries [1]. In folk medicine, the extract of the dried fruit has long been used to treat a variety of diseases, such as inflammation, pyrexia, gonorrhea, tonsillitis, and ulcers [2]. In recent years, the dried ripe fruit of F. suspensa has often been prescribed for the treatment of diabetes in China [3,4].
Chloroplast (cp) genomes are mostly circular DNA molecules, which have a typical quadripartite structure composed of a large single copy (LSC) region and a small single copy (SSC) region interspersed between two copies of inverted repeats (IRa/b) [5]. The cp genome sequences can provide vast information not only about genes and their encoded proteins, but also on functional implications and evolutionary relationships [6]. Due to high-throughput capabilities and relatively low costs, next-generation sequencing techniques have made it more convenient to obtain a large number of cp genome sequences [7]. After the first complete cp DNA sequences were reported in Nicotiana tabacum [8] and Marchantia polymorpha [9], complete cp DNA sequences of numerous plant species were determined [6,[10][11][12]. To date, approximately 1300 plant cp genomes are publicly available as part of the National Center for Biotechnology Information (NCBI) database.
Within the Oleaceae family, the complete cp genomes of several plant species have been published [12][13][14][15], thereby providing additional evidence for the evolution and conservation of cp genomes. Nevertheless, no cp genome belonging to genus Forsythia has been reported. Few data are available with respect to the F. suspensa cp genome.
In order to characterize the complete cp genome sequence of the F. suspensa and expand our understanding of the diversity of the genus Forsythia, details of the cp genome structure and organization are reported in this paper. This is also the first sequenced member of the genus Forsythia (Oleaceae). We compare the F. suspense cp genome with previously annotated cp genomes of other Lamiales species. Our studies could provide basic data for the medicinal species conservation and molecular phylogenetic research of the genus Forsythia and Lamiales.

Genome Features
Whole genome sequencing using an Illumina Hiseq 4000 PE150 platform generated 19,241,634 raw reads. Clean reads were obtained by removing adaptors and low-quality read pairs. Then, we collected 662,793 cp-genome-related reads (3.44% of total reads), reaching an average of 636 × coverage over the cp genome. With PCR-based experiments, we closed the gaps and validated the sequence assembly, and ultimately obtained a complete F. suspensa cp genome sequence, which was then submitted to GenBank (accession number: MF579702).
Most cp genomes of higher plants have been found to have a typical quadripartite structure composed of an LSC region and an SSC region interspersed between the IRa/b region [5]. The complete cp genome of F. suspensa has a total length of 156,404 bp, with a pair of IRs of 25,717 bp that separate an LSC region of 87,159 bp and an SSC region of 17,811 bp ( Figure 1). The total GC content was 37.8%, which was similar to the published Oleaceae cp genomes [12][13][14][15]. The GC content of the IR regions was 43.2%, which was higher when compared with the GC content in the LSC and SSC regions (35.8% and 31.8%, respectively). The gene content and sequence of the F. suspensa cp genome are relatively conserved, with basic characteristics of land plant cp genomes [16]. It encodes a total of 114 unique genes, of which 19 are duplicated in the IR regions. Out of the 114 genes, there are 80 protein-coding genes (70.2%), 30 tRNA (26.3%), and four rRNA genes (rrn5, rrn4.5, rrn16, rrn23) (3.5%) ( Table 1). Eighteen genes contained introns, fifteen (nine protein-coding and six tRNA genes) of which contained one intron and three of which (rps12, ycf3, and clpP) contained two introns ( Table 2). The rps12 gene is a trans-spliced gene, three exons of which were located in the LSC region and IR regions, respectively. The complete gene of matK was located within the intron of trnK-UUU. One pseudogene (non functioning duplications of functional genes), ycf1, was identified, located in the boundary regions between IRb/SSC. The partial gene duplication might have caused the lack of protein-coding ability. In general, the junctions between the IR and LSC/SSC regions vary among higher plant cp genomes [17][18][19]. In the F. suspensa cp genome, the ycf1 gene regions extended into the IR region in the IR/SSC junctions, while the rpl2 was 51 bp apart from the LSC/IR junction. The gene content and sequence of the F. suspensa cp genome are relatively conserved, with basic characteristics of land plant cp genomes [16]. It encodes a total of 114 unique genes, of which 19 are duplicated in the IR regions. Out of the 114 genes, there are 80 protein-coding genes (70.2%), 30 tRNA (26.3%), and four rRNA genes (rrn5, rrn4.5, rrn16, rrn23) (3.5%) ( Table 1). Eighteen genes contained introns, fifteen (nine protein-coding and six tRNA genes) of which contained one intron and three of which (rps12, ycf3, and clpP) contained two introns ( Table 2). The rps12 gene is a trans-spliced gene, three exons of which were located in the LSC region and IR regions, respectively. The complete gene of matK was located within the intron of trnK-UUU. One pseudogene (non functioning duplications of functional genes), ycf1, was identified, located in the boundary regions between IRb/SSC. The partial gene duplication might have caused the lack of protein-coding ability. In general, the junctions between the IR and LSC/SSC regions vary among higher plant cp genomes [17][18][19]. In the F. suspensa cp genome, the ycf1 gene regions extended into the IR region in the IR/SSC junctions, while the rpl2 was 51 bp apart from the LSC/IR junction.

Comparison to Other Lamiales Species
The IR regions are highly conserved and play an important role in stabilizing the cp genome structure [20,21]. For IR and SC boundary regions, their expansion and contraction are commonly considered as the main mechanism behind the length variation of angiosperm cp genomes [22,23]. In this study, we compared the junctions of LSC/IRb/SSC/IRa of the seven Lamiales cp genomes (Figure 2), and also observed the expansions and contractions in IR boundary regions.

Comparison to Other Lamiales Species
The IR regions are highly conserved and play an important role in stabilizing the cp genome structure [20,21]. For IR and SC boundary regions, their expansion and contraction are commonly considered as the main mechanism behind the length variation of angiosperm cp genomes [22,23]. In this study, we compared the junctions of LSC/IRb/SSC/IRa of the seven Lamiales cp genomes ( Figure  2), and also observed the expansions and contractions in IR boundary regions. The rps19 genes of four Oleaceae species were all completely located in the LSC region, and the IR region expanded to the rps19 gene in the other three genomes, with a short rps19 pseudogene of 43 bp, 30 bp, and 40 bp created at the IRa/LSC border in S. miltiorrhiza, S. indicum, and S. takesimensis, respectively. The border between the IRb and SSC extended into the ycf1 genes, with ycf1 pseudogenes created in all of the seven species. The length of the ycf1 pseudogene was very similar in four of the Oleaceae species (1091 or 1092 bp), and was longer than that in S. miltiorrhiza (1056 bp), S. indicum (1012 bp), and S. takesimensis (886 bp). Overlaps were detected between the ycf1 pseudogene and the ndhF gene in five cp genomes (except for S. indicum and S. takesimensis), which also had similar lengths (25 or 26 bp) in four Oleaceae species. The trnH-GUG genes were all located in the LSC region, the distance of which from the LSC/IRa boundary was 3-22 bp. Overall, the IR/SC junctions of the Oleaceae species were similar and showed some difference compared to those of Lamiaceae (S. miltiorrhiza), Pedaliaceae (S. indicum), and Scrophulariaceae (S. takesimensis). Our results suggested that the cp genomes of closely related species might be conserved, whereas greater diversity might occur among species belonging to different families, such as one inverted repeat loss in the cp genome of Astragalus membranaceus [24] and the large inversions in Eucommia ulmoides [25].

Codon Usage Analysis
The synonymous codons often have different usage frequencies in plant genomes, which was termed codon usage bias. A variety of evolutionary factors which affect gene mutation and selection may lead to the occurrence of codon bias [26,27].
To examine codon usage, the effective number of codons (Nc) of 52 protein-coding genes (PCGs) was calculated. The Nc values for each PCG in F. suspensa are shown in Table S2. Our results indicated that the Nc values ranged from 37.83 (rps14) to 54.75 (ycf3) in all the selected PCGs. Most Nc values were greater than 44, which suggested a weak gene codon bias in the F. suspensa cp genome. The rps14 gene was detected to exist in the most biased codon usage with the lowest mean Nc value of The rps19 genes of four Oleaceae species were all completely located in the LSC region, and the IR region expanded to the rps19 gene in the other three genomes, with a short rps19 pseudogene of 43 bp, 30 bp, and 40 bp created at the IRa/LSC border in S. miltiorrhiza, S. indicum, and S. takesimensis, respectively. The border between the IRb and SSC extended into the ycf1 genes, with ycf1 pseudogenes created in all of the seven species. The length of the ycf1 pseudogene was very similar in four of the Oleaceae species (1091 or 1092 bp), and was longer than that in S. miltiorrhiza (1056 bp), S. indicum (1012 bp), and S. takesimensis (886 bp). Overlaps were detected between the ycf1 pseudogene and the ndhF gene in five cp genomes (except for S. indicum and S. takesimensis), which also had similar lengths (25 or 26 bp) in four Oleaceae species. The trnH-GUG genes were all located in the LSC region, the distance of which from the LSC/IRa boundary was 3-22 bp. Overall, the IR/SC junctions of the Oleaceae species were similar and showed some difference compared to those of Lamiaceae (S. miltiorrhiza), Pedaliaceae (S. indicum), and Scrophulariaceae (S. takesimensis). Our results suggested that the cp genomes of closely related species might be conserved, whereas greater diversity might occur among species belonging to different families, such as one inverted repeat loss in the cp genome of Astragalus membranaceus [24] and the large inversions in Eucommia ulmoides [25].

Codon Usage Analysis
The synonymous codons often have different usage frequencies in plant genomes, which was termed codon usage bias. A variety of evolutionary factors which affect gene mutation and selection may lead to the occurrence of codon bias [26,27].
To examine codon usage, the effective number of codons (Nc) of 52 protein-coding genes (PCGs) was calculated. The Nc values for each PCG in F. suspensa are shown in Table S2. Our results indicated that the Nc values ranged from 37.83 (rps14) to 54.75 (ycf3) in all the selected PCGs. Most Nc values were greater than 44, which suggested a weak gene codon bias in the F. suspensa cp genome. The rps14 gene was detected to exist in the most biased codon usage with the lowest mean Nc value of 37.83. Table 3 showed the codon usage and relative synonymous codon usage (RSCU). Due to the RSCU values of >1, thirty codons showed the codon usage bias in the F. suspensa cp genes. Interestingly, out of the above 30 codons, twenty-nine were A or T-ending codons. Conversely, the G + C-ending codons exhibited the opposite pattern (RSCU values <1), indicating that they are less common in F. suspensa cp genes. Stop codon usage was found to be biased toward TAA. The similar codon usage rules of bias for A-or T-ending were also found in poplar, rice, and other plants [28][29][30]. The value of relative synonymous codon usage (RSCU) > 1 are highlighted in bold.
The factors affecting codon usage may vary in different genes or species. In a relative study, Zhou et al. [30] considered the genomic nucleotide mutation bias as a main cause of codon bias in seed plants such as arabidopsis and poplar. Morton [31] reported that the cp gene codon usage was largely affected by the asymmetric mutation of cp DNA in Euglena gracilis. Our result suggested that a low GC content and codon usage bias for A + T-ending may be a major factor in the cp gene codon usage of F. suspensa.
The 52 unique PCGs comprised 63,555 bp that encoded 21,185 codons. The amino acid (AA) frequencies of the F. suspensa cp genome were further computed. Of these codons, 2237 (10.56%) encode leucine, which was the most frequency used AA in the F. suspensa cp genome ( Table 3). As the least common one, cysteine was only encoded by 223 (1.05%) codons.

Repeats and Simple Sequence Repeats Analysis
Repeat sequences in the F. suspensa cp genome were analyzed by REPuter and the results showed that there were no complement repeats and reverse repeats. Twenty-six forward repeats and 23 palindrome repeats were detected with lengths ≥30 bp (identity >90%) ( Table 4). Out of the 49 repeats, 34 repeats (69.4%) were 30-39 bp long, 11 repeats (22.4%) were 40-49 bp long, four repeats (8.2%) were 50-59 bp long, and the longest repeat was 58 bp. Generally, repeats were mostly distributed in noncoding regions [32,33]; however, 53.1% of the repeats in the F. suspensa cp genome were located in coding regions (CDS) (Figure 3A), mainly in ycf2; similar to that of S. dentata and S. takesimensis [34]. Meanwhile, 40.8% of repeats were located in intergenic spacers (IGS) and introns, and 6.1% of repeats were in parts of the IGS and CDS. Simple sequence repeats (SSRs) are widely distributed across the entire genome and exert significant influence on genome recombination and rearrangement [35]. As valuable molecular markers, SSRs have been used in polymorphism investigations and population genetics [36,37]. The occurrence, type, and distribution of SSRs were analyzed in the F. suspensa cp genome. In total, we detected 54 SSRs in the F. suspensa cp genome (Table 5), accounting for 700 bp of the total sequence (0.45%).
The majority of these SSRs consisted of mono-and di-nucleotide repeats, which were found 35 and seven times, respectively. Tri-(1), tetra-(4), and penta-nucleotide repeat sequences (1) were detected with a much lower frequency. Six compound SSRs were also found. Fifty SSRs (92.6%) were composed of A and T nucleotides, while tandem G or C repeats were quite rare, which was in concordance with the other research results [38,39]. Out of these SSRs, 42 (88.9%) and six (11.1%) were located in IGS and introns, respectively ( Figure 3B). Only five SSRs were found in the coding genes, including rpoC2, rpoA, and ndhD, and one was located in parts of the IGS and CDS. In addition, we noticed that almost all SSRs were located in LSC, except for (T)19, and no SSRs were detected in the IR region. These SSRs may be developed lineage-specific markers, which might be useful in evolutionary and genetic diversity studies.  Simple sequence repeats (SSRs) are widely distributed across the entire genome and exert significant influence on genome recombination and rearrangement [35]. As valuable molecular markers, SSRs have been used in polymorphism investigations and population genetics [36,37]. The occurrence, type, and distribution of SSRs were analyzed in the F. suspensa cp genome. In total, we detected 54 SSRs in the F. suspensa cp genome (Table 5), accounting for 700 bp of the total sequence (0.45%). The majority of these SSRs consisted of mono-and di-nucleotide repeats, which were found 35 and seven times, respectively. Tri-(1), tetra-(4), and penta-nucleotide repeat sequences (1) were detected with a much lower frequency. Six compound SSRs were also found. Fifty SSRs (92.6%) were composed of A and T nucleotides, while tandem G or C repeats were quite rare, which was in concordance with the other research results [38,39]. Out of these SSRs, 42 (88.9%) and six (11.1%) were located in IGS and introns, respectively ( Figure 3B). Only five SSRs were found in the coding genes, including rpoC2, rpoA, and ndhD, and one was located in parts of the IGS and CDS. In addition, we noticed that almost all SSRs were located in LSC, except for (T)19, and no SSRs were detected in the IR region. These SSRs may be developed lineage-specific markers, which might be useful in evolutionary and genetic diversity studies.

Predicted RNA Editing Sites in the F. suspensa Chloroplast Genes
In the F. suspensa cp genome, we predicted 52 RNA editing sites, which occurred in 21 genes ( Table 6). The ndhB gene contained the most editing sites (10), and this finding was consistent with other plants such as rice, maize, and tomato [40][41][42]. Meanwhile, the genes ndhD and rpoB were predicted to have six editing sites: matK, five; ropC2, three; accD, ndhA, ndhF, ndhG, and petB, two; and one each in atpA, atpF, atpI, ccsA, petG, psbE, rpl2, rpl20, rpoA, rps2, and rps14. All these editing sites were C-to-U transitions. The editing phenomenon was also commonly found in the chloroplasts and mitochondria of seed plants [43]. The locations of the editing sites in the first, second, and third codons were 14, 38, and 0, respectively. Of the 52 sites, twenty were U_A types, which was similar codon bias to previous studies of RNA editing sites [10,44]. In addition, forty-eight RNA editing events in the F. suspensa cp genome led to acid changes for highly hydrophobic residues, such as leucine, isoleucine, valine, tryptophan, and tyrosine. The conversions from serine to leucine were the most frequent transitions. As a form of post-transcriptional regulation of gene expression, the feature has already been revealed by most RNA editing researches [44]. Notably, our results provide additional evidence to support the above conclusion.   Table 6. Cont.

Phylogeny Reconstruction of Lamiales Based on Complete Chloroplast Genome Sequences
Complete cp genomes comprise abundant phylogenetic information, which could be applied to phylogenetic studies of angiosperm [11,45,46]. To identify the evolutionary position of F. suspensa within Lamiales, an improved resolution of phylogenetic relationships was achieved by using these whole cp genome sequences of 36 Lamiales species. Three species, C. Arabica, I. purpurea, and O. nivara were also chosen as outgroups. The Maximum likelihood (ML) bootstrap values were fairly high, with values ≥98% for 32 of the 36 nodes, and 30 nodes had 100% bootstrap support (Figure 4). F. suspensa, whose cp genome was reported in this study, was closely related to A. distichum, which then formed a cluster with H. palmeri, J. nudiflorum, and the Olea species from Oleaceae with 100% bootstrap supports. Notably, Oleaceae diverged relatively early from the Lamiales lineage. In addition, four phylogenetic relationships were only supported by lower ML bootstrap values. This was possibly a result of less samples in these families. The cp genome is also expected to be useful in resolving the deeper branches of the phylogeny, along with the availability of more whole genome sequences. bootstrap supports. Notably, Oleaceae diverged relatively early from the Lamiales lineage. In addition, four phylogenetic relationships were only supported by lower ML bootstrap values. This was possibly a result of less samples in these families. The cp genome is also expected to be useful in resolving the deeper branches of the phylogeny, along with the availability of more whole genome sequences.

Plant Materials
Samples of F. suspensa were collected in Zezhou County, Shanxi Province, China. The voucher specimens were deposited in the Herbarium of Shanxi Agricultural University, Taigu, China. Additionally, the location of the specimens was not within any protected area.

DNA Library Preparation, Sequencing, and Genome Assembly
Genomic DNA was extracted from fresh young leaves of the F. suspensa plant using the mCTAB method [47]. Genomic DNA was fragmented into 400-600 bp using a Covaris M220 Focused-ultrasonicator (Covaris, Woburn, MA, USA). Library preparation was conducted using NEBNext ® Ultra™ DNA Library Prep Kit Illumina (New England, Biolabs, Ipswich, MA, USA). Sample sequencing was carried out on an Illumina Hiseq 4000 PE150 platform.

Codon Usage
To ensure sampling accuracy, only 52 PCGs with a length >300 bp were selected for synonymous codon usage analysis. Two relevant parameters, Nc and RSCU, were calculated using the program CodonW1.4.2 (Available online: http://downloads.fyxm.net/CodonW-76666.html). Nc is often utilized to evaluate the codon bias at the individual gene level, in a range from 20 (extremely biased) to 61 (totally unbiased) [58]. RSCU is the observed frequency of a codon divided by the expected frequency. The values close to 1.0 indicate a lack of bias [59]. AA frequency was also calculated and expressed by the percentage of the codons encoding the same amino acid divided by the total codons.

Prediction of RNA Editing Sites
Prep-Cp [60] (Available online: http://prep.unl.edu/) and CURE software [61] (Available online: http://bioinfo.au.tsinghua.edu.cn/pure/) were applied to the prediction of RNA editing sites, and the parameter threshold (cutoff value) was set to 0.8 to ensure prediction accuracy.

Phylogenomic Analyses
ML phylogenetic analyses were performed using the F. suspensa complete cp genome and 32 Lamiales plastomes with three species, Coffea arabica, Ipomoea purpurea, and Oryza nivara, as outgroups (Table S1). All of the plastome sequences were aligned using MAFFT program version 7.0 [62] (Available online: http://mafft.cbrc.jp/alignment/server/index.html) and adjusted manually where necessary. These plastome nucleotide alignments were subjected to ML phylogenetic analyses with MEGA7.0 [63] based on the General Time Reversible model. A discrete Gamma distribution was used to model evolutionary rate differences among sites. The branch support was estimated by rapid bootstrap analyses using 100 pseudo-replicates.

Conclusions
The cp genome of the medicinal plant F. suspensa was reported for the first time in this study and its organization is described and compared with that of other Lamiales species. This genome is 156,404 bp in length, with a similar quadripartite structure and genomic contents common to most land plant genomes. The low GC content of the cp genome might caused the codon usage bias toward A-or T-ending codons. All of the predicted RNA editing sites in the genome were C-to-U transitions. Among several relative species, the genome size and IR expansion or contraction exhibited some differences, and the divergent regions were also analyzed. Repeat sequences and SSRs within F. suspensa were analyzed, which may be useful in developing molecular markers for the analyses of infraspecific genetic differentiation within the genus Forsythia (Oleaceae). Phylogenetic analysis based on the entire cp genome revealed that F. suspensa, as a member of the Oleaceae family, diverged relatively early from Lamiales. Overall, the sequences and annotation of the F. suspensa cp genome will facilitate medicinal resource conservation, as well as molecular phylogenetic and genetic engineering research of this species.