The Complete Chloroplast Genome Sequence of the Medicinal Plant Swertia mussotii Using the PacBio RS II Platform

Swertia mussotii is an important medicinal plant that has great economic and medicinal value and is found on the Qinghai Tibetan Plateau. The complete chloroplast (cp) genome of S. mussotii is 153,431 bp in size, with a pair of inverted repeat (IR) regions of 25,761 bp each that separate a large single-copy (LSC) region of 83,567 bp and a small single-copy (SSC) region of 18,342 bp. The S. mussotii cp genome encodes 84 protein-coding genes, 37 transfer RNA (tRNA) genes, and eight ribosomal RNA (rRNA) genes. The identity, number, and GC content of S. mussotii cp genes were similar to those in the genomes of other Gentianales species. Via analysis of the repeat structure, 11 forward repeats, eight palindromic repeats, and one reverse repeat were detected in the S. mussotii cp genome. There are 45 SSRs in the S. mussotii cp genome, the majority of which are mononucleotides, a repeat type found in all other Gentianales species examined. An entire cp genome comparison study of S. mussotii and two other species in Gentianaceae was conducted. The complete cp genome sequence provides intragenic information for the cp genetic engineering of this medicinal plant.


Introduction
Swertia mussotii Franch (Zang Yin Chen, in Tibetan medicine) belongs to the family Gentianaceae. This species grows on the Qinghai Tibetan Plateau at an elevation of 3800-5000 m. To date, several pharmaceutically active compounds have been isolated and structurally identified from the whole S. mussotii plant, including oleanolic acid, ursolic acid, mangiferin, swertiamarin, and gentiopicroside [1][2][3][4]. Modern pharmacological research has demonstrated that these compounds have anti-hepatitis activity [5][6][7]. Due to the overexploitation of this plant, S. mussotii as a wild resource has become rare. S. mussotii seeds germinate poorly when planted at low elevations.
Chloroplasts originated from the interaction of photosynthetic bacteria with non-photosynthetic hosts through endosymbiosis [8]. Chloroplasts are photosynthetic organelles that synthesise starch, amino acids, pigments, and fatty acids [9,10]. The chloroplast has its own genome, and a typical circular cp genome is composed of four parts: a large single-copy (LSC) region, a small single-copy (SSC) region, and two inverted repeat (IR) regions. The majority of angiosperm cp genomes are highly conserved in gene content and order [11]. However, large-scale genome rearrangement and gene loss have been identified in several angiosperm lineages [12,13].
The third-generation sequencing platform, PacBio, based on single-molecule, real-time (SMRT) sequencing technology, generates average read lengths of over 10 kb, with half of the reads over 20 kb and a maximum read length reaching up to 60 kb, using the newest P6-C4 chemical reagents on the current PacBio RS II machine. In addition to its extraordinarily long read length, this platform provides uniform coverage across GC-abnormal regions because no PCR amplification is required during the library construction [14,15]. Many concerns have concentrated on the high rates of random error in single-pass reads (approximately 11% to 14%) [15]. However, this can be improved given sufficient sequencing depth [15]. Additionally, the optimisation of the PacBio assembly algorithm [16][17][18] has made this platform widely applied in de novo genome sequencing [19,20], as well as full-length transcriptome sequencing [21,22], for a growing number of species.
Due to the low GC content and the IR regions, it is difficult to use short reads from second-generation sequencing to recover a single contig spanning the whole cp genome [14]. Using PacBio, long reads can greatly reduce the complexity of the assembly, and PacBio has already been successfully applied in many chloroplast genome sequencing projects, including Ananas comosus var. comosus [23], Aconitum barbatum var. puberulum [24], Beta vulgaris [25], and Gentiana straminea [26]. Meanwhile, comparative studies among the three generations of sequencing technologies (Sanger, Illumina and PacBio) have demonstrated the reliability and accuracy of SMRT sequencing [27,28].
Currently, more than 1000 complete cp genome sequences have been deposited in the NCBI Organelle Genome Resources [29]. However, few reports have been published on the genetic diversity of cpDNA from Gentianaceae [26]. The chloroplast genome sequences of two members of the Gentianaceae, Gentiana straminea [26] and Gentiana crassicaulis, have been analysed. Here, we report the complete cp genome sequence of S. mussotii as determined using PacBio technology. Comparative sequence analysis was conducted among published Gentianaceae cp genomes.

Features of the S. mussotii Chloroplast Genome
The complete cp genome of S. mussotii is 153,431 bp in size, with a pair of IR regions of 25,761 bp each that separate an LSC region of 83,567 bp from an SSC region of 18,342 bp (Table 1 and Figure 1). The overall GC content of the S. mussotii cp genome is 38.2%, with the IR regions possessing higher GC content (43.5%) than the LSC (36.2%) and SSC regions (31.9%) (Table 1). The high GC content of the IR regions is caused by the high GC content of the four ribosomal RNA (rRNA) genes (55.2%) present in this region [30]. The S. mussotii cp genome encodes 84 protein-coding genes, 37 transfer RNA (tRNA) genes, and eight rRNA genes (Table 2). Seven protein-coding, seven tRNA, and all rRNA genes are duplicated in the IR regions. The non-coding regions constitute 41.6% of the genome, including introns, pseudogenes, and intergenic spacers; coding regions constitute 58.4%. In Table 2, one or two asterisks after the name of a gene indicate that the gene contains one or two introns, respectively.
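The quadripartite structure described above can be checked with simple arithmetic: the LSC, the SSC, and two copies of the IR should add up to the genome size, and a length-weighted mean of the regional GC values should land near the overall figure. The following is a minimal sketch using only the numbers reported in the text; the dictionary layout and function names are illustrative assumptions, and because the regional GC percentages are rounded, the weighted mean is expected to be close to, not necessarily identical to, the reported 38.2%.

```python
# Region lengths (bp) and GC percentages as reported in the text.
# The data structure and helper names are illustrative, not from the paper.
REGIONS = {
    "LSC": {"length": 83_567, "gc": 36.2, "copies": 1},
    "SSC": {"length": 18_342, "gc": 31.9, "copies": 1},
    "IR": {"length": 25_761, "gc": 43.5, "copies": 2},  # IRa and IRb
}


def genome_length(regions):
    """Total genome length: each region's length times its copy number."""
    return sum(r["length"] * r["copies"] for r in regions.values())


def weighted_gc(regions):
    """Length-weighted mean GC percentage across all regions."""
    gc_bases = sum(
        r["length"] * r["copies"] * r["gc"] / 100 for r in regions.values()
    )
    return 100 * gc_bases / genome_length(regions)


def gc_content(seq):
    """GC percentage of a raw nucleotide string (case-insensitive)."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


print(genome_length(REGIONS))          # 153431
print(round(weighted_gc(REGIONS), 1))  # approximate overall GC (%)
```

The `gc_content` helper shows how the per-region values themselves would be obtained from sequence data once the region boundaries are known.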
There are five pseudogenes, i.e., accD, rps16, infA, rps19, and ycf1. The accD gene in S. mussotii contains internal stop codons. The accD gene also exists as a pseudogene in Jasminum nudiflorum and Trachelium caeruleum, but it is a normal gene in G. straminea. The rps16 gene lacks exon 2, a phenomenon that has been observed in related species. In S. mussotii, rps16 is a pseudogene, whereas in Syzygium cumini, Eucalyptus globulus, and Gossypium barbadense, the rps16 gene encodes ribosomal protein S16 [31]. The absence or incompleteness of this gene has also been reported in other plants [32,33]. The infA gene is 3′ truncated, though it is a normal gene in many other cp genomes [34,35].
The S. mussotii cp genome has 17 intron-containing genes, of which three (clpP, rps12, and ycf3) contain two introns (Table 3). The rps12 gene is a trans-spliced gene with the 5′ end located in the LSC region and the duplicated 3′ end located in the IR regions. trnK-UUU has the largest intron, which contains the matK gene. Together, all of the genes of S. mussotii are encoded by 25,731 codons. Among these, leucine, with 2769 (10.7%) of the codons, is the most frequent amino acid in the genome, and cysteine, with 292 (1.1%), is the least frequent (Table 4). Within the protein-coding regions (CDS), the percentages of AT content for the first, second, and third codon positions are 54.3%, 61.3%, and 68.8%, respectively. The bias towards a higher AT representation at the third codon position has also been observed in other plant cp genomes [36,37].
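The position-specific AT content reported above is obtained by splitting each in-frame CDS into its first, second, and third codon positions and counting A and T at each. The sketch below illustrates the calculation on a short made-up sequence; the function name and the toy CDS are assumptions for illustration, and the study itself used MEGA5 for this analysis.

```python
def at_by_codon_position(cds):
    """Return AT percentages at codon positions 1, 2 and 3 of an in-frame CDS."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of three")
    percentages = []
    for offset in range(3):
        # Every third base starting at this offset is one codon position.
        bases = cds[offset::3]
        at_count = sum(1 for base in bases if base in "AT")
        percentages.append(100 * at_count / len(bases))
    return percentages


# Hypothetical five-codon CDS: ATG GCA GCT CGA TAA
toy_cds = "ATGGCAGCTCGATAA"
print(at_by_codon_position(toy_cds))  # [40.0, 40.0, 80.0]
```

For a whole genome, the same counts would be accumulated over all annotated CDS features before the percentages are taken.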

Repeat Analysis
Repeat structure analysis revealed the presence of 11 forward repeats, eight palindromic repeats, and one reverse repeat in the S. mussotii cp genome (Table 5). The repeats were mostly distributed in the intergenic spacer (IGS) and intron sequences. We analysed the repeats of several other species in Gentianales (Figure 2). Interestingly, this comparison revealed that the longest repeats in the five Gentianales cp genomes were 30-39 bp, and the Oncinotis tenuiloba cp genome contained the greatest total number of repeats (54). Chloroplast simple sequence repeats (SSRs) have been accepted as effective molecular markers [38,39]. There were 45 SSRs in the S. mussotii cp genome (Table 6), the majority of which were mononucleotides (30), a repeat type that we found in all the other species [40]. Pentanucleotides and hexanucleotides were rarely found in the Gentianales cp genomes (Table 7). Most SSR loci were located in LSC regions. In all species, the majority of the tri- to hexanucleotides were AT-rich. An average of 62% of all SSRs in the Gentianales cp genomes were A/T mononucleotides. These results are consistent with the view that SSRs in cp genomes contribute to AT richness [41,42].
Table footnotes: CDS: coding regions. a Percentages were calculated using the total length of the CDS divided by the genome size. b Total number of SSRs identified in the CDS. c Percentages were calculated using the total number of SSRs in the CDS divided by the total number of SSRs in the genome.

Comparative Chloroplast Genomic Analysis
The whole cp genome sequence of S. mussotii was compared to those of G. straminea and G. crassicaulis. The cp genome of S. mussotii is the longest of the three cp genomes, measuring approximately 4.4 kb and 4.7 kb longer than those of G. straminea and G. crassicaulis, respectively. There are no significant differences in sequence length between the SSCs or the IRs, and the variation in sequence length is mainly attributable to the difference in the length of the LSC region (Table S2) [40].
The overall sequence identity of the three Gentianaceae cp genomes was plotted using mVISTA, with the annotation of S. mussotii as a reference (Figure 3). The comparison shows that the two IR regions are less divergent than the LSC and SSC regions. Additionally, the coding regions are more conserved than the non-coding regions [26], and the highly divergent regions among the three cp genomes occur in the non-coding regions, including ndhD-ccsA, ndhI-ndhG, and trnH-psbA. Similar results have been observed in other plant cp genomes [26,43]. In our study, we observed that all four rRNA genes are the most conserved, while the most divergent coding regions are the clpP, rpl22, ycf1, rpl32, ycf15, and matK genes. The divergent portions of non-coding regions of cp genomes have proven useful for phylogenetic analysis [44,45].
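An identity plot of this kind reports, for successive windows along an alignment, the percentage of positions at which the compared sequence matches the reference. The following is a simplified sketch of that idea and not a reimplementation of mVISTA: it assumes two pre-aligned, equal-length sequences with `-` for gaps, and the window size, step, and function names are illustrative assumptions. Only the 70% cut-off is taken from the Figure 3 description.

```python
def window_identity(ref, other, window=100, step=100):
    """Percent identity per window for two pre-aligned, equal-length sequences.

    A gap ('-') in either sequence is counted as a mismatch.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(ref) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(ref) - window + 1, step):
        ref_win = ref[start:start + window]
        other_win = other[start:start + window]
        matches = sum(
            1 for a, b in zip(ref_win, other_win) if a == b and a != "-"
        )
        profile.append((start, 100 * matches / window))
    return profile


def conserved_windows(profile, cutoff=70.0):
    """Start positions of windows whose identity meets the cut-off."""
    return [start for start, identity in profile if identity >= cutoff]


# Toy alignment with a divergent second half (hypothetical sequences).
profile = window_identity("ACGTACGTAC", "ACGTTTTTAC", window=5, step=5)
print(profile)                     # [(0, 80.0), (5, 60.0)]
print(conserved_windows(profile))  # [0]
```

Windows falling below the cut-off correspond to the divergent regions discussed above, such as the intergenic spacers.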

IR Contraction and Expansion
IR contraction was observed at the junction of the IR and LSC regions of the S. mussotii cp genome. This contraction has also been found in the twelve species of Gentianales analysed (G. straminea, G. crassicaulis, C. arabica, C. roseus, A. nivea, A. syriaca, R. stricta, E. umbellatus, N. oleander, O. tenuiloba, P. luteum, and G. officinalis) (Figure 4). In all of these species, the IRa/SSC junction is situated in the coding region of the ycf1 gene, resulting in the duplication of the 3′ end of this gene. This duplication produces a pseudogene of variable length at the IRb/SSC border. The lengths of the ycf1 pseudogenes varied from 945 bp to 1426 bp. In addition, the ycf1 pseudogene and the ndhF gene overlapped in S. mussotii, G. straminea, G. crassicaulis, N. oleander, and R. stricta by 54 bp, 54 bp, 54 bp, 62 bp, and 3 bp, respectively. The IRb/LSC border is located in the coding region of rps19 in all the compared plants, except for A. nivea, A. syriaca, and G. officinalis. rps19 pseudogenes of various lengths were also found at the IRa/LSC borders in S. mussotii, G. straminea, G. crassicaulis, C. arabica, C. roseus, G. officinalis, and R. stricta. S. mussotii had the longest rps19 pseudogene, at 199 bp in length. The trnH genes of these thirteen species were all located in the LSC region, 0-82 bp away from the IRa/LSC border. In the cp genome, the IR/LSC boundaries are not static, but are subject to a dynamic and random process that allows conservative expansions and contractions [46].

DNA Sequencing, Genome Assembly, and Validation
Fresh leaves were collected from S. mussotii in Yushu County, Qinghai Province. Total DNA was extracted using the NuClean PlantGen DNA Kit (CWBIO, Beijing, China) and was used to construct an SMRT sequencing library with an insert size of 10 kb. The genome was sequenced using the PacBio RS II platform (Pacific Biosciences, Menlo Park, CA, USA) at the Institute of Medicinal Plant Development of the Chinese Academy of Medical Sciences. We assembled the cp genome of S. mussotii as follows: first, the PacBio reads were error-corrected and assembled to produce the initial contigs using the hierarchical genome assembly process (HGAP) of SMRT Analysis (Pacific Biosciences); then, the coverage for each contig was calculated by mapping the PacBio reads to these initial contigs using BLASR [47], and contigs either showing similarity to the closely related cp genome sequences or exhibiting similar coverage were extracted; finally, the complete cp genome was constructed by assembling these contigs. Based on the BLASR results, 3904 PacBio reads were used in the assembly of the complete cp genome, with a total length of 46,037,271 bp, thus yielding a 300× depth of the cp genome. Four junction regions between IRs and LSC/SSC were verified by PCR amplifications and Sanger sequencing. The final cp genome of S. mussotii was submitted to GenBank under the accession number KU641021.
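The stated sequencing depth follows directly from the read statistics: mean depth is the total number of sequenced bases divided by the genome length. A minimal worked calculation using only the figures reported above is given below; the variable and function names are illustrative.

```python
# Figures reported in the text.
TOTAL_READ_BASES = 46_037_271  # summed length of the cp-derived PacBio reads (bp)
READ_COUNT = 3_904             # number of PacBio reads used in the assembly
GENOME_LENGTH = 153_431        # assembled cp genome size (bp)


def mean_depth(total_bases, genome_length):
    """Average number of times each genome position is covered."""
    return total_bases / genome_length


def mean_read_length(total_bases, read_count):
    """Average read length in bp."""
    return total_bases / read_count


print(round(mean_depth(TOTAL_READ_BASES, GENOME_LENGTH)))     # 300
print(round(mean_read_length(TOTAL_READ_BASES, READ_COUNT)))  # 11792
```

The mean read length of roughly 11.8 kb is consistent with the 10 kb insert library and with the read lengths quoted for the RS II platform in the Introduction.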

Genome Annotation and Codon Usage
DOGMA [48] was used to annotate the cp genome, followed by manual corrections. The tRNA genes were identified using tRNAscan-SE [49]. The circular genome map was drawn using OGDRAW [50]. Codon usage and GC content were analysed using MEGA5 [51].

Genome Comparison and Repeat Analyses
mVISTA [52,53] was used to compare the cp genome of S. mussotii with two other cp genomes using the annotation of S. mussotii as a reference.
Repeats (forward, palindromic, reverse, and complement) and simple sequence repeats (SSRs) were identified using REPuter [54] and MISA, respectively, with the same parameters as described in Ni et al. [26].
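SSR detection of the kind performed by MISA amounts to scanning the sequence for perfect tandem repetitions of a 1-6 bp motif that meet a minimum repeat count for each motif length. The sketch below illustrates this with regular expressions. It is not MISA itself, and the thresholds in `MIN_REPEATS` are hypothetical placeholders: the text states only that the parameters follow Ni et al. [26] and does not list them.

```python
import re

# Hypothetical minimum repeat counts per motif length (bp). The actual
# thresholds used in the study are those of Ni et al. [26], not given here.
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. 'AA' so that poly-A runs are reported only once,
            # as mononucleotide repeats.
            if is_primitive(motif):
                hits.append(
                    (match.start(), motif, len(match.group(0)) // unit_len)
                )
    return sorted(hits)


# Toy sequence: a 10-bp poly-A run and five AT repeats (hypothetical).
toy = "GC" + "A" * 10 + "GC" + "AT" * 5 + "GG"
print(find_ssrs(toy))  # [(2, 'A', 10), (14, 'AT', 5)]
```

Counting the hits by motif length and by genomic region (LSC, SSC, IR, CDS) would give a tabulation of the kind summarised in Tables 6 and 7.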

Conclusions
This is the first study to analyse the complete cpDNA sequence of S. mussotii. The chloroplast genome structure and composition of S. mussotii are similar to those reported for other Gentianaceae. In addition, the distributions and locations of repeated sequences were determined. All of these repeats, together with the aforementioned SSRs, are informative sources for the exploration of new molecular markers. Studying the cp genome facilitates the identification of the optimal intergenic spacers for transgene integration and the development of site-specific cp transformation vectors in chloroplast genetic engineering. To date, many transgenes have been successfully introduced into the plastid genomes of the tobacco model species and of selected other important crop plants [55,56]. The feasibility of metabolic engineering in transgenic plastids has been demonstrated for several nutritionally important biochemical pathways, including carotenoid biosynthesis [57] and fatty acid biosynthesis [58,59]. With the details of the bioactive compound synthesis pathway in S. mussotii having been described [60], there is no doubt that plastid engineering holds great potential in secondary metabolic engineering to enhance the production of pharmaceutically active compounds.

Figure 1. Gene map of the S. mussotii chloroplast genome. Genes drawn inside the circle are transcribed clockwise, and those outside are counterclockwise. Genes are colour-coded based on the functional groups to which they belong. CDS: protein-coding regions.

Figure 2. Repeat sequences in six Gentianales chloroplast genomes. REPuter was used to identify repeat sequences with length ≥ 30 bp and sequence identity ≥ 90% in the chloroplast genomes. F, P, R, and C indicate the repeat types F (forward), P (palindrome), R (reverse), and C (complement), respectively. Repeats with different lengths are indicated in different colours.

Molecules 2016, 21, 1029

Figure 3. Comparison of three chloroplast genomes using mVISTA. Grey arrows and thick lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. A cut-off of 70% identity was used for the plots, and the y-axis represents the percent identity between 50% and 100%. Genome regions are colour-coded as protein-coding (exon), rRNA, tRNA, and conserved noncoding sequences (CNS).

Figure 4. Comparison of the borders of the LSC, SSC, and IR regions among thirteen chloroplast genomes. Ψ indicates a pseudogene. This figure is not to scale.

Table 1. Base composition in the S. mussotii chloroplast genome.

Table 2. Genes present in the S. mussotii chloroplast genome.

Table 3. The genes with introns in the S. mussotii chloroplast genome and the lengths of the exons and introns.
* The rps12 gene is a trans-spliced gene with the 5′ end located in the LSC region and the duplicated 3′ end in the IR region.

Table 4. The codon-anticodon recognition pattern and codon usage for the S. mussotii chloroplast genome.

Table 5. Repeat sequences and their distribution in the S. mussotii chloroplast genome.

Table 6. Simple sequence repeats in the S. mussotii chloroplast genome.

Table 7. Distribution of SSRs present in the Gentianales chloroplast genomes.