The Complete Chloroplast Genome of a Key Ancestor of Modern Roses, Rosa chinensis var. spontanea, and a Comparison with Congeneric Species

Rosa chinensis var. spontanea, an endemic and endangered plant of China, is one of the key ancestors of modern roses and a source of well-known traditional Chinese medicines for gynecological disorders such as irregular menses and dysmenorrhea. In this study, the complete chloroplast (cp) genome of R. chinensis var. spontanea was sequenced, analyzed, and compared to those of congeneric species. The cp genome of R. chinensis var. spontanea is a typical quadripartite circular molecule of 156,590 bp in length, including one large single copy (LSC) region of 85,910 bp and one small single copy (SSC) region of 18,762 bp, separated by two inverted repeat (IR) regions of 25,959 bp each. The GC content of the whole genome is 37.2%, while that of the IR, LSC, and SSC regions is 42.8%, 35.2%, and 31.2%, respectively. The genome encodes 129 genes, including 84 protein-coding genes (PCGs), 37 transfer RNA (tRNA) genes, and eight ribosomal RNA (rRNA) genes. Seventeen genes in the IR regions were found to be duplicated. Thirty-three forward and five inverted repeats were detected in the cp genome of R. chinensis var. spontanea. The genome is rich in simple sequence repeats (SSRs); in total, 85 SSRs were detected. A genome comparison revealed that IR contraction might be the reason for the relatively smaller cp genome size of R. chinensis var. spontanea compared to other congeneric species. Sequence analysis revealed that the LSC and SSC regions were more divergent than the IR regions within the genus Rosa and that a higher divergence occurred in non-coding regions than in coding regions. A phylogenetic analysis showed that the sampled species of the genus Rosa formed a monophyletic clade and that R. chinensis var. spontanea shared a more recent ancestor with R. lichiangensis of the section Synstylae than with R. odorata var. gigantea of the section Chinenses. This information will be useful for the conservation genetics of R. chinensis var. spontanea and for the phylogenetic study of the genus Rosa, and it might also facilitate the genetics and breeding of modern roses.


Introduction
Molecular data have suggested that Rosa chinensis Jacq. var. spontanea (Rehder et Wilson) Yü et Ku is the maternal parent of R. chinensis and the possible paternal parent of R. odorata (Andrews) Sweet [1], which contributed recurrent flowering, tea scent, and multiple floral colors to modern roses [2,3]. As one of the key ancestors of modern roses, R. chinensis var. spontanea is not only a precious germplasm [...]
The complete cp genome of R. chinensis var. spontanea represents a typical quadripartite circular molecule that is 156,590 bp in length. It is composed of an LSC region of 85,910 bp and an SSC region of 18,762 bp, separated by two IR regions of 25,959 bp each (Table 1 and Figure 1). The GC content of the total cp DNA sequence is 37.2%, similar to that of R. odorata (Andr.) Sweet var. gigantea (Crép) Rehd. et Wils. (KF753637) [13], R. praelucens Byhouwer (MG450565) [14], and R. roxburghii Tratt. (KX768420). The GC content of the IR regions is 42.8%, while the LSC and SSC regions exhibit lower GC content (35.2% and 31.2%, respectively) (Table 1). The complete cp genome includes 57.8% coding sequences (50.2% PCGs, 1.8% tRNAs, and 5.8% rRNAs) and 42.2% non-coding sequences (11.8% introns and 30.4% intergenic spacers). Among PCGs, the AT content of the first, second, and third codon positions is 54.7%, 62.5%, and 69.7%, respectively (Table 1). This bias towards a higher AT content at the third codon position is used to discriminate cp DNA from nuclear and mitochondrial DNA [15] and has been widely reported in other plant cp genomes [16-18].
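The region lengths reported above can be checked against the total genome length, and GC content can be computed per region once a sequence is at hand. The following Python sketch illustrates both calculations; the function names and the toy sequence are illustrative assumptions and are not part of the original analysis.

```python
def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total length of a quadripartite cp genome: LSC + SSC + two IR copies."""
    return lsc + ssc + 2 * ir


def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Region lengths (bp) as reported in the text.
total = quadripartite_length(lsc=85_910, ssc=18_762, ir=25_959)
print(total)  # 156590, matching the reported genome size

# Toy sequence (hypothetical, for illustration only): 4 of 10 bases are G or C.
print(gc_content("ATGCGCATTA"))  # 40.0
```

Applied to the LSC, SSC, and IR sequences separately, the same function would give the per-region values of the kind listed in Table 1.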
The cp genome of R. chinensis var. spontanea contains 129 genes, including 84 PCGs, 37 tRNAs, and eight rRNAs (Table S1). Six PCGs (ndhB, rpl2, rpl23, rps7, rps12, and ycf2), four rRNAs (rrn16, rrn23, rrn4.5, and rrn5), and seven tRNAs (trnA-UGC, trnI-CAU, trnI-GAU, trnL-CAA, trnN-GUU, trnR-ACG, and trnV-GAC) within the IR regions are completely duplicated. The LSC region contains 62 PCGs and 22 tRNAs. The SSC region contains one tRNA and 12 PCGs. Additionally, 14 genes, namely trnK-UUU, rps16, trnG-GCC, rpoC1, trnL-UAA, trnV-UAC, petB, rpl16, rpl2, ndhB, trnI-GAU, trnA-UGC, ndhA, and petD, contain one intron, whereas the ycf3, rps12, and clpP genes contain two introns. Although there are 17-20 group II introns within tRNA and protein-coding genes in land plant cp genomes [19], so far only the intron of trnL has been characterized as a group I intron in chloroplasts [20]. Thus, all these introns of R. chinensis var. spontanea, except the trnL-UAA intron, might be group II introns. The rps12 gene is trans-spliced in the cp genome of R. chinensis var. spontanea. The C-terminal exons 2 and 3 of rps12 are located in the IR regions. Exon 1 is 28,259 bp downstream of the nearest copy of exons 2 and 3 and 72,017 bp away from the distal copy (Table S1). The trnK-UUU gene has the largest intron (2498 bp), within which the matK gene is located. The matK gene encodes MatK, a maturase that is derived from reverse transcriptase and has been shown to be an essential splicing factor for both group I and group II introns [20,21].
Molecules 2018, 23, x FOR PEER REVIEW
Based on the sequences of PCGs and tRNAs, the frequency of codon usage of the cp genome of R. chinensis var. spontanea was estimated (Table 2). In total, 27,525 codons were found in all the coding sequences. Among these, leucine is the most frequent amino acid, representing 10.4% (2,871) of the total codons, while cysteine is the least frequent one with 1.2% (320) of the codons. A- and U-ending codons are common. Except for trnL-CAA, trnS-GGA and a stop codon (UAG), all types of preferred synonymous codons (RSCU > 1) ended with A or U.
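Relative synonymous codon usage (RSCU) is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so values above 1 mark preferred codons. The sketch below shows this calculation in Python under stated assumptions: it uses the codon assignments of the standard genetic code, takes in-frame coding sequences as plain strings, and uses hypothetical function names. The study itself obtained RSCU values with MEGA 6.06, not with this code.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} for a list of in-frame coding sequences.

    RSCU = observed codon count / (family total / number of synonymous codons).
    Amino acids that never occur in the input are left out of the result.
    """
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper().replace("U", "T")
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        family_total = sum(counts[c] for c in codons)
        if family_total == 0:
            continue
        expected = family_total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy CDS (hypothetical): Met, Leu(TTA), Leu(TTA), Leu(TTG), stop.
example = rscu(["ATGTTATTATTGTAA"])
print(example["TTA"], example["TTG"])  # 4.0 2.0
```

Within each amino acid family the RSCU values sum to the number of synonymous codons, which is a convenient sanity check on any implementation.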

Repeat and SSR Analysis
For the repeat structure analysis, 33 forward and five inverted repeats with a minimal repeat size of 20 bp were detected in the cp genome of R. chinensis var. spontanea (Table 3). Most of these repeats are between 20 and 30 bp. The longest forward repeat is 41 bp in length, located in the intergenic region between the genes psbE and petL. Most of the repeats were found in the LSC region. Among them, repeat No. 5 is related to trnS-GCU and trnS-UGA (Table 3). Repeat No. 7 is related to trnG-GCU and trnG-UCC. Repeat No. 13 is associated with psa genes. Six forward repeats were located in IR regions, including two repeats associated with ycf2 genes and one repeat related to the ndhB gene. In addition, there were several repeat pairs whose two copies are located in distinct regions, e.g., the two copies of repeats No. 16, 25, and 26 are located in gene introns of the LSC and SSC, respectively.
As chloroplast-specific SSRs are uniparentally inherited and exhibit a high level of intraspecific polymorphism, they are widely used in population genetics, species identification, and research on evolutionary processes in wild plants [22,23], and as markers for linkage map construction and the breeding of crop plants [24,25]. In total, 85 SSRs were identified in the cp genome of R. chinensis var. spontanea, most of which were detected in the LSC region (Table 4). Among them, 55 (64.7%) are mononucleotide SSRs, 10 (11.8%) are dinucleotide SSRs, seven (8.2%) are trinucleotide SSRs, 10 (11.8%) are tetranucleotide SSRs, one (1.2%) is a pentanucleotide SSR, and two (2.4%) are hexanucleotide SSRs. Only 22 SSRs are located in genes and the others are in the intergenic regions. Fifty-two (94.5%) of the mononucleotide SSRs belong to the A/T type, which is consistent with the hypothesis that cp SSRs are generally composed of short polyadenine (poly A) or polythymine (poly T) repeats and rarely contain tandem guanine (G) or cytosine (C) repeats. These cp SSR markers can be used in the conservation genetics of R. chinensis var. spontanea, as well as in both the linkage map construction and molecular-marker-assisted selection of modern roses.

Comparative Analysis of the Chloroplast Genomes of the Genus Rosa
The complete cp genome sequence of R. chinensis var. spontanea was compared to that of R. odorata var. gigantea [13], R. roxburghii (KX768420) and R. praelucens (MG450565) [14]. Rosa chinensis var. spontanea has the smallest cp genome with the smallest IR region (25,959 bp), while R. praelucens has the largest cp genome with the largest LSC, at 86,313 bp (Table S2). No significant differences were found in the sequence lengths of SSC among the four species. The main reason for the length differences in cp genomes of different rose species is the size variation of the LSC and IR regions (Table S2).
Sequence comparisons revealed that the LSC and the SSC regions were more divergent than the IR regions, and that higher divergence could be found in non-coding regions than in coding regions (Figure 2). Significant variations could be found in the coding regions of some genes, including rps19 and ycf1. The highest divergence in non-coding regions was found in the intergenic regions trnK-rps16, rps16-trnQ, trnS-trnG, trnR-atpA, atpF-atpH, rps2-rpoC2, rpoB-trnC, trnC-petN, trnT-psbD, psbZ-trnG, rps4-trnT, psbE-petL, trnP-psaJ, ndhF-rpl32, and ccsA-ndhD. The introns of rpl2, rps16, ndhA, trnV, and clpP were relatively highly divergent, too. These results might indicate that these regions evolve rapidly in the genus Rosa, as well as in other Rosaceae plants [26,27].
Figure 2. Complete chloroplast genome comparison of four rose species using the chloroplast genome of R. chinensis var. spontanea as a reference. The grey arrows and thick black lines above the alignment indicate the gene orientation. The y-axis represents the identity from 50% to 100%.

IR Contraction in the Chloroplast Genome of R. chinensis var. spontanea
Although IRs are the most conserved regions of the cp genomes, contraction and expansion at the borders of IR regions are common evolutionary events, and are hypothesized to be the main reason for size differences between cp genomes [28]. Detailed comparisons of the IR-SSC and IR-LSC boundaries among the cp genomes of the above four rose species are presented in Figure 3. IR regions are relatively highly conserved in the genus Rosa, but compared to other congeneric species, some position changes occurred in the IR/LSC regions of R. chinensis var. spontanea. The rpl2 gene in the cp genome shifted by 31 bp from IRb to LSC at the LSC/IRb border, and that gene also shifted by 31 bp from IRa to LSC at the IRa/LSC border, indicating IR contraction in the cp genome of this species. This contraction is mainly caused by fragment deletions in the intergenic regions rps12-trnV, rrn4.5-rrn5, and trnR-trnN, and leads to the relatively smaller size of its IR regions and consequently a smaller cp genome (Figure 3, Table S2).
Generally, the IRa/LSC border is located between the rpl2 and trnH genes in the rose family, with rpl2 in IRa and trnH in LSC [27], as in R. roxburghii and R. odorata var. gigantea. The trnH gene of R. praelucens extends only one bp from LSC to IRa, but its LSC region is much larger than that of the other species (Table S2). One 505 bp insertion in the intergenic region between the genes psbM and trnD was detected in the MAFFT alignment. This large insertion accounts for the largest LSC region in R. praelucens and thus the largest cp genome among these four rose species. The expansion and contraction of the IR region at the IR-SSC boundaries among these species were not significant. Accordingly, the expansion and contraction of IR regions at the IR/LSC borders, along with the large insertion/deletion in the LSC region, might be the main reason for the cp genome size variation in the genus Rosa.

Phylogenetic Analysis
There have been many attempts to reconstruct the phylogeny of the genus Rosa. Most of them suggested that the extant classification system was artificial [29,30] and that interspecies relationships within the genus remained ambiguous. The specific relationships within the sections Chinenses and Synstylae were still obscure due to limited sampling, low genetic variation of molecular markers, and complex evolutionary histories [31]. The availability of the complete cp genomes will provide additional informative data for the reconstruction of a robust phylogenetic model for the rose species. The phylogenetic tree (Figure 4) based on the LSC, SSC, and one-IR regions in the cp genomes of 22 species from Rosaceae showed that species from Rosaceae were monophyletic and that the intra-family relationships were largely consistent with those found by Zhang et al. [32]. Species from the genus Rosa formed a monophyletic clade with 100% support. The representative of the subgenus Hulthemia, R. persica Michx. [33,34], was sister to the clade composed of the other five rose species, supporting the subgenus status of Hulthemia. In the subgenus Rosa, R. chinensis var. spontanea from section Chinenses was sister to R. lichiangensis from section Synstylae, and then clustered with another species from section Chinenses, R. odorata var. gigantea, confirming that Rosa sections Chinenses and Synstylae, defined in the traditional taxonomic system, shared a more recent ancestor and could be merged as one section in the genus Rosa [30].

DNA Sequencing and Chloroplast Genome Assembly
Dry leaves of R. chinensis var. spontanea collected from Yichang, Hubei (111°10′ E, 30°47′ N, 400 m) were used to extract the total genomic DNA. A shotgun library was prepared and sequenced using the Illumina HiSeq 2000 (Illumina, CA, USA) at Novogene (Beijing, China). Approximately 3.68 Gb of raw data consisting of 150 bp paired-end reads were generated. The raw reads were filtered to obtain high-quality clean reads by using NGS QC Toolkit v2.3.3 with default parameters [35]. The cp genome was de novo assembled using the GetOrganelle pipeline (https://github.com/Kinggerm/GetOrganelle).

Gene Annotation and Sequence Analysis
The genome was automatically annotated by using the CpGAVAS pipeline [36]. The annotation was adjusted and confirmed in Geneious 8.1 [37]. The sequence data were deposited in GenBank under the accession number MG523859. The circular cp map of R. chinensis var. spontanea was generated by OGDRAW [38]. Codon usage analysis, calculation of relative synonymous codon usage (RSCU) values, and measurement of AT content were carried out by using MEGA 6.06 [39].

Genome Comparison
MUMmer [40] was used to perform pairwise sequence alignments of cp genomes. The mVISTA [41] program was applied to compare the complete cp genome of R. chinensis var. spontanea to the other published cp genomes of its congeneric species, i.e., R. odorata var. gigantea, R. roxburghii, and R. praelucens, using the Shuffle-LAGAN mode [42] with the annotation of R. chinensis var. spontanea as the reference.

Repeats and Simple Sequence Repeats (SSRs)
REPuter [43] was used to find forward and inverted tandem repeats ≥ 20 bp with a minimum alignment score and maximum period size of 100 and 500, respectively. The minimum identity of repeats was limited to 85% (Hamming distance of 3). IMEx [44] was used to identify SSRs with the minimum repeat number set to 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexanucleotides, respectively.
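The repeat-number thresholds above can be illustrated with a small regular-expression search for perfect SSRs. This Python sketch is a simplified stand-in for IMEx, not a reimplementation of it: it only finds perfect (uninterrupted) repeats, scans each motif length independently, and uses hypothetical names throughout.

```python
import re

# Minimum repeat numbers per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Coordinates are 0-based and end-exclusive.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves built from a shorter unit
            # (e.g. "AA" or "ATAT"); the shorter motif length reports those.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            length = match.end() - match.start()
            hits.append((match.start(), match.end(), motif, length // unit_len))
    return sorted(hits)


# Toy sequence (hypothetical): an A12 run, an (AT)6 run, and an (AGC)4 run.
toy = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "GG" + "AGC" * 4 + "T"
for hit in find_ssrs(toy):
    print(hit)
```

On the toy sequence this reports one mono-, one di-, and one trinucleotide SSR. Tallying hits by motif length on a real genome would give a breakdown of the kind shown in Table 4, although counts may differ from IMEx because of its handling of imperfect and compound repeats.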

Phylogenetic Analysis
To identify the phylogenetic position of R. chinensis var. spontanea in Rosa, 21 published cp genomes of Rosaceae were used to construct a phylogenetic tree, using Berchemiella wilsonii (C. K. Schneid.) Nakai (Rhamnaceae) as the outgroup. The LSC, SSC, and one-IR regions of the total 23 cp genomes were all aligned using MAFFT 7.308 [45]. The maximum likelihood (ML) tree was reconstructed by RAxML 8.2.11 [46] with the GTR + Gamma nucleotide substitution model; node support was assessed by a bootstrap analysis with 1000 replicates.
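Because the two IR copies carry the same information, only one copy is kept before alignment. The Python sketch below shows one way such a one-IR sequence could be prepared; it assumes the genome is linearized starting at the LSC with the region order LSC, IRb, SSC, IRa, and that the region lengths are already known from the annotation. The function names and this procedure are illustrative assumptions, as the text does not describe how the regions were extracted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def split_quadripartite(genome, lsc_len, ir_len, ssc_len):
    """Split a genome into its four regions (assumed order: LSC, IRb, SSC, IRa)."""
    if len(genome) != lsc_len + ssc_len + 2 * ir_len:
        raise ValueError("region lengths do not add up to the genome length")
    genome = genome.upper()
    irb_end = lsc_len + ir_len
    ssc_end = irb_end + ssc_len
    regions = {
        "LSC": genome[:lsc_len],
        "IRb": genome[lsc_len:irb_end],
        "SSC": genome[irb_end:ssc_end],
        "IRa": genome[ssc_end:],
    }
    # Strict check: this sketch requires the IR copies to be exact
    # reverse complements of each other.
    if regions["IRa"] != reverse_complement(regions["IRb"]):
        raise ValueError("IR copies are not exact reverse complements")
    return regions


def one_ir_sequence(genome, lsc_len, ir_len, ssc_len):
    """Concatenate LSC, one IR copy (IRb), and SSC, dropping the second IR copy."""
    r = split_quadripartite(genome, lsc_len, ir_len, ssc_len)
    return r["LSC"] + r["IRb"] + r["SSC"]


# Toy genome (hypothetical): 10 bp LSC, 7 bp IR, 5 bp SSC.
lsc, irb, ssc = "ATGGCCTTAA", "GATTACA", "CCGTA"
toy_genome = lsc + irb + ssc + reverse_complement(irb)
print(one_ir_sequence(toy_genome, lsc_len=10, ir_len=7, ssc_len=5))
```

The resulting sequences, one per species, would then be passed to the alignment step.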

Conclusions
In this study, we report and analyze the first complete cp genome of R. chinensis var. spontanea, one of the key ancestors of modern roses and a source of well-known traditional Chinese medicines for gynecological disorders. Compared to the cp genomes of other rose species, the cp genome of R. chinensis var. spontanea is the smallest, most likely due to the contraction of the IR regions by 31 bp at each IR/LSC border. The cp genome of R. chinensis var. spontanea is rich in SSRs, which are valuable sources for developing new molecular markers. Our phylogenetic analysis showed that the sampled species of the genus Rosa formed a monophyletic clade. Rosa chinensis var. spontanea shared a more recent ancestor with R. lichiangensis of the section Synstylae than with R. odorata var. gigantea of the section Chinenses. This supports the hypothesis that Rosa sections Chinenses and Synstylae, as defined in the traditional taxonomic system, are closely related and could be merged into a single section within the genus Rosa. This information will be useful for the conservation genetics of R. chinensis var. spontanea and the phylogenetic study of the genus Rosa, and might also facilitate the genetics and breeding of modern roses.
Supplementary Materials: Supplementary materials are available online.