Complete Chloroplast Genome Sequence from Rosa lucieae and Its Characteristics

Rosa lucieae Franch. & Rochebr. ex Crép. is one of the famous wild ancestors of cultivated roses and plays an important role in horticultural research, but its chloroplast genome has received little study. In this study, we sequenced R. lucieae on the Illumina MiSeq platform, assembled and annotated its chloroplast genome, and carried out comparative genomics, selection pressure analysis and phylogenetic analysis together with 12 other Rosa chloroplast genomes. The R. lucieae cpDNA sequence is 156,504 bp long, and 130 genes were annotated. The 13 chloroplast genomes studied range from 156,333 to 157,385 bp in length, and their gene content, gene order, GC content and IR boundary structure are highly similar. Five kinds of large repeats were detected, numbering 100 to 116 per genome, and the number of SSRs ranged from 78 to 90. Four highly divergent regions were identified that can serve as potential genetic markers for Rosa. Selection pressure analysis showed significant positive selection in 18 genes. Phylogenetic analysis showed that R. lucieae is most closely related to R. cymosa, R. maximowicziana, R. multiflora and R. pricei. Overall, our results provide a more comprehensive understanding of the systematic genomics and comparative genomics of Rosa.


Introduction
Rosa lucieae Franch. & Rochebr. ex Crép. is a perennial woody vine of the genus Rosa in the family Rosaceae. R. lucieae is synonymous with R. luciae (Jeon et al. 2019). An additional synonym is R. wichuriana Crépin (http://www.floraofalabama.org), now revised to R. wichurana (http://www.iplant.cn), one of the most famous wild ancestors of cultivated roses (Debener et al. 2009). Because of its bright leaves, dense flowers, long flowering period and pleasant aroma, R. lucieae plays an important role in horticultural research, especially in breeding, and many horticultural varieties have been derived from it (Lv 2013).
Rosa is a large genus in Rosaceae, with many species, varieties and cultivars. There are approximately 256 species in the genus, including 95 species in China, of which 65 are endemic. China is the modern center of distribution for the genus (http://www.iplant.cn). Many Rosa species are strongly stress resistant and can survive in harsh conditions, so they are often used as constructive species for ecological and vegetation restoration. At present, there are few reports on the classification and phylogenetic relationships of Rosa based on the chloroplast genome. Studying the phylogenetic relationships of Rosa is important for the protection, introduction, development and utilization of Rosa resources, and is also significant for the classification, phylogeny and conservation of genetic diversity of the genus. In future research, it will be necessary to progressively sequence the plastid and nuclear genomes of Rosa species and build a more complete phylogenetic tree to clarify the relationships between species in the genus.
Chloroplasts are generally found in mesophyll cells and in some cells of the young stems of higher plants, as well as in algal cells. They are very important organelles that carry their own genetic information and replicate semiconservatively (Xing et al. 2008). The chloroplast genome consists of four regions: two inverted repeat regions (IRs), a large single-copy region (LSC) and a small single-copy region (SSC), joined into a covalently closed circular double-stranded molecule (Raubeson et al. 2005; Jansen et al. 2012). The chloroplast genome encodes many key proteins in photosynthesis and other metabolic processes (Daniell et al. 2016). Because the genome is short, has a small molecular weight, is highly conserved in sequence, is easy to extract and purify, and contains many SSR sites, studying its structure and sequence is of great value in revealing species' origins, evolution and interspecific genetic relationships (Xing et al. 2008; Liang et al. 2021).
In recent years, molecular technology has developed rapidly. Molecular methods are now widely used in studies of plant evolution and phylogeny, and chloroplast genome sequencing has attracted much attention (Day et al. 2014). Researchers have analyzed an increasing number of chloroplast genome sequences. Li et al. characterized the chloroplast genome and codon usage preference of Prunus sargentii Rehder. Dong et al. (2019) and Qu et al. (2021) analyzed the characteristics of the chloroplast genome and codon usage bias of Eriobotrya fragrans Champ. ex Benth., providing a reference for future research on the evolution and origin of Eriobotrya genes and on vector construction for transformation systems. Su et al. (2021) sequenced the chloroplast genome of Lactuca tatarica (L.) and analyzed its characteristics and phylogenetic relationships. These results provide new evidence and a material foundation for species identification, phylogeny and resource development and utilization of Mulgedium. In addition, similar results have been reported for Rubus (Yu et al. 2022), Geum (Li et al. 2020; Zhang et al. 2022), Anacardiaceae, Platanus (Moore et al. 2006), Araceae (Bayly et al. 2013) and other species.
The R. lucieae chloroplast genome has not been fully analyzed. Matsumoto et al. (1998) constructed a maximum likelihood phylogenetic tree for Rosa from matK sequences, and the molecular classification conformed closely to the traditional botanical classification. However, the bootstrap support of the tree was relatively low, ranging from only 51% to 95%. Jeon et al. (2019) assembled the chloroplast genomes of R. multiflora, R. maximowicziana and R. lucieae to compare the genomic characteristics of Sect. Synstylae of subgen. Rosa with those of other groups within the genus. However, the phylogenetic relationships among these three species could not be inferred because the branch lengths within this group were short and the support values were low. The maximum likelihood (ML) tree constructed by Gao et al. (2020) shows that R. lucieae is closely related to R. maximowicziana, and Zhao et al. (2020) reported the same result.
Here, we use Illumina sequencing to characterize the complete chloroplast genome sequence and codon usage of R. lucieae, and we compare the repeat sequences and SSRs, IR boundaries, nucleotide variability and positive selection among the chloroplast genomes of several Rosa species. Our aims are to provide a molecular basis for R. lucieae chloroplast genome research and genetic improvement, to clarify the phylogenetic relationships between R. lucieae and other Rosa species, and to provide genomic information for further research on the phylogeny and kinship of Rosa.

Taxon Sampling
Fresh, young, healthy leaves of R. lucieae were collected from Xishuangbanna Tropical Botanical Garden, Chinese Academy of Sciences, wrapped in tin foil, quickly frozen in liquid nitrogen and stored at -80 °C until use.

DNA Extraction and Sequencing
Total genomic DNA was extracted using the modified CTAB method (Doyle et al. 1987), and R. lucieae chloroplast genome sequencing was performed on the Illumina sequencing platform by Annoroad Gene Technology Co., Ltd., Beijing, China.

Chloroplast Genome Assembly, Gene Annotation and Relative Synonymous Codon Usage
The sequencing data were filtered and screened. The complete chloroplast genome was assembled using GetOrganelle, and the assembly was checked and corrected with Bandage (Wick et al. 2015). The R. lucieae chloroplast genome (GenBank Accession: MN689791) was downloaded from GenBank as a reference sequence, and Geneious R8.1.3 (Kearse et al. 2012) was used to annotate and manually correct the chloroplast genome of R. lucieae. OrganellarGenomeDRAW (OGDRAW) v1.3.1 (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) (Greiner et al. 2019) was used to draw the physical map of the genome. The assembled and annotated chloroplast genome of R. lucieae was deposited in GenBank (Accession: OK938394). For the relative synonymous codon usage (RSCU) analysis, duplicated genes, sequences shorter than 300 bp and sequences with internal stop codons were removed from the 85 CDSs (coding DNA sequences) to reduce error. Finally, 53 gene sequences with AUG as the start codon and UAA, UAG or UGA as the stop codon were selected for analysis using CodonW 1.4.2 (http://codonw.sourceforge.net).
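As an illustration of the codon usage step, the following minimal sketch applies the CDS filter described above and computes RSCU as the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. It is not the CodonW implementation; the function names `passes_filter` and `rscu` are hypothetical, the standard genetic code is assumed, and sequences are written in DNA letters (ATG for AUG).

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (assumed here for plastid CDSs).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def passes_filter(cds, min_len=300):
    """Keep a CDS only if it is long enough, starts with ATG, ends with a
    stop codon and has no internal stop codon."""
    cds = cds.upper()
    if len(cds) < min_len or len(cds) % 3:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or CODON_TABLE.get(codons[-1]) != "*":
        return False
    # Unknown codons (ambiguous bases) are treated like stops and rejected.
    return all(CODON_TABLE.get(c, "*") != "*" for c in codons[:-1])


def rscu(cds_list):
    """Return {codon: RSCU} pooled over all sequences in cds_list."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        # Count every codon except the terminal stop codon.
        counts.update(cds[i:i + 3] for i in range(0, len(cds) - 3, 3))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    result = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the data set
        for c in codons:
            result[c] = counts[c] * len(codons) / total
    return result
```

An RSCU value above 1 indicates that a codon is used more often than expected under equal usage of synonymous codons, and a value below 1 indicates that it is used less often.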

Contraction and Expansion of IRs
Twelve Rosa species closely related to R. lucieae were selected for analysis of IR boundary contraction and expansion. The IR boundary comparison map was drawn using the IRscope online program (https://irscope.shinyapps.io/irapp/) (Amiryousefi et al. 2018) with default parameters.

Sliding Window Analysis
The chloroplast genome sequences were aligned using MAFFT v7.129 (Katoh et al. 2013), and DnaSP v6.12.03 (Rozas et al. 2017) was used to conduct sliding window analyses and determine the nucleotide diversity (Pi) of the 13 closely related chloroplast genome sequences (R. lucieae and its 12 close relatives) and of all 28 chloroplast genome sequences, with a window length of 600 bp and a step size of 200 bp.
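The sliding window calculation can be sketched as follows. Nucleotide diversity in a window is taken here as the mean number of pairwise differences per site over all pairs of sequences. This is a simplified illustration rather than the DnaSP algorithm: the function names are hypothetical, and alignment columns containing gaps or ambiguous bases are simply skipped, which is an assumption about gap handling.

```python
from collections import Counter


def window_pi(alignment, start, end):
    """Mean pairwise differences per site for columns start..end-1."""
    n = len(alignment)
    n_pairs = n * (n - 1) / 2
    diff_sum = 0.0
    valid_sites = 0
    for col in range(start, end):
        column = [seq[col].upper() for seq in alignment]
        if any(base not in "ACGT" for base in column):
            continue  # assumption: drop columns with gaps or ambiguity codes
        valid_sites += 1
        counts = Counter(column)
        # Number of sequence pairs that differ at this column.
        differing = (n * n - sum(c * c for c in counts.values())) / 2
        diff_sum += differing / n_pairs
    return diff_sum / valid_sites if valid_sites else 0.0


def sliding_pi(alignment, window=600, step=200):
    """Return a list of (window midpoint, pi) along an aligned set of
    equal-length sequences."""
    length = len(alignment[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        results.append(((start + end) // 2, window_pi(alignment, start, end)))
    return results
```

Windows whose value stands well above the genome-wide mean are the candidates for highly variable regions.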

Positive selection analysis
Twenty-eight Rosa chloroplast genome sequences were used to detect positively selected sites in genes. PhyloSuite v1.2.1 was used to extract the CDSs and to align each CDS with the MAFFT plug-in. Each aligned CDS was checked individually, and small errors were adjusted manually. The corrected CDS alignments were then concatenated into a supermatrix and exported as a FASTA file. The BI tree was built on the CIPRES online portal (https://www.phylo.org/portal2/login!input.action) (Miller et al. 2010), and the tree file was exported in Newick format using FigTree v1.4.3 (http://tree.bio.ed.ac.uk/publications/). EasyCodeML v1.21 was used to perform positive selection analysis with the site model in preset mode.
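The concatenation step can be illustrated with a short sketch that joins per-gene alignments into one supermatrix and writes it in FASTA format. This is not PhyloSuite code; the function names are hypothetical, and padding a taxon that lacks a gene with gap characters is an assumption made for the example.

```python
def concatenate(gene_alignments, taxa):
    """Join per-gene alignments ({taxon: aligned sequence}) in the given
    order into one supermatrix ({taxon: concatenated sequence})."""
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            # Assumption: a taxon without this gene is filled with gaps.
            parts[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def to_fasta(seqs, line_width=60):
    """Format {name: sequence} as FASTA text."""
    lines = []
    for name, seq in seqs.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + line_width] for i in range(0, len(seq), line_width))
    return "\n".join(lines) + "\n"
```

Keeping the gene order fixed for every taxon is what makes the columns of the supermatrix homologous across species.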

Phylogenetic Analyses
To reconstruct the phylogenetic relationships among Rosa species, a total of 27 plastid genome sequences were downloaded from GenBank, and 2 species of Geum were selected as outgroups (Table 1). Phylogenetic trees were constructed using the maximum likelihood (ML) and Bayesian inference (BI) methods. Sequences were aligned using MAFFT version 7 (Katoh et al. 2013), and the alignment was corrected in BioEdit (Hall et al. 1999). ML analysis was performed using IQ-TREE v1.6.1 (Nguyen et al. 2015). In interpreting the ML tree, support values of 70% and above were considered well supported and values of 50% and below poorly supported. MrBayes (version 3.2.6) was used for Bayesian inference (Ronquist et al. 2003). jModelTest (version 2.1.10) (Darriba et al. 2012)

Repeat sequence and SSR analysis
Six types of SSRs (mono-, di-, tri-, tetra-, penta- and hexanucleotide repeats) were detected by MISA in the 13 closely related Rosa species (Fig. 3A). R. lucieae contains 86 SSRs, and the other 12 Rosa species contain 78 to 90. Mononucleotide repeats are the most abundant type, ranging from 44 in R. banksiae to 56 in R. sterilis, followed by dinucleotide, tetranucleotide, trinucleotide, hexanucleotide and pentanucleotide repeats. Most SSRs are located in the LSC region, followed by the IR and SSC regions (Fig. 3B). Of the 86 SSRs in R. lucieae, mononucleotide A and T repeats are the most frequent, accounting for 59.3%, followed by tetranucleotide repeats (13.95%) and dinucleotide repeats (12.79%); there is only one pentanucleotide repeat (Fig. 3C).
The repeats of the 13 Rosa species were also analyzed. A total of 51 tandem repeats and 50 dispersed repeats were found in R. lucieae; the dispersed repeats comprise 18 forward (F), 15 reverse (R), 16 palindromic (P) and 1 complementary (C) repeat (Fig. 3D). Among the other 12 Rosa species, 100 to 116 repeats were detected; all contain the five types of repeats except R. minutifolia and R. odorata, which lack complementary repeats. Tandem repeats are the most numerous type and are mainly distributed in the LSC region, followed by the IR and SSC regions (Fig. 3E). Of the 51 tandem repeats, 6 are located in exons, 2 in introns and 43 in intergenic regions, accounting for 11.8%, 3.9% and 84.3%, respectively (Fig. 3F); 28 are located in the LSC region, 4 in the SSC region and 19 in the IR regions, accounting for 54.9%, 7.8% and 37.3%, respectively (Fig. 3G).
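To make the SSR classes concrete, the sketch below scans a sequence for perfect repeats of 1 to 6 bp units with a regular expression. It is not MISA itself, and the minimum repeat numbers (10, 5, 4, 3, 3 and 3 for mono- to hexanucleotide units) are assumed for illustration only, since the thresholds used in this study are not stated. The names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical.

```python
import re

# Assumed thresholds: unit length -> minimum number of repeats.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit
    (so 'AA' or 'ATAT' are not reported as di- or tetranucleotide SSRs)."""
    return not any(
        len(unit) % k == 0 and unit == unit[:k] * (len(unit) // k)
        for k in range(1, len(unit))
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (0-based start, unit, repeat count) tuples."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # One unit of length k followed by at least n - 1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):
                hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)
```

Counting the hits by unit length, or by the genome region in which the start position falls, gives tallies of the kind summarized in Fig. 3A and Fig. 3B.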

Inverted Repeat Contraction and Expansion Analysis
Comparison of the IR/SC boundaries of the 13 Rosa chloroplast genomes shows that the boundaries are highly similar and that the boundary genes are consistent (Fig. 4). The boundary gene between IRb and LSC is rpl2, and the boundary gene between SSC and both IRa and IRb is ycf1. The ycf1 gene of R. lucieae does not cross the IRb/SSC boundary, whereas it does in the other species. Overall, the length and structure of the IR regions in the genomes of the 13 Rosa species are similar.

Sliding Window Analysis
DnaSP 6.0 was used to calculate the nucleotide variation value (π) in 600 bp windows across the chloroplast genomes of R. lucieae and 12 closely related species (R. sterilis, R. roxburghii, R. lucidissima, R. laevigata, R. filipes, R. chinensis, R. banksiae, R. pricei, R. odorata, R. maximowicziana, R. cymosa and R. minutifolia). Among the thirteen Rosa species, π varied from 0 to 0.00936, with an average of 0.00181, suggesting that their genomic differences are small. However, four highly variable loci with much higher π values (π > 0.007) were precisely located: trnK (UUU), rps16-trnQ (UUG), trnT (UGU)-trnL (UAA) and ycf1 (Fig. 5A). Among the twenty-eight Rosa sequences and the two Geum sequences, the π values varied from 0 to 0.01166 with a mean of 0.00284, indicating that the differences among Rosaceae species are larger than those between congeneric species. Four highly variable loci included rps16-trnQ (UUG), trnT (UGU)-trnL (UAA), psbE-petL and ycf1.

Phylogenetic Analysis
Two chloroplast genome sequences of Geum (Rosaceae) were selected as outgroups and combined with 28 chloroplast genome sequences of Rosa to construct phylogenetic trees using IQ-TREE (Fig. 6). The phylogenetic relationships indicate that R. lucieae is closely related to R. maximowicziana, R. multiflora, R. cymosa and R. pricei, which belong to Sect. Synstylae and Sect. Banksianae. The next closest group consists of R. odorata and its varieties. In addition, R. roxburghii and R. banksiae each form independent branches, R. praelucens, R. davurica, R. acicularis, R. kokanica, R. hybrid, R. minutifolia and R. rugosa cluster together in one branch, and R. xanthina forms a separate branch. The topology of the ML tree was basically consistent with that of the BI tree, but the branch support values of the BI tree were higher, so the BI tree was taken as the main result (Supplementary Fig. S1). The topology of the BI tree constructed from the CDSs of the 28 sequences is also basically the same (Supplementary Fig. S2).

Discussion
Comparison of cp genomes in the Rosa species
This study describes the chloroplast genome of R. lucieae, an ancient ornamental vine. Its quantitative characteristics are similar to those of other reported Rosa species (Table 1). The largest number of annotated genes in a Rosa chloroplast genome was 140 (R. cymosa, MT471268; R. laevigata var. leiocarpa, NC_047418), with the number of CDSs also reaching a maximum of 92. Of all annotated genes, ycf15 was annotated only in R. multiflora (NC_039989), R. filipes (NC_053856) and R. cymosa (NC_051550), and ycf68 was annotated only in R. multiflora (NC_039989) and R. cymosa (NC_051550) (Jeon et al. 2019; Wang et al. 2021; Ding et al. 2020). Lu et al. (2017) and Raubeson et al. (2007) discussed whether ycf15 and ycf68 are pseudogenes or protein-coding genes. In R. lucieae, these two genes are short, so they were not annotated. In the analysis of IR/SC boundaries, the ycf1 and ycf2 genes are located at the junctions of the IR regions with the LSC and SSC regions and show the same incomplete duplication observed in other studies (Li et al. 2013; Song et al. 2015).
These results are consistent with most other studies. The codons of the genes in the R. lucieae chloroplast genome mostly end with A or U, showing a usage bias like that reported in Medicago truncatula (Yang et al. 2015), Pinus massoniana (Ye et al. 2018) and Dalbergia odorifera (Yuan et al. 2021). This shows that codon preference is similar in some respects among different species.

Sliding Window Analysis
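In addition to random genetic variation events, some mutations cluster in highly variable regions of the genome, known as mutational hotspots (Shaw et al. 2007). Four highly variable sites were detected in 13 closely related Rosa species, and five highly variable regions were detected in 28 chloroplast genome sequences of 22 Rosa species. Three regions were detected in both analyses, namely rps16-trnQ (UUG), trnT (UGU)-trnL (UAA) and ycf1. Jeon et al. (2019) detected six highly variable regions in their study of chloroplast genome mutation hotspots in Rosa, two of which, rps16-trnQ (UUG) and ycf1, are consistent with the results of this study. Our results are also similar to those of Jeon et al. (0.7% and 0.6%) in terms of nucleotide variation. These highly variable loci can be used as DNA barcodes for Rosa and for phylogenetic studies at the species level.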

Positive selection analysis
The rates of nonsynonymous substitution (Ka) and synonymous substitution (Ks) and their ratio (Ka/Ks, equivalent to dN/dS) have been used to assess natural selection pressure and the rate of nucleotide evolution (Ninio et al. 1984; Yang et al. 2000). In this study, the genes found to contain positively selected sites were the ATP synthase gene (atpF), the maturase K gene (matK), NADH dehydrogenase genes (ndhD, ndhH, ndhJ, ndhK), a cytochrome b/f complex gene (petB), a photosystem I gene (psaA), photosystem II genes (psbA, psbB, psbC), the Rubisco large subunit gene (rbcL), large-subunit (LSU) ribosomal protein genes (rpl20, rpl23), an RNA polymerase gene (rpoA) and hypothetical chloroplast reading frames (ycf1, ycf2, ycf4). Amino acid changes at sites under selection pressure can drive evolution within a specific lineage (Nawae et al. 2020). Under positive selection, favorable amino acid changes increase plant adaptation to ecological habitats (Sen et al. 2011). Compared with studies of other genera, positive selection was found at many sites and in many genes in Rosa (Rono et al. 2020; Sheng et al. 2021; Huang et al. 2020; Xie et al. 2018). We speculate that this is because most Rosa plants are popular ornamentals: to obtain better characteristics, such as color and taste, they have undergone many introductions and hybridizations. The unusually high level of positive selection is a form of genetic change in adaptation to diverse climatic and environmental conditions (https://www.britannica.com). Many of the positively selected genes found in this study have also been reported to be under positive selection in other plants and to be involved in adaptive evolution. These include matK, atpF, psbA, ycf, ycf2, and rbcL (Bock et al. 2014).
For example, several studies have found that adaptive evolution of the rbcL gene is related to photosynthetic performance under changes in temperature, drought and carbon dioxide concentration (Sheng et al. 2021; Galmes et al. 2014; Kapralov et al. 2012). The findings of this study are consistent with those studies, and nine positively selected sites were found in the rbcL gene. The other two genes with more positively selected sites, ycf2 and ycf1, play a key role in cell viability (Drescher et al. 2000). Kikuchi et al. (2013) observed that the ycf1 gene is involved in the formation of the inner envelope membrane complex for protein transport. In addition, positive selection of the photosynthetic genes rbcL, ndh and psb has been related to the adaptation of rice to different sunlight levels. We speculate that positive selection of the same genes in Rosa is also related to sunlight level. These results provide reference data for studying the adaptive evolution of Rosa plants.

Phylogenetic Analysis
According to the Flora of China Crep. This shows that the genetic relationships obtained from traditional plant classification differ from those based on DNA. The latter approach analyzes the genetic variation of plastid genome sequences to infer evolution among plant groups and explore their phylogenetic relationships, and it plays an important role in revealing plant systematics and evolution (Zhu 2014). The phylogenetic tree shows that R. lucieae (MG727864) is closely related to R. maximowicziana, which is consistent with the results of Zhao et al. (2020) and Gao et al. (2020).

Conclusions
In this study, the complete chloroplast genome of R. lucieae was sequenced and assembled, and a physical map of the genome was obtained. The repetitive sequences, IR boundaries, codons and SNPs of the chloroplast genomes of 13 closely related Rosa species were compared and analyzed. Positive selection analysis of 28 Rosa chloroplast genome sequences was carried out, and a phylogenetic tree was constructed to clarify the genetic relationships of R. lucieae within Rosa. These results provide new references for species identification, marker development and utilization, genetic breeding and phylogenetic evolution of R. lucieae, and provide a more comprehensive understanding of the systematic genomics and comparative genomics of Rosa.

Figure 1 Gene map of the chloroplast genome of R. lucieae