Molecular Structure and Phylogenetic Analyses of Complete Chloroplast Genomes of Two Aristolochia Medicinal Species

The family Aristolochiaceae, comprising about 600 species of eight genera, is a unique plant family containing aristolochic acids (AAs). The complete chloroplast genome sequences of Aristolochia debilis and Aristolochia contorta are reported here. The results show that the complete chloroplast genomes of A. debilis and A. contorta comprise circular 159,793 and 160,576 bp-long molecules, respectively and have typical quadripartite structures. The GC contents of both species were 38.3% each. A total of 131 genes were identified in each genome including 85 protein-coding genes, 37 tRNA genes, eight rRNA genes and one pseudogene (ycf1). The simple-sequence repeat sequences mainly comprise A/T mononucletide repeats. Phylogenetic analyses using maximum parsimony (MP) revealed that A. debilis and A. contorta had a close phylogenetic relationship with species of the family Piperaceae, as well as Laurales and Magnoliales. The data obtained in this study will be beneficial for further investigations on A. debilis and A. contorta from the aspect of evolution, and chloroplast genetic engineering.


Introduction
The traditional Chinese medicine plants, Aristolochia debilis and Aristolochia contorta, are herbaceous climbers in the family Aristolochiaceae. Aristolochiae fructus originates from the mellow fruit of the two species, while Aristolochiae herba originates from their dried aerial parts. Aristolochiae fructus and Aristolochiae herba have been recorded as traditional herbal medicines which can clear lung-heat to stop coughing and activate meridians to stop pain, respectively [1]. Modern pharmacology studies have shown that the primary chemical constituents of the two species are aristolochic acid analogues including aristolochic acids (AAs) and aristolactams (ALs) [2,3]. AAs and ALs have been found among species from the family Aristolochiaceae [4]. Previous researches have revealed that AAs are able to react with DNA to form covalent dA-aristolactam (dA-AL) and dG-aristolactam (dG-AL) adducts [5,6]. With further research, current evidence from studies of AAs has demonstrated that AAs can cause nephrotoxicity, carcinogenicity, and mutagenicity [7][8][9][10], especially after prolonged low-dose Both species displayed a typical quadripartite structure, and the corresponding regions were of similar lengths. The complete chloroplast genome of A. debilis is a circular molecule of 159,793 bp in length comprising a large single-copy (LSC) region of 89,609 bp and a small single-copy (SSC) region of 19,834 bp separated by a pair of inverted repeats (IRs), each 25,175 bp in length ( Figure 1, Table 1). The complete chloroplast genome of A. contorta is 160,576 bp in length, which is divided into one LSC (89,781 bp), one SSC (19,877 bp) and two IRs, each 25,459 bp in length ( Figure 2, Table 1). Table 1. Base composition in the chloroplast genomes of A. debilis and A. contorta.

Species Regions Positions T(U) (%) C (%) A (%) G (%) Length (bp)
A. contorta    The analysis results revealed that both species had a GC content of 38.3%. However, this was unevenly distributed across the whole chloroplast genome. In both species, the GC content exhibited the highest values of the IR regions across the complete chloroplast genome, 43.4% both in A. debilis and A. contorta. The high GC content in the IR regions was the result of four rRNA genes (rrn16, rrn23, rrn4.5 and rrn5) that occur in this region [31]. In addition, the LSC regions have GC contents of 36.6% and 35.5%, as well as the lowest values of 32.9% and 32.8% are seen in SSC regions in A. debilis and A. contorta, respectively. Within the protein-coding regions (CDS) of chloroplast genome of A. debilis, the percentage of AT content for the first, second and third codon positions were 54%, 61.4% and 68%, respectively (Table 1). A bias towards a higher AT representation at the third codon position has also been observed in other land plant chloroplast genomes [32][33][34].

LSC
A total of 131 genes were identified from each genome including 85 protein-coding genes, 37 tRNAs, eight rRNAs, and one pseudogene (ycf1) ( Table 2). The functional ycf1 copy existed encompassing IR-SSC boundary and the other pseudogene ycf1 copy was on the other IR region. Six protein-coding genes, seven tRNA genes, and all rRNA genes were duplicated in the IR regions. Coding regions including protein-coding genes (CDS), tRNAs, and rRNAs constituted 56.7% and 56.4% in the chloroplast genomes of A. debilis and A. contorta, respectively; while the non-coding The analysis results revealed that both species had a GC content of 38.3%. However, this was unevenly distributed across the whole chloroplast genome. In both species, the GC content exhibited the highest values of the IR regions across the complete chloroplast genome, 43.4% both in A. debilis and A. contorta. The high GC content in the IR regions was the result of four rRNA genes (rrn16, rrn23, rrn4.5 and rrn5) that occur in this region [31]. In addition, the LSC regions have GC contents of 36.6% and 35.5%, as well as the lowest values of 32.9% and 32.8% are seen in SSC regions in A. debilis and A. contorta, respectively. Within the protein-coding regions (CDS) of chloroplast genome of A. debilis, the percentage of AT content for the first, second and third codon positions were 54%, 61.4% and 68%, respectively (Table 1). A bias towards a higher AT representation at the third codon position has also been observed in other land plant chloroplast genomes [32][33][34].
Introns play an important role in the regulation of gene expression and can enhance the expression of exogenous genes at specific sites and specific times of the plant [35]. The intron content of genes reserved in the chloroplast genomes of A. debilis and A. contorta are maintained in other angiosperms [31,36]. Data revealed the presence of 18 genes containing introns in each chloroplast genome, including atpF, rpoC1, ycf3, rps12, rpl2, rpl16, clpP, petB, petD, rps16, ndhA, ndhB, and six tRNA genes (Table 3). In addition, the ycf3 gene and rps12 gene each contain two introns and three exons. The ycf3 gene is located in LSC region and the rps12 gene is a special trans-splicing gene, the 5 exon is located in LSC, while the 3 exon is located in IR, which is similar to that in Aquilaria sinensis [25], Panax ginseng [36] and Cistanche deserticola [37]. Table 3. Genes with introns in the chloroplast genomes of A. debilis and A. contorta as well as the lengths of the exons and introns.

IR Contraction and Expansion
Although genomic structure and size were highly conserved in Angiosperms chloroplast genomes, the IR/SC boundary regions still varied slightly ( Figure 3). The contraction and expansion at the borders of the IR regions are common evolutionary events and represent the main reasons for size variation of the chloroplast genomes [33,38]. From Figure 3, the junctions of the IR and LSC regions of four species including Arabidopsis thaliana (accession number: NC_000932), Nicotiana tabacum (NC_001879), as well as two Aristolochia species were compared. The IRb/SSC border extended into the ycf1 genes to cause long ycf1 pseudogenes in all species; however, compared with A. thaliana and N. tabacum, the length of ycf1 pseudogene of two Aristolochia species were only 171 and 169 bp, respectively. The IRa/SSC border was located in the CDS of the ycf1 gene and expanded the same length into the 5 portion of ycf1 gene as IRb expanded in the four chloroplast genomes. The trnH genes were located in the LSC regions in Nicotiana tabacum, Sesamum indicum, Arabidopsis thaliana, and Salvia miltiorrhiza [31], while this gene was usually located in the IR region in the monocot chloroplast genomes [39]. Interestingly, the IRa/LSC borders were located in the coding region of trnH genes in the two Aristolochia species.

IR Contraction and Expansion
Although genomic structure and size were highly conserved in Angiosperms chloroplast genomes, the IR/SC boundary regions still varied slightly ( Figure 3). The contraction and expansion at the borders of the IR regions are common evolutionary events and represent the main reasons for size variation of the chloroplast genomes [33,38]. From Figure 3, the junctions of the IR and LSC regions of four species including Arabidopsis thaliana (accession number: NC_000932), Nicotiana tabacum (NC_001879), as well as two Aristolochia species were compared. The IRb/SSC border extended into the ycf1 genes to cause long ycf1 pseudogenes in all species; however, compared with A. thaliana and N. tabacum, the length of ycf1 pseudogene of two Aristolochia species were only 171 and 169 bp, respectively. The IRa/SSC border was located in the CDS of the ycf1 gene and expanded the same length into the 5′ portion of ycf1 gene as IRb expanded in the four chloroplast genomes. The trnH genes were located in the LSC regions in Nicotiana tabacum, Sesamum indicum, Arabidopsis thaliana, and Salvia miltiorrhiza [31], while this gene was usually located in the IR region in the monocot chloroplast genomes [39]. Interestingly, the IRa/LSC borders were located in the coding region of trnH genes in the two Aristolochia species.

Codon Usage and RNA Editing Sites
All the protein-coding genes were composed of 26,239 and 26,255 codons in the chloroplast genomes of A. debilis and A. contorta, respectively. Among these codons, 2737 encode leucine and 315 encode cysteine, respectively, the most and least universal amino acids in the A. debilis chloroplast

Codon Usage and RNA Editing Sites
All the protein-coding genes were composed of 26,239 and 26,255 codons in the chloroplast genomes of A. debilis and A. contorta, respectively. Among these codons, 2737 encode leucine and 315 encode cysteine, respectively, the most and least universal amino acids in the A. debilis chloroplast genome. The codon usages of protein-coding genes in the A. debilis and A. contorta chloroplast genomes are deduced and summarized in Figure 4 and Table S1. Figure 4 shows that the relative synonymous codon usage (RSCU) value increased with the quantity of codons that code for a specific amino acid. Most of the amino acid codons have preferences except for methionine and tryptophan. The results presented here are similar in codon usage with the chloroplast genomes of species within the genus Ulmus [40] and Aq. sinensis [25]. In addition, potential RNA editing sites were predicted for 35 genes of the chloroplast genomes of two species. A total of 92 RNA editing sites were identified (Table S2). The amino acid conversion S to L occurred most frequently, while P to S and R to W occurred least. Seventy-six common RNA editing sites were shared in genes of the two species. genome. The codon usages of protein-coding genes in the A. debilis and A. contorta chloroplast genomes are deduced and summarized in Figure 4 and Table S1 . Figure 4 shows that the relative synonymous codon usage (RSCU) value increased with the quantity of codons that code for a specific amino acid. Most of the amino acid codons have preferences except for methionine and tryptophan. The results presented here are similar in codon usage with the chloroplast genomes of species within the genus Ulmus [40] and Aq. sinensis [25]. In addition, potential RNA editing sites were predicted for 35 genes of the chloroplast genomes of two species. A total of 92 RNA editing sites were identified (Table S2). The amino acid conversion S to L occurred most frequently, while P to S and R to W occurred least. Seventy-six common RNA editing sites were shared in genes of the two species.

Repeat Structure and Simple Sequence Repeats Analyses
The repeats were mostly distributed in the intergenic spacer (IGS) and intron sequences. Figure  5 shows the repeat structure analyses of six species. The results revealed that the repeats of chloroplast genome of A. contorta had the greatest number, comprising of 41 forward, 43 palindromic, 29 reverse, and 25 complement repeats. Followed by A. debilis, contained 14 forward, 23 palindromic, 23 reverse, and six complement repeats. Simple sequence repeats (SSRs), which are ubiquitous throughout the genomes and are also known as microsatellites, are tandemly repeated DNA sequences that consist of 1-6 nucleotide repeat units [41]. SSRs are widely used for molecular markers in species identification, population genetics, and phylogenetic investigations based on their high level of polymorphism [42][43][44]. A total of 129 and 156 SSRs were identified using the microsatellite identification tool (MISA) in the chloroplast genomes of A. debilis and A. contorta, respectively (Table 4; Tables S3 and S4). In these SSRs, mononucletide repeats were largest in number, which were found 81 and 96 times in A. debilis and A. contorta, respectively. A/T mononucleotide repeats (96.3% and 94.8%, respectively) were the most common, while the majority of dinucleotide repeat sequences comprised of AT/TA repeats (100% and 92.8%, respectively). This result agreed with the previous studies where proportions of polyadenine (polyA) and polythymine

Repeat Structure and Simple Sequence Repeats Analyses
The repeats were mostly distributed in the intergenic spacer (IGS) and intron sequences. Figure 5 shows the repeat structure analyses of six species. The results revealed that the repeats of chloroplast genome of A. contorta had the greatest number, comprising of 41 forward, 43 palindromic, 29 reverse, and 25 complement repeats. Followed by A. debilis, contained 14 forward, 23 palindromic, 23 reverse, and six complement repeats. Simple sequence repeats (SSRs), which are ubiquitous throughout the genomes and are also known as microsatellites, are tandemly repeated DNA sequences that consist of 1-6 nucleotide repeat units [41]. SSRs are widely used for molecular markers in species identification, population genetics, and phylogenetic investigations based on their high level of polymorphism [42][43][44]. A total of 129 and 156 SSRs were identified using the microsatellite identification tool (MISA) in the chloroplast genomes of A. debilis and A. contorta, respectively (Table 4; Tables S3 and S4). In these SSRs, mononucletide repeats were largest in number, which were found 81 and 96 times in A. debilis and A. contorta, respectively. A/T mononucleotide repeats (96.3% and 94.8%, respectively) were the most common, while the majority of dinucleotide repeat sequences comprised of AT/TA repeats (100% and 92.8%, respectively). This result agreed with the previous studies where proportions of polyadenine (polyA) and polythymine (polyT) were higher than polycytosine (polyC) and polyguanine (polyG) within chloroplast SSRs in many plants [24]. (polyT) were higher than polycytosine (polyC) and polyguanine (polyG) within chloroplast SSRs in many plants [24].

Comparative Genomic Analysis
The whole chloroplast genome sequences of A. debilis and A. contorta were compared to those of Calycanthus floridus var. glaucus (accession number: NC_004993), Magnolia officinalis (NC_020316), and Liriodendron chinense (NC_030504) using the mVISTA program ( Figure 6). The comparison showed that the two IR regions were less divergent than the LSC and SSC regions. The four rRNA genes were the most conserved, while the most divergent coding regions were ndhF, rpl22, ycf1, rpoC2 and ccsA. Additionally, the results revealed that non-coding regions exhibited a higher divergence than coding regions, and the most divergent regions localized in the intergenic spacers among the five chloroplast genomes.

Comparative Genomic Analysis
The whole chloroplast genome sequences of A. debilis and A. contorta were compared to those of Calycanthus floridus var. glaucus (accession number: NC_004993), Magnolia officinalis (NC_020316), and Liriodendron chinense (NC_030504) using the mVISTA program ( Figure 6). The comparison showed that the two IR regions were less divergent than the LSC and SSC regions. The four rRNA genes were the most conserved, while the most divergent coding regions were ndhF, rpl22, ycf1, rpoC2 and ccsA. Additionally, the results revealed that non-coding regions exhibited a higher divergence than coding regions, and the most divergent regions localized in the intergenic spacers among the five chloroplast genomes. Figure 6. Sequence identity plot comparing the five chloroplast genomes with A. debilis as a reference by using mVISTA. Grey arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50% to 100%.

Phylogenetic Analyses
Chloroplast genomes provide abundant resources, which are significant for evolutionary, taxonomic, and phylogenetic studies [31,45,46]. The whole chloroplast genomes and protein-coding genes have been successfully used to resolve phylogenetic relationships at almost any taxonomic level during the past decade [31,37]. Aristolochia, consisting of nearly 400 species, is the largest genus in the family Aristolochiaceae [30]. Phylogenetic analyses employing one or several genes have been performed in previous studies [47][48][49]; however, these analyses were restricted to the species of Aristolochiaceae and included few species from other families. In this study, to identify the phylogenetic positions of A. debilis and A. contorta within Angiosperms, 60 protein-coding genes Figure 6. Sequence identity plot comparing the five chloroplast genomes with A. debilis as a reference by using mVISTA. Grey arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs, respectively. A cut-off of 70% identity was used for the plots, and the Y-scale represents the percent identity ranging from 50% to 100%.

Phylogenetic Analyses
Chloroplast genomes provide abundant resources, which are significant for evolutionary, taxonomic, and phylogenetic studies [31,45,46]. The whole chloroplast genomes and protein-coding genes have been successfully used to resolve phylogenetic relationships at almost any taxonomic level during the past decade [31,37]. Aristolochia, consisting of nearly 400 species, is the largest genus in the family Aristolochiaceae [30]. Phylogenetic analyses employing one or several genes have been performed in previous studies [47][48][49]; however, these analyses were restricted to the species of Aristolochiaceae and included few species from other families. In this study, to identify the phylogenetic positions of A. debilis and A. contorta within Angiosperms, 60 protein-coding genes commonly present in 37 species from Piperales, Laurales, Magnoliales, Ranunculales, Fabales, Rosales, Chloranthales, as well as two Aristolochia species were used to construct the phylogenetic tree using the Maximum parsimony (MP) method (Figure 7). All the nodes in the MP trees have high bootstrap support values, and 30 out of 34 nodes with 100% bootstrap values were found. The result illustrated that two Aristolochia species were sister taxa with respect to four Piper species (Piperaceae), and these species were grouped with four species from Laurales and five species from Magnoliales. Additionally, all species are clustered within a lineage distinct from the outgroup. This result (inferred from the chloroplast genome data) obtained high support values, which suggested that the chloroplast genome could effectively resolve the phylogenetic positions and relationships of this family. Nevertheless, to accurately illustrate the evolution of the family Aristolochiaceae, it is necessary to use more species to analyze the phylogeny. This study will also provide a reference for species identification among Aristolochia and other genus using the chloroplast genome. commonly present in 37 species from Piperales, Laurales, Magnoliales, Ranunculales, Fabales, Rosales, Chloranthales, as well as two Aristolochia species were used to construct the phylogenetic tree using the Maximum parsimony (MP) method ( Figure 7). All the nodes in the MP trees have high bootstrap support values, and 30 out of 34 nodes with 100% bootstrap values were found. The result illustrated that two Aristolochia species were sister taxa with respect to four Piper species (Piperaceae), and these species were grouped with four species from Laurales and five species from Magnoliales. Additionally, all species are clustered within a lineage distinct from the outgroup. This result (inferred from the chloroplast genome data) obtained high support values, which suggested that the chloroplast genome could effectively resolve the phylogenetic positions and relationships of this family. Nevertheless, to accurately illustrate the evolution of the family Aristolochiaceae, it is necessary to use more species to analyze the phylogeny. This study will also provide a reference for species identification among Aristolochia and other genus using the chloroplast genome.

Plant Material, DNA Extraction, and Sequencing
Fresh plants of A. debilis and A. contorta were collected from Lichuan City in Hubei Province and Tonghua City in Jilin Province, respectively. All samples were identified by Professor Yulin Lin, who is based at the Institute of Medicinal Plant Development (IMPLAD), Chinese Academy of Medical Sciences (CAMS) and Peking Union Medical College (PUMC). The voucher specimens were deposited in the herbarium of the IMPLAD. The leaves were cleansed and preserved in a −80 • C refrigerator. Total genomic DNA was extracted from approximately 100 mg of samples using DNeasy Plant Mini Kit with a standard protocol (Qiagen Co., Hilden, Germany). Final DNA quality was assessed based on spectrophotometry and their integrity was examined by electrophoresis in 1% (w/v) agarose gel. The DNA was used to construct shotgun libraries with insert sizes of 500 bp and sequenced according to the manufacturer's manual for the Illumina Hiseq X. Approximately 6.3 Gb of raw data from A. debilis and 5.8 Gb from A. contorta were produced with 150 bp pair-end read lengths.

Chloroplast Genome Assembly and Annotation
First, we used the software Trimmomatic (v0.36, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany) [50] to trim the low-quality reads. After quality control, the clean reads were used to assemble the chloroplast genome. All chloroplast genomes of plants recorded in the National Center for Biotechnology Information (NCBI) were used to construct a reference database. Next, the clean reads were mapped to the database on the basis of their coverage and similarity, and the mapped reads extracted. Extracted reads were assembled to contigs using SOAPdenovo (v2, BGI HK Research Institute, Hong Kong, China) [51], and the resulting contigs were combined and extended to obtain a complete chloroplast genome sequence. To verify the accuracy of assembly, four boundaries of single copy (SC) and inverted repeat (IR) regions of the assembled sequences were confirmed by PCR amplification and Sanger sequencing using the primers listed in Table S5.
We used the online program Dual Organellar GenoMe Annotator (DOGMA), (University of Texas at Austin, Austin, TX, USA) [52] and the software Chloroplast Genome Annotation, Visualization, Analysis, and GenBank Submission (CPGAVAS), (Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China) [53] coupled with manual corrections to perform the preliminarily gene annotation of chloroplast genomes of two species. The tRNA genes were identified using the software tRNAscan-SE (v2.0, University of California, Santa Cruz, CA, USA) [54] and DOGMA [52]. The gene map was drawn using the Organellar Genome DRAW (OGDRAW) (v1.2, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany) [55] with default settings and checked manually. The complete and correct chloroplast genome sequences of the two species were deposited in GenBank, accession numbers of A. debilis and A. contorta are MF539928 and MF539927, respectively.

Genome Structure Analyses and Genome Comparison
The distribution of codon usage was investigated using the software CodonW ( University of Texas, Houston, TX, USA) with the RSCU ratio [56]. Thirty-five protein-coding genes of the chloroplast genomes of two species were used to predict potential RNA editing sites using the online program Predictive RNA Editor for Plants (PREP) suite [57] with a cutoff value of 0.8. GC content was analyzed using Molecular Evolutionary Genetics Analysis (MEGA v6.0, Tokyo Metropolitan University, Tokyo, Japan) [58]. REPuter ( University of Bielefeld, Bielefeld, Germany) [59] to identify the size and location of repeat sequences, including forward, palindromic, reverse, and complement repeats in the chloroplast genomes of six species C. floridus var. glaucus, M. officinalis, L. chinense and Piper nigrum (NC_034692). For all repeat types, the minimal size was 30 bp and the two repeat copies had at least 90% similarity. Simple sequence repeats (SSRs) were detected using MISA software [60] with parameters set the same as Li et al. [61]. The whole-genome alignment for the chloroplast genomes of the five species including A. debilis, A. contorta, C. floridus var. glaucus, M. officinalis, and L. chinense were performed and plotted using the mVISTA program [62].

Phylogenetic Analyses
A total of 35 complete chloroplast genomes were downloaded from the NCBI Organelle Genome Resources database (Table S6). The 60 protein-coding gene sequences commonly present in 37 species, including the two species in this study, were aligned using the Clustal algorithm [63]. To determine the phylogenetic positions of A. debilis, and A. contorta, we analyzed the chloroplast genomes of these 60 protein-coding genes. Maximum parsimony (MP) analysis was performed with PAUP*4.0b10 [64], using a heuristic search performed with the MULPARS option, the random stepwise addition with 1000 replications and tree bisection-reconnection (TBR) branch swapping. Bootstrap analysis was also performed with 1,000 replicates with TBR branch swapping.

Conclusions
The complete chloroplast genomes of A. debilis and A. contorta, the first two sequenced members of the family Aristolochiaceae, were determined in this study. The genome structure and gene content were relatively conserved. The phylogenetic analyses illustrated that these two Aristolochia species were positioned close to four species from the family Piperaceae and had a close phylogenetic relationship with Laurales and Magnoliales. The results provided the basis for the study of the evolutionary history of A. debilis and A. contorta. All the data presented in this paper will facilitate the further investigation of these two medicinal plants.