The Complete Chloroplast Genome of Heimia myrtifolia and Comparative Analysis within Myrtales

Heimia myrtifolia is an important medicinal plant with several pharmacologically active alkaloids and is also used as an ornamental landscape plant. The purpose of this study is to complete and characterize the chloroplast (cp) genome of H. myrtifolia and compare genomic features to other Myrtales species’ cp genomes. The analysis showed that H. myrtifolia has a total length of 159,219 bp with a typical quadripartite structure containing two identical inverted repeats (IRs) of 25,643 bp isolated by one large single copy (LSC) of 88,571 bp and one small single copy (SSC) of 18,822 bp. The H. myrtifolia cp genome contains 129 genes with eight ribosomal RNAs, 30 transfer RNAs, and 78 protein coding genes, in which 17 genes are duplicated in two IR regions. The genome organization including gene type and number and guanine-cytosine (GC) content is analyzed among the 12 cp genomes in this study. Approximately 255 simple sequence repeats (SSRs) and 16 forward, two reverses, and two palindromic repeats were identified in the H. myrtifolia cp genome. By comparing the whole H. myrtifolia cp genome with 11 other Myrtales species, the results showed that the sequence similarity was high between coding regions while sequence divergence was high between intergenic regions. By employing the full cp genomes for phylogenetic analysis, structural and sequence differences were characterized between H. myrtifolia and 11 Myrtales species illustrating what patterns are common in the evolution of cp genomes within the Myrtales. The first entire cp genome in the genus Heimia provides a valuable resource for further studies in these medicinally and ornamentally important taxa.


Introduction
Heimia is a genus of flowering plants in the loosestrife family, Lythraceae (Order Myrtales), named in honor of German physician Ernst Ludwig Heim [1]. The genus Heimia is comprised of three woody shrub species with five-petaled yellow flowers and a bell-shaped or hemispherical calyx tube, and is commonly known as "sun opener" or "shrubby yellowcrest". The Heimia species are distributed from west Texas and northern Mexico in the north to Argentina in the southern part of the range. Heimia species have a history of medicinal use in native American cultures, in which several pharmacologically active alkaloids have been found, chief among them being cryogenine [2,3].
Heimia myrtifolia has been reported to have hallucinogenic properties wherewith objects appear yellow accompanied with auditory hallucinations [3]. Anti-inflammatory properties have also been attributed to the alkaloid cryogenine in Heimia [4]. Given the attractive yellow flowers that Heimia species produce and its shrubby form, it is highly valued as ornamental plant.
Chloroplasts (cp), are essential organelles that convert light energy to chemical energy in chlorophytes and possess their own genomes for biosynthesis of pigments, starch, amino acids, and fatty acids, encoding proteins for photosynthesis and nitrogen fixation [5]. Compared with nuclear genomes, cp genomes have highly conserved gene order, number, and content, and are uniparentally inherited [6]. Most angiosperms' cp genomes are typically circular with a quadripartite structure ranging from 115 to 165 kb in length and include two inverted repeated regions (IRs) which are separated by the small single copy region (SSC) and the large single copy region (LSC) [7]. Because of their conserved structure, uniparental inheritance, and similar gene content, DNA sequences from cp genomes have been important in systematic, population genetic, and phylogenetic studies. Previously, phylogenetic trees have been reconstructed from one or a few genes from the cp [8]. However, in recent years, complete cp genomes have been increasingly used as an informative resource for resolving lower taxonomic level phylogenetic relationships [9][10][11][12][13][14][15].
By comparing entire cp genomes, the ability to detect reliable DNA barcodes for precise plant identification is improved. As next-generation sequencing costs fall, cp genomes are more routinely integrated into phylogenetic, population genetics, and DNA barcoding for identification of numerous species and families [9,10,13,[16][17][18][19][20][21]. The over 2300 cp genomes that have been deposited in the National Center for Biotechnology Information (NCBI) database illustrates the importance and utility of whole cp genomes for the study of plant evolution.
Herein, we present the first whole cp genome sequence generated from Illumina sequencing in the genus Heimia. This complete cp genome will be a valuable genetic resource for comprehensively understanding the organization of the H. myrtifolia cp genome and studying phylogenetic relationships within the Lythraceae family and Myrtales generally. Our study objectives were as follows: to enhance our understanding of the structural diversity of the H. myrtifolia genome and detect highly informative hotspot markers from comparative analyses with other cp genomes in Lythraceae and Myrtales.

Chloroplast Genome Structure and Content
The H. myrtifolia cp genome is 159,219 bp ( Figure 1) in length and similar to other Myrtales cp genomes (Tables 1 and 2), which vary in length from 152 to 165 Kb [20,22]. Unsurprisingly, the cp DNA of H. myrtifolia is the typical quadripartite and circular structure that contains two IRs divided by LSC and SSC regions ( Figure 1). The guanine-cytosine (GC) content percentage of the intact H. myrtifolia cp genome was 37.0% (Table 1), which is lower than that of L. intermedia (37.6%) and Oenothera argillicola (39.1%).   In the H. myrtifolia cp genome, 112 total unique genes were detected, of which 17 are duplicated in the IRs ( Table 3). The 112 genes are divided into 30 tRNA genes, four rRNA genes, and 78 protein-coding genes. Among these 112 unique genes, three (clpP, rps12, and ycf3) contain two introns and 14 contain one intron (eight protein-coding genes and six tRNA genes) ( Table 4). The Rps12 gene is a trans-spliced with two C-terminal exons and one N-terminal downstream exon. The trnK-UUU gene in which the matK gene is located has the largest intron at 2497 bp.
By proportion, tRNAs, rRNAs, and proteins are encoded by 2.0, 3.0, and 51.0% of the H. myrtifolia cp genome, respectively ( Table 2). The remaining 49.0% of the H. myrtifolia cp genome belongs to non-coding regions, comprised of pseudo-genes, introns, and intergenic spacers ( Table 2). Protein-coding sequences account for 74,088 bp possessing 78 protein-coding genes coding for 27,453 codons (Table 3 and Table S1). Moreover, the AT content within protein-coding regions was 66.1%, 61.9%, and 58.7% at the first, second, and third codon positions, respectively (Table 5). At the third codon position, G and C nucleotides are enriched over A and T; a result consistent with those widely obtained in many other terrestrial plant cp genomes [23].

Codon Usage
Codon usage biases can have important ramifications for cellular function and reflect lineage specific translational systems thus providing additional means for studying speciation and evolution at the molecular level [24,25]. However, cp genomes, unlike nuclear genomes, do not appear to have synonymous codon usage bias associated with intron number or evolutionary specialization [26]; therefore, we examined codon usage to confirm this.
The frequency of codon usage was calculated for the H. myrtifolia cp genome based on the tRNAs and protein-coding genes. Tryptophan (1.5%) and leucine (11.6%) were the least-frequency and highest-frequency amino acids, respectively ( Figure 2). Among which, the least and most used were CGC (99) encoded arginine and AAA (1137) encoded lysine, respectively (Table S1). Significantly, as a synonym, almost each amino acid contains half of the codons, which ended with A or T (U) at high relative synonymous codon usage (RSCU) values and low RSCU values ended with G or C (Table S1). The composition bias with high A/T proportion codon usage patterns is generally semblable to those reported from other cp genomes [27].

Comparative Genomic Analysis of the cp Genomes in Myrtales
From the pairwise comparison of cp genomes, a high level of sequence similarity was found between H. myrtifolia and the 11 other Myrtales cp genomes. By using mVISTA, H. myrtifolia annotation was used as a reference to characterize differences between the 11 Myrtales species' cp genomes ( Figure 3). The results showed that the LSC and SSC regions are more divergent than the two IR regions. In addition, within the LSC and SSC regions, the non-coding regions are more divergent than the coding regions. The most highly differentiated regions including atpB, matK, ndhD, ndhF, ndhH, rpl22, rps15, ycf2, and trnH-psbA. Similar levels of divergence have been previously measured for these gene regions [28,29]. IR regions of all 12 cp genomes were highly conserved, including gene order and number, however, they showed significant differences at the junction of the single-copy regions. Neither inversions nor translocations were detected among these compared genomes. Variations of genome size, IR expansion, and contraction were the main structural differences detected within these 12 cp genomes.

Genome Size Differences between the 12 Myrtales cp Genomes
For genome size of the 12 Myrtales species examined, L. intermedia has the smallest cp genome size (152,330 bp) and Oenothera argillicola the largest (165,055 bp). The genome size variation is largely caused by differences in the intergenic regions (IGS), similar to other angiosperm cp genomes.

Contraction and Expansion of All Inverted Repeats (IRs)
In general, the sizes of IR regions differ between species (Table 1). The expansion and contraction between the two inverted repeats, LSC, and SSC boundary regions usually generates length variation of plant cp genomes [30]. Accurate SC-IR boundaries and their neighboring genes were compared among the 12 Myrtales cp genomes (Figure 4). Although the overall genomic structure was conserved, the 12 Myrtales cp genomes possessed differences at the SC-IR junction regions (Figure 4). The size of two IRs varied from 25,736 bp (L. intermedia) to 28,772 bp (O. argillicola), as did the four IR boundaries (J LA , J LB , J SA , and J SB ) [13] (Figure 4). The IR A -LSC boundary (J LA ) is nested in the rps19 coding gene in L. intermedia, A. ternata, O. argillicola, P. guajava, and S. quadrifida by 87 bp, 38 bp, 178 bp, 31 bp, and 38 bp, respectively, into the IR A region. However, in the remaining seven species, the J LA boundary nested in the intergenic region between rps19 and rpl2, in which the distances from rps19 to the J LA ranged from 2 to 240 bp. The IR A -SSC junction (J SA )is nested in the pseudogene ycf1 (ϕycf1) in L. intermedia (Figure 4). The J SA junction for eight of the 12 species (A. sellowiana, A. costata, C. eximia, E. aromaphloia, E. uniflora, Psidium guajava, S. quadrifida, and S. cumini) is located on the edge of ϕycf1. The J SA junction of A. ternata and O. argillicola was located in the range of ndhF, and J SA of H. myrtifolia is situated 1 bp from the end of ϕycf1.
The IR B -SSC boundary (J SB ) in 11 of the 12 species is nested in the ycf1 gene, which extended into IR B region, while in O. argillicola, the distance between J SB and the end edge of ycf1 was 257 bp. The IR B -LSC boundary (J LB ) was situated in the region between rpl2 and trnH in all of the species except S. quadrifida. In S. quadrifida, the trnH gene extends 5 bp into IR B (Figure 4). The IR-LSC boundary variation is likely the result of a series of two short direct repeats that are mediated by intramolecular recombination within the genes located at the borders [31]. As such, the IR-LSC boundary could be a highly informative region for population or phylogenetic studies.

Long Repeat Structure Analysis
Previous studies have shown that the genome rearrangement can occur from sliding and inappropriate combinations of repetitive sequences [32]. Long repetitive sequences have been highly valuable markers in the study of plant evolution, genome recombination studies, comparative genomics, and phylogenetics [33].
Comparison of forward, reverse, complement, and palindromic repeats (≥30 bp) were made among H. myrtifolia and 11 species using REPuter. In H. myrtifolia, 18 repeats including 15 forward, one palindromic, and two reverse type were found. A. ternata had the fewest (11) repeats with shortest genome size of 159,593 bp, which is inconsistent strictly with the rule of larger genome size possessing more repetitive repeats [34].
In total, 195 repeats in all 12 species were found ( Figure 5A). O. argillicola possessed the greatest number of repeats consisting of 22 forward repeats and one palindromic repeat as well as possessing the largest genome of those in this study ( Figure 5A and Table S2). In L. intermedia, A. sellowiana, A. costata, C. eximia, E. aromaphloia, E. uniflora, P. guajava, S. quadrifida, and S. cumini cp genomes, 20,16,18,20,13,15,13,16, and 12 long repeats were identified, respectively ( Figure 5A). The largest proportion of repeats (82.1%) varied from 30 bp to 40 bp in length ( Figure 5B and Table S2), while the range of repeats was from 94 bp to 30 bp per unit. Forward repeats are usually caused by transposon activity [35], which can correlate with enhanced cellular stress [36]. Forward repeats can cause variation in genome structure and consequently can be employed as markers in population genetic and phylogenetic studies [20].

Simple Sequence Repeat (SSR) Analysis
Simple sequence repeats (SSRs) in cp genomes have high copy number diversity and are thus very useful molecular markers for plant population genetics, breeding studies at the intraspecific level and evolutionary research [37]. In this study, the type, distribution, and number of SSRs were identified using the search criteria as follows: 10 repeats for mononucleotide, three repeats for dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide among the cp genomes of 12 species.

Divergence Hotspots among Myrtales Species
The nucleotide diversity (Pi) values of the 12 species' cp genomes were computed separately for the IRs, LSC, SSC regions, and protein-coding genes including introns ( Figure 7A,B). The IGS regions were far more divergent than the protein-coding regions (CDS). In regard to the quadripartite subdivisions, the LSC and SSC are less divergent than IRs regions. Within the CDS regions, Pi values varied from 0.09 to 0.141 with an average value of 0.033 in the LSC region, the SSC region ranged from 0.028 to 0.137, with an average value of 0.051, and the IR region had values from 0.005 to 0.114 with an average value of 0.046. The five genes with the largest variability in CDS region were atpA, ccsA, rps12, ycf1, and rpl2 ( Figure 7A), and for the IGS regions, rps15-ycf1, rps4-trnT-UGU, trnK-UUU-rps16, trnG-UCC-trnR-UCU, and rpl32-trnL-UAG were the most variable ( Figure 7B). Some regions were uncharacteristically conserved with IGS regions trnI-GAU-trnA-UGC and the ndhB intron showing less variation than that of genes situated in the CDS region ( Figure 7B).

Phylogenetic Analysis of H. myrtifolia and Related Myrtales cp Genomes
In the past few decades, the method of constructing phylogenetic trees has been based on one or a few relatively short sequences [38]. However, due to lateral gene transfer, paralogy, and genetic evolution rate differences between groups, the phylogenetic tree based on a single or few genes cannot sufficiently represent phylogenetic relationships. The entire cp genome is being used more and more in plant phylogenetic and population genetics as large-scale DNA sequencing becomes more main stream and less expensive. Our phylogenetic tree showed that H. myrtifolia is most closely related to Lagerstroemia species based on the 68 shared protein-coding genes in the matrix (Figure 8). Through all three methods, the phylogenetic tree had very high bootstrap support for most branches. These results suggested that entire cp genome information may be useful when resolving phylogenetic relationship conflicts. However, phylogenetic analyses with many closely related species and populations are needed to thoroughly examine the resolving power of cp coding genes [13,39].

DNA Extraction of Plant Materials and Sequencing
Fresh leaves of H. myrtifolia (Lythraceae, Myrtales) were attained from Hangzhou Botanic Garden, Zhejiang Province (China), and were preserved immediately in silica gel. Genomic DNA was extracted employing a standard Cetyl trimethyl ammonium bromide (CTAB) protocol [40]. The concentration and quality of extracted DNA was evaluated using a NanoDrop 2000 Micro spectrophotometer and an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).
A sequence library was constructed using purified DNA following the manufacturer's instructions. Using an Illumina HiSeq 2000 sequencer (Illumina Biotechnology company, San Diego, CA, USA), approximately 41,103,536 raw reads were obtained with paired-end (PE) 150 bp length reads.

Chloroplast Genome Assembly, Annotation, and Structure
Using Trimmomatic v0.3, raw reads with a Phred Quality Score of 20 or less were trimmed and filtered [41] using the following settings: sliding window: 4:15, trailing: 3, leading: 3, and minlen: 50. First, the CLC Genomics Workbench v7.0 (Qiagen Company, Hilden, Germany) was employed to carry out de novo assembly with the default parameters [13]. Second, using the Lagerstroemia fauriei cp genome as a reference, all contigs were aligned using BLAST software on the NCBI website to generate the complete cp genome.
Genome annotation was performed for the ribosomal RNAs (rRNAs), transfer RNAs (tRNAs), and protein-coding genes using DOGMA v1.2 [42]. The start and stop codons and the exon-intron boundaries of genes were precisely manually confirmed using published cp genomes [39]. Draft annotations were subsequently examined and manual adjustments were made with alignments to related species L. fauriei [13]. BLASTN searches in the NCBI website were used to identify and confirm both tRNA and rRNA genes. Lastly, further verification of the tRNA genes was carried out with tRNAscan-SE v1.21 [43]. The final cp genome physical map was drawn using OGDraw software [44].

Codon Usage
In order to detect the deviation in the use of synonymous codons, the relative synonymous codon usage (RSCU) was used to examine the effect of amino acid composition as calculated by MEGA 6 [45]. The RSCU is a simple method to determine synonymous codon inconsistencies in coding sequences. The RSCU value is the relative probability for a specific codon when translating the corresponding amino acid and it removes the effect of the amino acid composition on the use of the codon. An RSCU of >1.00 denotes codons are used more frequently than expected, while an RSCU of <1.00 denotes a codon is being applied less frequently than expected.

Genome Comparative Analysis and Molecular Marker Identification
We downloaded Lagerstroemia intermedia, Acca sellowiana, Angophora costata, Allosyncarpia ternata, Corymbia eximia, Eucalyptus aromaphloia, Eugenia uniflora, Oenothera argillicola, Psidium guajava, Syzygium cumini, and Stockwellia quadrifida cp genomes from GenBank (GenBank accession numbers in Tables 1 and 2), as a set to compare cp genomes in the Myrtales. Using the annotation of H. myrtifolia as the reference, pairwise alignments among 12 cp genomes in the Myrtales were conducted using LAGAN mode in the mVISTA program [46].
In order to assess the different evolutionary patterns in Myrtales and detect the highly informative regions, we extracted both intergenic regions and protein-coding regions after alignment using MEGA 6. The two-standard cutoff was used wherein at least one mutation site must be present and the aligned length is >200 bp. The nucleotide diversity (Pi) of these regions was calculated using DNaSP V5.10 [47].

IR Expansion and Contraction of cp Boundaries
Genome differences between species are often found at the LSC and SSC junctions with the two reverse duplicate regions (IR A and IR B ). There are four boundaries (JL A , JL B , JS A , and JS B ) in the cp genome between the two IRs and the LSC and SSC regions [30]. The precise IR expansion and contraction with the boundary genes among H. myrtifolia and the 11 other Myrtales species were compared in this study.

Phylogenetic Analysis
To analyze the phylogenetic placement of H. myrtifolia, 68 common protein-coding genes of the cp genomes from 29 species were employed including 6 outgroup species from Geraniaceae (Erodium carvifolium, Erodium crassifolium, Monsonia speciosa, Pelargonium alternans, Pelargonium x hortorum, and Geranium palmatum (GenBank accession numbers of species in Table S5). With the Clustal X default parameters, alignments were conducted to retain the reading frames accompanied by manual correction [50]. The data matrix used in the phylogenetic analyses is attached as supplemental data (Supplementary Materials). The phylogenetic tree based on these 68 concatenated genes was constructed using three phylogenetic-inference methods: maximum-likelihood (ML) using PHYML v 2.4.5 [51], Bayesian inference (BI) using MrBayes 3.1.2 [52] and parsimony analysis using PAUP* 4.0b10 [53] employing the settings from [13].

Conclusions
By adopting high coverage Illumina sequencing, we completed the H. myrtifolia cp genome and deposited the sequence into GenBank (Accession number: MG921615). The general genome structure, gene number, and gene content of H. myrtifolia were similar with all other cp genomes from Myrtales. However, numerous differences were found between the 12 species examined that are useful markers for studies in molecular evolution of cp genomes. The cp genome information of H. myrtifolia is a useful genetic resource that could be applied to population genomic studies for Lythraceae species and help elucidate genomic patterns and the evolutionary history in the group more broadly.
Supplementary Materials: Supplementary materials will be available online.