Comparison among the first representative chloroplast genomes of Orontium, Lasia, Zamioculcas, and Stylochaeton of the plant family Araceae: inverted repeat dynamics are not linked to phylogenetic signaling

The chloroplast genome provides insight into the evolution of plant species. We de novo assembled and annotated chloroplast genomes of the first representatives of four genera representing three subfamilies: Lasia spinosa (Lasioideae), Stylochaeton bogneri, Zamioculcas zamiifolia (Zamioculcadoideae), and Orontium aquaticum (Orontioideae), and performed comparative genomics using the plastomes. The size of the chloroplast genomes ranged from 163,770–169,982 bp. These genomes comprise 114 unique genes, including 80 protein-coding, 4 rRNA, and 30 tRNA genes. These genomes exhibited high similarities in codon usage, amino acid frequency, RNA editing sites, and microsatellites. The junctions JSB (IRb/SSC) and JSA (SSC/IRa) are highly variable, as is oligonucleotide repeats content among the genomes. The patterns of inverted repeats contraction and expansion were shown to be homoplasious and therefore unsuitable for phylogenetic analyses. Signatures of positive selection were shown for several genes in S. bogneri. This study is a valuable addition to the evolutionary history of chloroplast genome structure in Araceae.


Introduction
The chloroplast is an important double membrane bounded organelle that plays a crucial role in photosynthesis and in the synthesis of fatty acids and amino acids [1]. The chloroplast contains its own DNA, and replicates independently from the nuclear genome [1,2].
Chloroplast genomes mostly exhibit a quadripartite structure in which a pair of inverted repeats (IRa and IRb) separate large single copy (LSC) and small single copy (SSC) regions [1][2][3][4]. However, in some plant lineages, quadripartite structure is not observed due to the loss of one or two inverted repeats (IRs), for example in Taxodiaceae [5] and Fabaceae [6]. On the other hand, very short IRs are also reported, for example in Pinaceae [7]. Moreover, a mixture of linear and circular chloroplast genomes has also been observed [8].
The structure of chloroplast genomes is conserved regarding gene organization, gene content and intron content [1,[9][10][11][12]. However, large scale events of gene rearrangement, gene structure and evolution is still lacking for several major aroid clades, including the subfamilies Lasioideae and Zamioculcadoideae. Hence, new genomic resources are required to provide better insight into the evolution of chloroplast genomes in Araceae.
In the current study, we report de novo assembled and fully annotated chloroplast genomes of four species from three different subfamilies: Lasia spinosa (L.) Thwaites (Lasioideae), Stylochaeton bogneri Mayo and Zamioculcas zamiifolia (Lodd.) Engl. (Zamioculcadoideae), and Orontium aquaticum L. (Orontioideae). We performed comparative chloroplast genomics among these species. This study provides the first insights into the chloroplast genome structure of four genera of Araceae, along with the evolutionary rate of protein-coding genes and types of substitutions occurring within these chloroplast genomes.

Sample collection, DNA extraction and sequencing
We collected fresh and healthy leaves of four species (L. spinosa, S. bogneri, Z. zamiifolia, and O. aquaticum) from the Araceae Greenhouse at the Missouri Botanical Garden in St.
Louis, Missouri. Whole genomic DNA was extracted from the collected leaves using Qiagen DNeasy Minikit (Qiagen, Germantown, Maryland, USA) with some modification following a previous approach [10,18]. DNA quality and quantity were confirmed by 1% agarose gel electrophoresis and Nanodrop (ThermoScientific, Delaware, USA). The libraries were constructed following manufacturer's protocol of Illumina TruSeq kits (Illumina, Inc., San Diego, California) in the Pires laboratory at the University of Missouri, Columbia. The Illumina HiSeq 2000 platform was used to sequence qualified libraries from single end with 100 bp short reads at the University of Missouri DNA Core.

Genome assembly and annotation
The sequencing of these genomes generated 3.31 GB (S. bogneri) to 11.3 GB (Z. zamiifolia) of raw data ( Table 1). The quality of the generated short read data was compared among species using FastQC and MultiQC [35,36]. The analyses confirmed high quality of the data with high average Phred score ranging from 35.69-37.6. The raw data of the sequenced four species were submitted to the Sequence Read Archive of the National Center for Biotechnology (NCBI) under SRA project number PRJNA613281. The generated sequence data were used to de novo assemble chloroplast genomes using Velvet v. 1 [38] following previous studies [9,11,39]. The coverage depth analysis was performed by mapping the short reads to their respective de novo assembled chloroplast genomes by BWA mem [40]. The assembly of the genomes was then validated by visualizing in Tablet [41]. We observed issues at 4-5 points of repetitive regions, therefore, for further validation we used Fast-Plast v.1.2.2 following exactly the same procedure employed for the assembly of other Araceae species [10,18]. This helped us to corroborate the correct sequence at those points. The coverage depth analyses revealed that the average coverage depths of the genomes ranged from 92.7X-1021X. The de novo assembled chloroplast genomes were annotated by GeSeq [42], whereas tRNA genes were further verified by tRNAscan-SE v.2.0.3 [43] and ARAGORN v.1.2.38 [44] by selecting default parameters. The final annotated genomes were submitted to NCBI under specific accession numbers (Table 1). GB2sequin was used to generate five column tab-delimited files from the annotated genomes for NCBI submission [45]. The circular map of these genomes was drawn by OrganellarGenomeDRAW (OGDRAW) [46].

Characterization, comparative analyses, and phylogenetic inference
We used Geneious R8.1 [38] to compare genomic features and determine amino acid frequency and codon usage. To visualize and compare the junctions of chloroplast genomes, we used IRscope with default parameters [47]. The integrated Mauve alignment [48] of Geneious R8.1 was used to analyze gene arrangement based on Colinear block analyses after removal of IRa from the genomes.
The Predictive RNA editors for Plant (PREP-CP) [49] program was used to determined RNA editing sites in the chloroplast genomes. We also analyzed microsatellites and oligonucleotide repeats using MISA (MIcroSAtellite) and REPuter, respectively. We determined microsatellites with repeat units as: mononucleotide repeats of the same chloroplast genome. The GC content of coding regions, rRNAs, and tRNAs also showed high similarities among species (Table 2).

Inverted repeats contraction and expansion
The chloroplast genomes showed variations and similarities at the junctions of LSC/IR/SSC.

Analyses of codon usage, amino acid frequency and RNA editing
The codon usage analyses revealed high encoding efficacy for those codons that end with A/T as opposed to codons that end with C/G. We recorded a relative synonymous codon usage (RSCU) value ≥ 1 for most codons that end with A/T, whereas RSCU < 1 was recorded for codons that end with C/G (Table S3). The ATG codon is the most common start codon.
However, we also observed ACG (in rpl2) and GTG (in rps19) as start codons. The amino acid frequency analyses revealed high encoding of leucine, whereas the rarest encoding was recorded for cysteine ( Figure S1). The RNA editing analyses revealed the presence of 62-74 RNA editing sites in 19-21 genes (Table S4). The RNA editing sites were found in the same genes with a few exceptions: RNA editing sites were detected in psaB genes of only Z. zamiifolia, whereas the RNA editing site was not found in rpl20 of S. bogneri and rps8 of L.
spinosa. Most of the RNA editing sites were recorded in ndhA, ndhB, ndhD, rpoA, rpoB, rpoC1, and rpoC2 (Table S4). ACG was found as a start codon in gene rpl2 and the RNA editing analyses confirmed conversion of ACG codon to ATG. Most RNA editing sites were found to be related to conversion of serine to leucine. Moreover, almost all editing sites led to accumulation of hydrophobic amino acids in the polypeptide chain of proteins (Table S4).

Repeats analyses
The analyses of microsatellites revealed 104-146 repeats in the genomes. Most of the repeats existed in LSC followed by SSC and then IR (Fig. 4a). We recorded an abundance of mononucleotide repeats in all species, especially Z. zamiifolia. Dinucleotide repeats were in greater abundance in O. aquaticum and L. spinosa, whereas S. bogneri and Z. zamiifolia showed an abundance of mononucleotide repeats and tetranucleotide repeats. Lasia spinosa, Z. zamiifolia and O. aquaticum showed similarities in numbers of tri and tetranucleotide repeats in their respective genomes, but S. bogneri showed a lack of trinucleotide repeats relative to tetranucleotide repeats (Fig. 4b). Pentanucleotide and hexanucleotide repeats were in lower abundance than the other types of repeats and were completely lacking in Z. zamiifolia (Fig. 4a). Most repeats of all six microsatellite types were the A/T rather than G/C motif (Table S5). The analyses of oligonucleotide repeats revealed the existence of a higher number of forward and reverse repeats in all four species. We recorded most repeats in the LSC region as compared to the SSC and IR regions. We also found some shared repeats  Table S6.

Analyses of substitutions types
We recorded a greater number of Ts than Tv substitutions. The ratio of Ts/Tv was 2.3, 2.03, and 2.15 in the genomes of L. spinosa, S. bogneri, and Z. zamiifolia, respectively. The majority of Ts were promoted by A/G rather than C/T, whereas the majority of Tv were found to be related to A/C and G/T rather than A/T and C/G (Table 3). For K s and K a , we found a higher average of K s than K a . Hence, on average, we recorded very low K a /K s for all genes, which shows that purifying selection is acting on these genes. The average values recorded for the different groups of genes were: photosystem I group (K s = 0.1677, K a = 0.0125, and K a /K s = 0.1211), photosystem II (K s = 0.1208, K a = 0.0085, and K a /K s = 0.0671), cytochrome group (K s = 0.1757, K a = 0.0298, and K a /K s = 0.2012), ATP synthase group (K s = 0.1466, K a = 0.0188, and K a /K s = 0.1337), ribosomal small subunit group (K s = 0.1620, K a = 0.0633, and K a /K s = 0.3491), ribosomal large subunit (K s = 0.1639, K a = 0.0677, and K a /K s = 0.4612), NADPH dehydrogenase group (K s = 0.1711, K a = 0.0781, and K a /K s = 0.3935), and RNA polymerase group (K s = 0.1813, K a = 0.3228, K a /K s = 0.1800) (Table S7). Some genes showed neutral selection (K a /K s = 1) in all species, including ndhK, petL, rpl16, ndhF, ndhH, and rps15. Interestingly, we found evidence for positive selection in three genes (ycf2, clpP, rpl36) in only S. bogneri (Table S7).

Phylogenetic inference of Araceae
A phylogenetic tree was reconstructed using 31 species based on an alignment of 93,794 nucleotide sites, of which 13,488 were found to be parsimony informative; 8,843 showed distinct site pattern while the remaining sites (67,229 sites) were shared in all species. The resulting phylogeny shows the monophyly of Lasioideae, Zamioculcadoideae and Orontioideae, with Zamioculcadoideae forming a clade with Stylochaeton.

Discussion
In the current study, we report de novo assembled and fully annotated chloroplast genomes of four species from three subfamilies of Araceae. Comparative chloroplast genomics revealed high similarities in gene content across all species. However, the sizes of these genomes varied due to the variable length of intergenic spacer regions (IGS) and IRs contraction and expansion. The substitution analyses revealed Ts > Tv and K s > K a . The phylogenetic analysis confirmed the monophyly of Lasioideae, Orontioideae and Zamioculcadoideae.
The chloroplast genomes of the four species showed a highly conserved structure in terms of gene content, intron content and gene organization. Similar gene contents were also reported in other subfamilies of Araceae [10,18,28,34]. Moreover, highly conserved chloroplast genomes are reported in other angiosperms [10,18,55]. However, loss of some important protein-coding genes and tRNA genes has been reported in the genus Amorphophallus (Aroideae, Araceae), which might be specific to this genus. The infA gene encodes translation initiation factor I, but we found this gene to be non-functional in all species. This gene is also reported to be non-functional or absent in the chloroplast genomes of other angiosperms including species of Araceae [9][10][11]18,28,55]. Hence, it is suggested that this gene is either transferred to the nuclear genome as an active functional gene, or a functional copy of this important gene might already exist in the nuclear genome [31,56]. We observed duplication of ycf1 genes or origination of pseudogenes of ycf1 and rps15 due to IR contraction and expansion. The duplication of ycf1 and/or rps15 is also reported in species of the subfamily Lemnoideae [28] and two species (Anchomanes hookeri Schott. and Zantedeschia aethiopica Spreng.) of Aroideae [10].
Together with previously published aroid chloroplast genomes, the chloroplast genomes of the current study reveal that IR contraction and expansion might be a species-specific event, opposed to synapomorphies of subfamilies. The duplication of ycf1 and origination of the rps15 pseudogene in O. aquaticum is not observed in species of Symplocarpus Salisb. ex Nutt. [34], and both genes are found in the SSC region. The complete duplication of ycf1 and rps15 is observed in Lemnoideae species [28]; in S. bogneri and Z. zamiifolia, duplication of partial ycf1 is observed, and rps15 completely exists in the SSC region. Moreover, complete duplication of ycf1 and partial duplication of rps15 was observed in L. spinosa, similar to O.
aquaticum. This data suggests that IR contraction and expansion is highly flexible over evolutionary time, and that similar IR boundary architecture across lineages can be the result of homoplasy. In other angiosperms, differential IR contraction and expansion has also been seen in species of the same genus such as Aquilaria Lam. [57,58]. This observation is contradictory to a previous study in which resemblance of IR junctions was suggested for phylogenetic inference [59]. However, these authors did not provide conclusive results and suggested further study in wide range of germplasms.
The high RSCU value has also been previously reported for those codons that contain A/T at 3′ end instead of C/G, which might be due to the high AT content of the chloroplast genomes [3,11,18,55]. We found an abundance of leucine and isoleucine, whereas cysteine was found to be the rarest amino acid. These results are in agreement with previous studies of angiosperms, including the family Araceae [3,9,16,18,28]. We found high similarities in RNA editing sites. Most of the conversions by RNA editing led to the addition of hydrophobic amino acids to polypeptide chains of proteins. A similar pattern of conversion has been noted in the chloroplast genomes of other angiosperms [11,16,19]. We found ACG as a start codon for rpl2 instead of ATG, and RNA editing analyses confirmed the conversion of ACG to ATG. This is in agreement with a previous study [60].
Oligonucleotide repeats generate mutations in genomes [11,31,50,61]. Hence, these repeats are suggested to be used as a proxy to identify mutational hotspots [11,31,50]. In the current study, we analyzed forward and reverse repeats, as these repeats are considered important for the generation of mutations [11,31,50] and show up to 90% co-occurrence with substitutions as reported in the plant family Malvaceae [50]. Previously, high similarities were reported in repeat numbers and types in some angiosperms [3,4,9,11,19,55]. However, previous studies also show significantly variable number of repeats in other species of Araceae, which does not correlate with genome size or phylogenetic position [10,18]. Here, our result agrees with previous studies of Araceae. Notably, penta-and hexanucleotide repeats were completely absent in Z. zamiifolia, while the number of mononucleotide repeats was almost twice as high as in the other genera. In the current study, we changed the criteria for repeat determination following a recent study of Abdullah et al. 2020 [50], and identified repeats of exact match ≥ 14. However, our result still agrees with a previous study in that repeats in the chloroplast genomes of Araceae are independent of genome size and phylogenetic position.
We observed a greater amount of Ts than Tv substitutions within the protein-coding genes , as expected in DNA sequences [62]. However, fewer Ts than Tv has also been reported previously in chloroplast genomes [9,39,63]. Higher Ts than Tv might occur due to genome composition and codon characteristics [64]. Moreover, higher Ts than Tv was also reported in the complete chloroplast genome of Dioscorea polystachya Turcz. [65] and in the coding sequences of the species of Lemnoideae (Araceae). This suggests that species of the plant family Araceae are consistent regarding the existence of higher Ts as compared to Tv.
We observed K a /K s < 1 due to higher K s than K a for most of the protein-coding genes. These results are consistent with previous studies of angiosperm chloroplast genomes, including the family Araceae, as purifying selection pressure mostly acts on the genes of chloroplast