Comparative Genomics of the Balsaminaceae Sister Genera Hydrocera triflora and Impatiens pinfanensis

The family Balsaminaceae, which consists of the economically important genus Impatiens and the monotypic genus Hydrocera, lacks a published complete chloroplast genome sequence. Chloroplast genome sequences of the two sister genera are therefore valuable for clarifying the phylogenetic position of Balsaminaceae and understanding its evolution within the Ericales. In this study, the complete chloroplast (cp) genomes of Impatiens pinfanensis and Hydrocera triflora were assembled and characterized using a high-throughput sequencing method. Both cp genomes possess the typical quadripartite structure of land plant chloroplast genomes, with double-stranded molecules of 154,189 bp (Impatiens pinfanensis) and 152,238 bp (Hydrocera triflora) in length. A total of 115 unique genes were identified in both genomes, of which 80 are protein-coding genes, 31 are distinct transfer RNA (tRNA) genes and four are distinct ribosomal RNA (rRNA) genes. Thirty codons, 29 of which end in A/T, had relative synonymous codon usage values of >1, whereas G/C-ending codons displayed values of <1. The simple sequence repeats consist mostly of mononucleotide A/T repeats in all examined cp genomes. Phylogenetic analysis based on 51 common protein-coding genes indicated that the Balsaminaceae family formed a lineage with Ebenaceae together with all the other Ericales.


Introduction
The family Balsaminaceae of the order Ericales contains only two genera, Impatiens Linnaeus (1753:937) and Hydrocera Wight and Arnott (1834:140), whose members are predominantly perennial and annual herbs [1]. The monotypic genus Hydrocera, with the single species Hydrocera triflora, is characterized by actinomorphic flowers and a pentamerous calyx and corolla without any fusion between perianth parts, in contrast to the otherwise highly similar sister genus Impatiens, whose flowers are highly zygomorphic [2]. Impatiens, one of the largest genera in angiosperms, consists of over 1000 species [3][4][5][6] primarily distributed in the Old World tropics, subtropics and temperate regions, but also in Europe and Central and North America [5,7]. In contrast, the sister genus Hydrocera, which is a semi-aquatic plant, is restricted to the lowlands of Indo-Malaysia [1]. The geographical regions occupied by Impatiens, including south-east Asia, the eastern Himalayas, tropical Africa, Madagascar, southern India and Sri Lanka, have been identified as diversity hotspots [7,8]. In recent years, numerous new species have been recorded within these regions each year [9][10][11][12][13][14].
The classification of the genus Impatiens is controversial [1,15] owing, for example, to its variable floral characters, hybridization and species radiation, and the genus has consequently remained under-studied. The species of this prolific genus are used economically as ornamentals, as medicinal plants, and as experimental research materials [16]. Additionally, previous studies have shown the genus Impatiens to possess potential anticancer compounds that decrease patients' cancer cell count and increase their life span and body weight [17]. The glanduliferins A and B isolated from the stem inhibit the growth of human cancer cells [18]. Some polyphenols from Impatiens stems have also shown antioxidant and antimicrobial activities [19].
In angiosperms, the chloroplast (cp) genome typically has a quadripartite organization consisting of a small single copy (SSC, 16-27 kb) region and a large single copy (LSC, about 80-90 kb) region separated by two identical copies of inverted repeats (IRs, about 20-88 kb), with the total size of the complete chloroplast genome ranging from 72 to 217 kb [20][21][22]. Most complete cp genomes contain 110-130 distinct genes, with approximately 80 genes coding for proteins, 30 tRNA genes and 4 rRNA genes [21]. Because their gene order and gene content are highly conserved, cp genomes have been used in plant evolution and systematic studies [23], in determining evolutionary patterns of the cp genomes [24], in phylogenetic analysis [25,26], and in comparisons of angiosperm, gymnosperm, and fern families [27]. Moreover, cp genomes are useful in genetic engineering [28], in phylogenetics and phylogeography of angiosperms [29], and in estimating the diversification pattern and ancestral state of the vegetation within a family [30].
The Ericales (Bercht and Presl) form a well-supported clade (Asterid) containing more than 20 families [31]. Up to now, complete cp genomes representing approximately half of the families in the order Ericales have been sequenced, including Actinidiaceae [32,33], Ericaceae [34,35], Ebenaceae [36], Sapotaceae [37], Primulaceae [38,39], Styracaceae [40], and Theaceae, Pentaphylacaceae, Sladeniaceae, Symplocaceae and Lecythidaceae [30]. In addition, the intergeneric phylogenetic relationship between Impatiens and Hydrocera has been investigated using chloroplast atpB-rbcL spacer sequences [4]. However, there are no reports of complete chloroplast genomes in the family Balsaminaceae to date. This lack of genetic information has hindered progress in understanding the taxonomy, phylogeny, evolution and genetic diversity of Balsaminaceae. Analyses of more cp genomes are needed to provide a robust picture of generic and familial relationships in the order Ericales.
This study aims to determine the complete sequences of the chloroplast genomes of I. pinfanensis (Hook. f.) and H. triflora using a high-throughput sequencing method. Additionally, comparisons with other published cp genomes in the order Ericales will be made in order to determine phylogenetic relationships among the representatives of Ericales.

The I. pinfanensis and H. triflora Chloroplast Genome Structure and Gene Content
The complete chloroplast genomes of I. pinfanensis and H. triflora share a typical quadripartite structure composed of a pair of inverted repeats (IRs) separating a large single copy (LSC) and a small single copy (SSC) region, similar to other angiosperm cp genomes [23]. As in typical angiosperms, both cp genomes encode 115 distinct genes, of which 80 are protein-coding genes, 31 are distinct tRNA genes and four are distinct rRNA genes. Of these, 62 protein-coding genes and 23 tRNA genes are located in the LSC region; seven protein-coding genes, all four rRNA genes and seven tRNA genes are duplicated in the IR regions; and the SSC region contains 11 protein-coding genes and one tRNA gene. The ycf1 gene is located at the IR and SSC boundary region (Figures 1 and 2).
Among the 115 unique genes in the I. pinfanensis and H. triflora cp genomes, 14 genes contain one intron, comprising eight protein-coding genes (atpF, rpoC1, rpl2, petB, rps16, ndhA, ndhB, ndhK) and six tRNA genes (trnL-UAA, trnV-UAC, trnK-UUU, trnI-GAU, trnG-GCC and trnA-UGC) (Table 2), while the ycf3, clpP and rps12 genes each contain two introns. These genes have maintained their intron content in other angiosperms.
The trans-spliced gene rps12 has its 5′ exon located in the LSC region, whereas the 3′ exon is located in the IRs, which is similar to that in Diospyros species (Ebenaceae) [36,41] and Actinidia chinensis (Actinidiaceae) [41]. Unusually, the rps19 and ndhD genes in both species begin with the uncommon start codons GTG and ACG respectively, which is consistent with previous reports in other plants [36]. However, the standard start codon can be restored through RNA editing [42,43].
The complete cp genomes of I. pinfanensis and H. triflora were found to be similar. Although the two species belong to the same family, Balsaminaceae, slight differences in genome size, gene loss and IR expansion and contraction were detected. For instance, the H. triflora cp genome is 1951 bp smaller than that of its sister species I. pinfanensis. The SSC region of I. pinfanensis is shorter (17,611 bp) than that of H. triflora, which is 18,082 bp long. The GC content of H. triflora is slightly higher (36.9%) than that of I. pinfanensis (36.8%). In both species the GC content is highest in the IR regions (43.1%) and lower in the LSC (34.5%/34.7%) and SSC (29.3%/29.9%) regions. The IR region is more conserved than the single copy region (SSC) in both species owing to the presence of conserved rRNA genes in the IR region, which is also the reason for its high GC content. Both cp genomes are AT-rich, and the genome organization and content of the two species are almost the same and highly conserved. These results are similar to those of other recently published Ericales chloroplast genomes [34,36].
Table 2. Genes encoded in the Impatiens pinfanensis and Hydrocera triflora chloroplast genomes.
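The GC percentages and the genome size difference reported in this section follow from simple counting. The sketch below is illustrative only: the function name `gc_content` and the toy sequences are assumptions made for the example, not part of the original analysis, and only the two genome lengths are taken from the text.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a nucleotide sequence.

    Ambiguous symbols (e.g. N) are ignored, which is an assumption of this sketch.
    """
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / counted


# Toy fragments standing in for genome regions (hypothetical, not real data).
toy_regions = {"LSC-like": "ATTTAGCAATTA", "IR-like": "GGCATCGCAT"}
for name, fragment in toy_regions.items():
    print(name, round(gc_content(fragment), 1))

# Genome lengths as reported in the text (bp).
i_pinfanensis_len = 154_189
h_triflora_len = 152_238
size_difference = i_pinfanensis_len - h_triflora_len
print("size difference (bp):", size_difference)
```

Applied to each annotated region in turn, the same function would give the per-region GC values that are compared above.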

Codon Usage
The relative synonymous codon usage (RSCU) has been divided into four models, i.e., RSCU value of less than 1.0 (lack of bias), RSCU value between 1.0 and 1.2 (low bias), RSCU value between 1.2 and 1.3 (moderate bias) and RSCU value greater than 1.3 (high bias) [44,45]. To determine codon usage, we selected 52 protein-coding genes shared between I. pinfanensis and H. triflora with a length of >300 bp for calculating the effective number of codons. As shown in Table 3, the RSCU values revealed biased codon usage in both species: 30 codons showed preference (RSCU > 1), 29 of which end in A/T, whereas tryptophan and methionine showed no bias. The TAA stop codon was found to be preferred. The protein-coding genes contained 22,900 and 22,995 codons in the I. pinfanensis and H. triflora cp genomes respectively. In addition, our results indicated that 2408 and 2439 codons encode leucine, while 253 and 259 encode cysteine, in the I. pinfanensis and H. triflora cp genomes, making these the most and least frequent amino acids respectively. The effective number of codons (Nc) of the individual PCGs varied from 37.10 (petD) to 54.84 (ycf3) in I. pinfanensis and from 32.11 (rps18) to 54.24 (rpl2) in H. triflora (Table S1). As recently reported in cp genomes of higher plants, our study showed bias in the usage of synonymous codons for all amino acids except tryptophan and methionine. Our result is in line with previous findings of codon usage preference for A/T-ending codons in other land plants [46,47].
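For readers unfamiliar with RSCU, the value for a codon is its observed count divided by the count expected if all synonymous codons for that amino acid were used equally. The following is a minimal sketch of that definition using the standard genetic code; it is not the CodonW implementation used in this study, and the function name `rscu` and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict[str, float]:
    """Compute RSCU for every codon of each amino acid present in `cds`.

    RSCU = observed count / (total count for the amino acid / number of
    synonymous codons). Codons containing non-ACGT symbols are skipped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(
        cds[i:i + 3] for i in range(0, usable, 3) if cds[i:i + 3] in CODON_TABLE
    )
    families: dict[str, list[str]] = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values: dict[str, float] = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


# Toy example: four leucine codons, three of them TTA (A-ending).
toy = rscu("TTATTATTACTG")
print(round(toy["TTA"], 2), round(toy["CTG"], 2))
```

Methionine (ATG) and tryptophan (TGG) each have a single codon, so their RSCU is always 1 whenever they occur, which is why they are excluded from statements about codon bias.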

SSR Analysis Results
Analysis of SSR occurrence using the microsatellite identification tool (MISA) detected mono-, di-, tri-, tetra-, penta- and hexanucleotide categories of SSRs in the cp genomes of eight Ericales. A total of 197 and 159 SSRs were found in the I. pinfanensis and H. triflora cp genomes respectively. Not all SSR types were identified in all the species: penta- and hexanucleotide repeats were not found in I. pinfanensis, Diospyros lotus, and Pouteria campechiana, while only hexanucleotides were not identified in Ardisia polysticta and Barringtonia fusicarpa (Table 4). Among the SSR types discovered, mononucleotide repeat units were the most highly represented, occurring 180 and 141 times in I. pinfanensis and H. triflora respectively. Most of the mononucleotide repeats consisted of A or T (117-176 times), whereas C/G repeats were fewer (1-8 times), and all the dinucleotide repeat sequences in all the species were AT repeats. This result is consistent with previous reports, which showed most angiosperm cp genomes to be AT-rich [36,38,48].
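The kind of search performed in an SSR analysis can be illustrated with a short scan for perfect tandem repeats. This is a simplified sketch and not the MISA program itself; the minimum repeat thresholds in `MIN_REPEATS` are assumed values chosen for the example (the study used the parameters of Gichira et al. [48], which are not restated here), and the function names are hypothetical.

```python
# Assumed minimum repeat counts per motif length (illustrative only).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(unit: str) -> bool:
    """True if the motif is not itself a repeat of a shorter motif (e.g. 'AA')."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq: str, min_repeats: dict[int, int] = MIN_REPEATS):
    """Return (1-based start, motif, repeat count) for perfect SSRs, scanning left to right."""
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        found = False
        for n, min_rep in sorted(min_repeats.items()):
            unit = seq[i:i + n]
            if len(unit) < n or set(unit) - set("ACGT") or not is_primitive(unit):
                continue
            reps = 1
            # Extend while the next block of n bases equals the motif.
            while seq[i + reps * n:i + (reps + 1) * n] == unit:
                reps += 1
            if reps >= min_rep:
                hits.append((i + 1, unit, reps))
                i += reps * n  # continue after the repeat
                found = True
                break
        if not found:
            i += 1
    return hits


toy_sequence = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "GG"
for start, motif, count in find_ssrs(toy_sequence):
    print(start, motif, count)
```

Counting the hits by motif length and motif composition would yield a summary of the type reported in Table 4.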

Selection Pressure Analysis of Evolution
The ratio of non-synonymous (Ka) to synonymous (Ks) substitutions can indicate whether selection pressure has acted on a particular protein-coding sequence. Eighty common protein-coding genes shared by the I. pinfanensis and H. triflora genomes were used. As suggested by Makałowski and Boguski [49], Ka/Ks values are less than one in protein-coding genes because non-synonymous (Ka) nucleotide substitutions are less frequent than synonymous (Ks) substitutions (Table S2). We found that the Ka/Ks values between the two species were low (<1) and approached zero, except for one gene, psbK, located in the LSC region, which has a ratio of 1.0259 (Figure 3). This indicates negative selection on all genes except psbK and shows that the protein-coding genes in both species are highly conserved (Table S2). The average Ks values of the LSC, SSC, and IR regions between the two species were 0.0995, 0.0314, and 0.1334 respectively. Based on the Ka/Ks comparison among the regions, only the ycf1 gene in the IR region and most of the genes in the LSC and SSC regions revealed higher Ks values. The higher Ks values signaled that, on average, more genes found in the SSC region have experienced higher selection pressures than those in the other cp genome regions (LSC and IR). The non-synonymous (Ka) value varied from 0.005 (psbE) to 0.0927 (ycf1), while Ks ranged from 0.058 (psbN) to 0.2944 (ndhE). Based on sequence similarity among the IR, SSC and LSC regions, the IR region was the most conserved. This is in agreement with previous reports that the IR region diverges at a slower rate than the LSC and SSC regions, as a result of frequent recombination events in the IR region leading to selective constraints on sequence homogeneity [50,51].
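The conventional reading of a Ka/Ks ratio can be written down in a few lines. The sketch below is illustrative: the function names and the two example gene entries are hypothetical, and only the psbK ratio of 1.0259 is taken from the text. It does not reproduce the substitution-rate estimation itself, which requires a codon alignment and a dedicated method.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Return Ka/Ks; the ratio is undefined when Ks is zero."""
    if ks <= 0:
        raise ValueError("Ks must be positive for the ratio to be defined")
    return ka / ks


def selection_label(ratio: float) -> str:
    """Conventional interpretation of a Ka/Ks ratio."""
    if ratio < 1:
        return "purifying (negative) selection"
    if ratio > 1:
        return "positive selection"
    return "neutral evolution"


# Hypothetical Ka and Ks values, used only to exercise the functions.
examples = {"geneA (hypothetical)": (0.010, 0.100), "geneB (hypothetical)": (0.050, 0.050)}
for gene, (ka, ks) in examples.items():
    print(gene, selection_label(ka_ks_ratio(ka, ks)))

# Ratio reported in the text for psbK.
print("psbK", selection_label(1.0259))
```

A ratio only marginally above one, such as the psbK value, is difficult to distinguish from neutrality without a statistical test, so the label for such cases should be treated with caution.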

IR Expansion and Contraction
Despite the highly conserved nature of the angiosperm inverted repeat (IRa/b) regions, contraction and expansion at the IR junctions are common evolutionary events that result in varying cp genome sizes [52,53]. In our study, the IR/SSC and IR/LSC borders of I. pinfanensis and H. triflora were compared to those of six other Ericales representatives (P. persimilis, P. campechiana, D. lotus, B. fusicarpa, A. kolomikta and A. polysticta) to identify IR expansion or contraction (Figure 4). The IRb/SSC boundary in all eight species extended into the ycf1 gene, creating long ψycf1 pseudogene fragments of varying length. The ycf1 pseudogene is 1101 bp long in I. pinfanensis, 1095 bp in H. triflora, 394 bp in A. kolomikta, 974 bp in A. polysticta, 1058 bp in B. fusicarpa, 1203 bp in D. lotus, 1078 bp in P. campechiana and 1018 bp in P. persimilis. Additionally, the ndhF gene is situated in the SSC region in I. pinfanensis, H. triflora, A. kolomikta, D. lotus, and P. persimilis, at 32 bp, 9 bp, 71 bp, 10 bp and 44 bp from the IRb/SSC boundary respectively, whereas it overlaps the ycf1 pseudogene by 3 bp, 1 bp and 1 bp in the A. polysticta, B. fusicarpa and P. campechiana cp genomes, in that order. The rps19 gene is located at the IRb/LSC junction in I. pinfanensis, H. triflora and five of the other cp genomes; the exception is A. kolomikta, in which this gene lies in the LSC region, 151 bp from the LSC/IRb junction. Moreover, the occurrence of the rps19 gene at the LSC/IRb junction resulted in partial duplication of this gene at the corresponding IRa/LSC border in the I. pinfanensis, H. triflora, and A. polysticta cp genomes. The trnH gene is located in the LSC region in I. pinfanensis and H. triflora. In the A. kolomikta chloroplast genome, however, trnH has been rearranged and completely duplicated in the IR, 630 bp from the IR/LSC junction, with the psbA gene extending towards the LSC/IRa border; in the other five chloroplast genomes trnH is found in the LSC region.
The border regions of the Ericales revealed that the I. pinfanensis and H. triflora cp genomes varied only slightly from the other analyzed cp genomes. As shown in Figure 4, our analyses confirmed IR evolution, as revealed by the incomplete rps19 gene duplicated in the IR region in I. pinfanensis, H. triflora, and A. polysticta. Conversely, the rps19 gene was not duplicated in the remaining Ericales cp genomes. Recent studies [36,54] found that trnH gene duplication occurs in Actinidiaceae and Ericaceae. Such gene duplications at the LSC/IRb and IRa/LSC junctions would be of great importance in systematic studies. Furthermore, the rps19 gene at the LSC/IRb junction in I. pinfanensis and H. triflora extends well into the IRb region (199 bp and 100 bp respectively). The SSC region of I. pinfanensis is 471 bp smaller than that of its sister species H. triflora and is also the smallest among the species used in this study. Additionally, the I. pinfanensis LSC region is smaller than that of the other species. Although previous studies have shown expansion of the single copy (SC) and IR regions of angiosperm cp genomes during evolution [50,55], the border areas of the I. pinfanensis and H. triflora cp genomes were highly conserved despite the slight genome size differences between the two species.

Phylogenetic Analysis
Phylogenetic relationships within the order Ericales have been resolved in recently published reports, but the position of Balsaminaceae remains controversial [33,35,36,37,38,39,40]. In our study, the phylogenetic relationships of I. pinfanensis, H. triflora and 38 other species of Ericales downloaded from GenBank (Table S3) were determined, with four cp genome sequences belonging to Cornales used as outgroup species. Fifty-one protein-coding sequences common to all the selected cp genomes were combined into a single alignment data matrix of 35,548 characters (Supplementary Materials File S4). Almost all the nodes in the phylogenetic tree showed strong bootstrap support. Sapotaceae and Ebenaceae, however, had low support (bootstrap < 70), which could be a result of the few samples available for these families (Figure 5). I. pinfanensis and H. triflora, as sister taxa (Balsaminaceae), formed the basal family of Ericales with strong support. In general, all the 38 species together with the two Balsaminaceae species formed a lineage (Ericales) clearly distinct from the four outgroup species (Cornales). All the species grouped into 10 clades corresponding to the 10 families in the order Ericales according to the APG IV system [31]. This study will provide resources for species identification and resolution of deeper phylogenetic branches among the Impatiens and Hydrocera genera.

Plant Materials and DNA Extraction
Total genomic DNA was extracted from fresh leaves of I. pinfanensis and H. triflora collected from Hubei province (108°42′19″ E, 30°12′33″ N) and Hainan province (110°18′57″ E, 19°23′10″ N) in China using a modified cetyltrimethylammonium bromide (CTAB) method [56]. The DNA quality was checked using spectrophotometry and its integrity examined by electrophoresis in a 2% agarose gel. The voucher specimens (HIB-lzz07, HIB-lzz18) were deposited at the Wuhan Botanical Garden herbarium (HIB).

Chloroplast Genome Sequence Assembly and Annotation
The paired-end libraries were constructed using the Illumina HiSeq 2500 platform at NOVOgene Company (Beijing, China) with an average insert size of approximately 150 bp for each genome.
High-quality reads were filtered from the Illumina raw reads using PRINSEQ lite v0.20.4 (San Diego State University, San Diego, CA, USA) [57] (Phred quality ≥ 20, length ≥ 50) and then aligned against the cp genomes of closely related species using BLASTn (E value of 10⁻⁶), with Primula chrysochlora (NC_034678) and Diospyros lotus (NC_030786) as reference species. The software Velvet v1.2.10 (Wellcome Trust Genome Campus, Hinxton, Cambridge, UK) [58] was then used to assemble the obtained reads with K-mer lengths of 99-119. The resulting consensus sequences were mapped to the reference chloroplast genome using GENEIOUS 8.0.2 (Biomatters Ltd., Auckland, New Zealand) [59]. Local BLAST searches were used to verify the single copy (SC) and inverted repeat (IR) boundary regions of the assembled sequences.
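The read-filtering criteria given above (quality ≥ 20, length ≥ 50) can be illustrated with a small FASTQ filter. This is a sketch under stated assumptions and not the PRINSEQ implementation: it assumes that the quality threshold applies to the mean Phred score of each read and that qualities use the Phred+33 encoding; the function names and toy reads are hypothetical.

```python
from io import StringIO


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_fastq(handle, min_mean_q: float = 20, min_len: int = 50):
    """Yield (header, sequence, quality) for reads passing both thresholds."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().rstrip()
        # The length check runs first, so empty reads never reach mean_phred.
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


def record(name: str, seq: str, qual: str) -> str:
    """Format one four-line FASTQ record."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Toy reads: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
demo = StringIO(
    record("good", "A" * 60, "I" * 60)
    + record("low_quality", "C" * 60, "#" * 60)
    + record("too_short", "G" * 10, "I" * 10)
)
kept = [header for header, _, _ in filter_fastq(demo)]
print(kept)
```

Only reads passing both thresholds would be carried forward to the reference screening and assembly steps.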
The complete cp genomes were annotated using DOGMA (Dual Organellar GenoMe Annotator, University of Texas at Austin, Austin, TX, USA) [60]. The positions of the start and stop codons were further checked by local BLAST searches, and the tRNA locations were confirmed with tRNAscan-SE v1.23 (http://lowelab.ucsc.edu/tRNAscan-SE/) [61]. The circular cp genome maps were generated using the online program OrganellarGenomeDRAW (OGDraw v1.2, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany) [62] with default settings plus manual corrections. Putative tRNA, rRNA and protein-coding genes were corrected by comparing them with the most similar reference species, Primula chrysochlora (NC_034678) and Diospyros lotus (NC_030786), identified from BLASTN and BLASTX searches against the NCBI nucleotide database (https://blast.ncbi.nlm.nih.gov/). The cp genome sequences were submitted to the GenBank database under the accession numbers MG162586 (I. pinfanensis) and MG162585 (H. triflora).

Genome Comparison and Structure Analyses
The IR and SC boundary regions of I. pinfanensis and H. triflora and of the other six Ericales species were compared and examined. For synonymous codon usage analysis, 52 protein-coding genes of length > 300 bp were chosen. The program CodonW 1.4.2 (http://downloads.fyxm.net/CodonW-76666.html) was used to investigate the Nc and RSCU parameters. The simple sequence repeats (SSRs) of the two study species and the other Ericales representatives were detected using MISA software [63] with the SSR search parameters set as in Gichira et al. [48].

Phylogenetic Analyses
To locate the phylogenetic positions of I. pinfanensis and H. triflora (Balsaminaceae) within the order Ericales, the chloroplast genome sequences of 38 species belonging to the order Ericales, with four Cornales species as outgroups, were used to reconstruct a phylogenetic tree. The phylogenetic tree was inferred by maximum likelihood (ML) analysis using RAxML version 8.0.20 (Scientific Computing Group, Heidelberg Institute for Theoretical Studies, Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany) [65]. Based on the Akaike information criterion (AIC), the best-fitting substitution model (GTR + I + G; p-inv = 0.47, gamma shape = 0.93) was selected with jModelTest v2.1.7 [66]. Bootstrap support was assessed in RAxML with 1000 replicates.
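Model selection by AIC, as mentioned above, compares candidate substitution models by AIC = 2k − 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood; the model with the lowest AIC is preferred. The sketch below illustrates the comparison only. The log-likelihood values are hypothetical and were not produced by jModelTest, and the parameter counts cover the substitution model alone, on the assumption that branch-length parameters are shared by all candidates on a fixed tree.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


# (hypothetical ln L, number of free substitution-model parameters)
candidates = {
    "JC": (-12500.0, 0),
    "HKY": (-12100.0, 4),
    "GTR": (-12050.0, 8),
    "GTR+I+G": (-11900.0, 10),
}

scores = {model: aic(lnl, k) for model, (lnl, k) in candidates.items()}
best_model = min(scores, key=scores.get)

for model, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{model:8s} AIC = {score:.1f}")
print("selected:", best_model)
```

The penalty term 2k is what prevents a parameter-rich model from being chosen merely because it fits slightly better.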

Conclusions
The cp genomes of I. pinfanensis and H. triflora from the family Balsaminaceae provide novel genome sequences and will be of benefit as a reference for further complete chloroplast genome sequencing within the family. The genome organization and gene content are well conserved and typical of most angiosperms. Fifty-one protein-coding sequences, shared by the selected species from Ericales as well as our study species, were used to construct the phylogenetic tree using maximum likelihood (ML). The majority of the nodes showed strong bootstrap support values, and the few nodes with low support should be resolved using other methods (e.g., restriction-site-associated DNA sequencing). The two species (I. pinfanensis and H. triflora) were placed close to each other. These findings strongly support Balsaminaceae as a basal family of the order Ericales. Lastly, the Balsaminaceae (I. pinfanensis and H. triflora) group with the other 38 species into one clade (Ericales). This study will be of value in determining genome evolution and understanding phylogenomic relationships within Ericales, and provides valuable resources for the evolutionary study of Balsaminaceae.