Comparative Chloroplast Genome Analysis of Wax Gourd (Benincasa hispida) with Three Benincaseae Species, Revealing Evolutionary Dynamic Patterns and Phylogenetic Implications

Benincasa hispida (wax gourd) is an important Cucurbitaceae crop, with enormous economic and medicinal importance. Here, we report the de novo assembly and annotation of the complete chloroplast genome of wax gourd with 156,758 bp in total. The quadripartite structure of the chloroplast genome comprises a large single-copy (LSC) region with 86,538 bp and a small single-copy (SSC) region with 18,060 bp, separated by a pair of inverted repeats (IRa and IRb) with 26,080 bp each. Comparison analyses among B. hispida and three other species from Benincaseae presented a significant conversion regarding nucleotide content, genome structure, codon usage, synonymous and non-synonymous substitutions, putative RNA editing sites, microsatellites, and oligonucleotide repeats. The LSC and SSC regions were found to be much more varied than the IR regions through a divergent analysis of the species within Benincaseae. Notable IR contractions and expansions were observed, suggesting a difference in genome size, gene duplication and deletion, and the presence of pseudogenes. Intronic gene sequences, such as trnR-UCU–atpA and atpH–atpI, were observed as highly divergent regions. Two types of phylogenetic analysis based on the complete cp genome and 72 genes suggested sister relationships between B. hispida with the Citrullus, Lagenaria, and Cucumis. Variations and consistency with previous studies regarding phylogenetic relationships are discussed. The cp genome of B. hispida provides valuable genetic information for the detection of molecular markers, research on taxonomic discrepancies, and the inference of the phylogenetic relationships of Cucurbitaceae.


Introduction
Cucurbitaceae is a moderately large family of about 130 genera and 900 species [1]. Because of their economic importance in temperate regions, species of the Cucurbitaceae family have a long and close association with human beings [2]. Familiar edible and medicinal fruits, such as cucumber (Cucumis sativus), melon (Cucumis melo), watermelon (Citrullus lanatus), bottle gourd (Lagenaria siceraria), pumpkin, and squash (Cucurbita spp.) are the main crops of Cucurbitaceae [3]. All are economically valuable fruit crops. The early molecular phylogeny of Cucurbitaceae was reconstructed using five chloroplast markers, which weakly support two subfamilies of Cucurbitoideae and Nhandiroboideae [4]. Recent DNA was extracted from fresh leaves using modified CTAB [26], the quantity and quality of extracted DNA was assessed by spectrophotometry, and the integrity was evaluated using a 1% (w/v) agarose gel electrophoresis [27]. An Illumina TruSeq Library Preparation Kit (Illumina, San Diego, CA, USA) was used to prepare approximately 500 bp of paired-end libraries for DNA inserts, according to the manufacturer's protocol. These libraries were sequenced on the Illumina HiSeq 4000 platform in Novogene (Nanjing, China), generating raw data of 150 bp paired-end reads. About 14.6 Gb high quality, 2 × 150 bp pair-end raw reads were obtained and were used to assemble the complete chloroplast genome of B. hispida.

Genome Assembly and Annotations
The raw data were preprocessed using Trimmomatic 0.39 software [28], including removal of Adapter sequences and other sequences introduced in the sequencing, removal of low-quality and over-N-base reads, etc. The quality of newly produced clean short reads was assessed using FASTQC v0.11.9 and MULTIQC software [29,30], and high-quality data with Phred scores averaging above 35 were screened out. According to the reference sequence (Cucumis melo), the chloroplast-like (cp) reads were isolated from clean reads by BLAST [31]. Short reads were de novo assembled into long contigs with SOAPdenovo 2.04 [32] by setting kmer values as 35, 44, 71, and 101. Finally, the long-contigs complete sequence expansion and gap filling using Geneious ver. 8.1 [33], forming the complete chloroplast genome. The complete chloroplast genome was further validated and calibrated by using de novo splicing script NOVOplasty 4.2 [34]. GeSeq [35] was used to annotate the de novo assembled genomes, tRNAscan-SE ver 1.21 [36] was applied to detect tRNA genes with default settings, and RNAmmer [37] was used to validate rRNA genes with default settings. As a final check, we compared the results with the reference sequence and manually corrected the erroneous genes by GB2Sequin [38]. The circular map of the genomes was drawn by using Organellar Genome DRAW (OGDRAW) [39]. The newly assembled B. hispida chloroplasts genomes were deposited in GenBank, with the accession number MW362306.

Chloroplast Genome Comparison
In order to gain a better understanding of the characteristics of the cp genome of B. hispida, we selected three species that are not only closely related to B. hispida but also representative of Benincaseae to perform comparative analysis. Sequences of their complete chloroplast genome were downloaded from NCBI database, with the following accession numbers: Lagenaria siceraria (MT773628), Citrullus colocynthis (NC_035727), and Citrullus lanatus (KY430692).

Codon Usage and Putative RNA Editing Site
Codon usage and amino acid frequency were calculated by Geneious Prime ® 2020 [40], and relative synonymous codon usage (RSCU) of protein-coding genes was evaluated by MEGA-X [41]. We also used predictive RNA editors for plant chloroplast (PREP-cp) [42] to investigate putative RNA editing sites in the cp genomes of B. hispida, C. colocynthis, C. lanatus, and L. siceraria.

Repeat Sequences and SSR Analysis
MIcroSAtellite identification tool (Misa) [43] was used to determine simple sequence repeats (SSRs) or microsatellites in cp genomes of four species. SSRs were determined by a settled minimum threshold of nine for mononucleotide repeats, four for dinucleotide, and three for tri-, tetra-, penta-, and hexanucleotide repeats. Oligonucleotide repeats were analyzed by REPuter program [44] to find four types of repeats, including forward (F), reverse (R), complementary (C), and palindromic (P). These four types of repeats were detected with a minimum repeat size of 20 bp, edit distance of 3, and 90% similarities.

Comparative Analysis of cp Genomes in Benincaseae
IRscope [45] was used to detect the contraction and expansion of IRs boundaries, which were visualized between four main regions in chloroplast genome (LSC/IRb/SSC/IRa). The mVISTA program [46] was used to compare the cp genome of four species using Shuffle-LAGAN model with Lagenaria siceraria set as the reference sequence.
DnaSP [46] was used to perform sliding window analysis using multiple alignment of complete cp genome of four selected species, along with the determination of synonymous (Ks) and non-synonymous (Ka) substitutions and their ratio (Ka/Ks). Geneious was used to detect the types, numbers, lengths, and positions of SNPs and InDels in LSC, SSC, and IR regions. To further evaluate their natural selection pressure, genes that presented Ka/Ks value greater than one were tested with site model using CodeML [47] algorithm implemented in EasyCodeML [48]. The likelihood ratio test (LRT) was used to compare seven codon substitution models (M0, M1a, M2a, M3, M7, M8, and M8a). The Bayes empirical Bayes (BEB) evaluated the posterior probability of positive selection sites.

Phylogenetic Analysis
We selected and downloaded the sequences of 23 species from Cucurbitales and three outgroup species including Libidibia coriaria (NC_026677), Glycine max (NC_007942) and Solanum lycopersicum (NC_007898) from NCBI to perform phylogenetic tree building. Maximum likelihood (ML) tree was constructed through two approaches. One phylogenetic tree was constructed using complete cp genome and the other was built with 72 gene sequences. MAFFT alignment was performed using 72 concatenated gene sequences and the best-fit model was found by MEGA-X [41] All indels was excluded for both alignments, leaving only substitutions for ML analysis. The best-fit models applied for all three were GTR + G, determined based on Bayesian information criterion (BIC) [49].

Chloroplast Genome Assembly, Organization, and Features of Benincasa Hispida
The paired-end sequencing of B. hispida by Illumina HiSeq 4000 generated around 14.6 GB raw data with 82.6 million 150 bp reads. We de novo assembled its complete chloroplast genome and the data were submitted to NCBI under accession number MW362306 after a thorough check for correctness. As shown in Table 1 and Figure 1, the size of its complete chloroplast genome is 156,758 bp in length, presenting a typical quadripartite structure with a large single-copy region (LSC, 86,538 bp), a small single-copy region (SSC, 18,060 bp) and two inverted repeat regions (IRa/b, 26,080 bp each).  The cp genome of B. hispida had 131 genes (Table 2), including 86 protein-coding genes, 37 tRNA genes, and 8 rRNA genes, 18 of which were duplicated genes (7 protein-coding genes, 7 tRNA genes, and 4 rRNA genes). The total GC content of the cp genome was 37.2%, with the IR regions having the highest GC content at 42.9%, followed by LSC (35%) and SSC (31.7%). In terms of the GC contents of the different gene types, the number of rRNA (55.2%) and tRNA (53.2%) was relatively high, and that of CDS was 37.9%. In total, 18 genes contained introns, 16 of which (10 protein-coding genes and 6 tRNA genes) contained 1 intron, and 2 CDSs (ycf3 and clpP1) possessed 2 introns (Table S1). Among these genes, 17 genes were duplicated in the IR regions except one trans-splicing gene, which was observed in the rps12 gene with 5'-end located in the LSC region and 3' end duplicated in the IR regions. The truncation event was observed in the ycf1 gene, which started in the IRa region and ended at the SSC region, leaving a 100 bp truncated copy in the IRb region. Table 2. Genes predicted in the chloroplast genome of Benincasa hispida. The number of asterisks after the gene names indicates the number of introns contained in the genes.

Codon Usage and Amino Acid Frequencies
The complete cp genome of Benincasa hispida contained 80,109 bp of coding sequences (CDSs) that encoded 86 genes, including 26,703 codons that fit in 64 codon types. The results of the amino acid frequency analysis showed that leucine, with 10.5% occurrence, was the most abundant amino acid, followed by isoleucine, with 8.5%. Cysteine, with only 1.1% abundance, was the amino acid that occurred the least.
The relative synonymous codon usage (RSCU) of the four species was also calculated, presenting a high codon bias of A or T bases. The distribution of the codon usage showed that the codons ending with A or T had RSCU > 1 except GGT (Glycine, 0.96), AGT (Serine, 0.9), and CGT (Arginine, 0.68), revealing that the codons ending with A or T were preferred, while the codons ending with C or G were non-preferred. Among all three stop codons, TAA, with 64% abundance, was the most frequent (Table S2).

Putative RNA Editing Site within Benincaseae
RNA editing events are typical in the cp genomes of most land plants and essential for understanding the chloroplast genome at the transcript level. For this purpose, we determined the RNA editing site in the cp genomes of four species from Benincaseae. In the cp genome of Benincasa hispida, PREP-web found 58 putative RNA editing sites in 21 CDS (Table S3a). Among these genes, the ndhB gene, with thirteen editing sites, was determined to be the most variant gene, followed by ndhD (eight sites) and rpoB (five sites). We also found that 81% of all RNA editing events occurred at the second nucleotide position of the codons, while none of these events were located in the third codon position.
Moreover, these RNA editing events resulted in post-transcriptional substitutions, causing amino acid conversions. In the group of these conversions, fifty-four out of fifty-six RNA editing sites led to hydrophobic products, comprising phenylalanine (9), isoleucine (5), leucine (32), methionine (2), valine (4), and tryptophan (2). Four exceptions led to hydrophilic (neutral) amino acid products, including cysteine (1), tyrosine (2), and serine (1). Furthermore, serine-to-leucine was found to be the most abundant post-transcriptional substitution, with 41.82% of all RNA editing events, followed by proline-to-leucine (14.55%) and serine-to-phenylalanine (7.27%). It is worth mentioning that two RNA editing events were detected that transformed ACG (Thr) to the initiation codon AUG, resulting in the start of translation in the ndhB and ndhD genes.
As shown in Table S3b, the total number of RNA editing sites detected was 57 in Citrullus lanatus and 55 in Lagenaria siceraria and Citrullus colocynthis. All the patterns mentioned above showed high consistency in all four species analyzed, with only minor differences in terms of numerical values.

Repeated Sequence and SSR Analysis
In this study, we analyzed microsatellites or simple sequence repeats (SSRs) in the cp genome of Benincasa hispida, Citrullus lanatus, Lagenaria siceraria, and Citrullus colocynthis using MISA-web, and high similarity was revealed between the four species ( Figure 2). We found that B. hispida contained the most abundant number of SSRs (238), while C. lanatus, with only 219 SSRs, had the least. In the cp genome of B. hispida, most of the SSRs were mononucleotides (42%), varying from 9 to 15 repeat units. Meanwhile, the abundance of dinucleotide was only 25%, which was slightly lower than that of trinucleotide (30%). The frequencies of tetranucleotide and pentanucleotide were only 3% and 0.42%, respectively, and hexanucleotide repeats were absent from all the species ( Figure 2C). Moreover, most of the mononucleotide repeats were A/T motifs, while AT/TA motifs comprised 68% of the dinucleotide repeats (Table S4).
We also analyzed the distribution of SSRs in two different types of regions, specifically LSC/IR/SSC regions and intergenic spacer (IGS)/gene regions. According to the results, most of the repeats were located in the LSC region, varying from 136 in C. lanatus to 148 in B. hispida, followed by the SSC region (38 in B. hispida) and IR regions. Noticeably, the SSC number in the IR regions was 26 in all the species except for L. siceraria (24), implying that the IR regions were more conserved than the LSC and SSC regions (Figure 2A). The IGS regions were determined to have a high abundance of SSRs in comparison with the gene regions. We found 125 SSRs within 46,150 bp IGS regions and 116 SSRs in 112,281 bp gene regions, meaning the density of SSRs in the IGS regions was 2.62 times of that of the gene regions ( Figure 2B). Similar results were present in all the species.  (Figure 2A). The length of repeats was mostly found between 20 and 24 bp ( Figure 2C). The palindromic repeats were the most abundant repeats, followed by forward repeats, whereas the number of reverse repeats was low. None of the species had complementary repeats. We also located the region of each oligonucleotide repeat; significant consistency was presented among the four species. The number was exactly the same in all the species regarding the repeats located in the IR regions (6) and some shared sequences, including sequences between LSC and IRa/b (4), between SSC and IRa/b (2), and from IRb to IRa crossing SSC ( Figure 2B).

IR Contraction and Expansion
The genome length of the chloroplast ranged from 159,758 bp (B. hispida) to 157,147 bp (C. colocynthis). Furthermore, in the cp genome of B. hispida, the length of the IR regions was the shortest with 260,080 bp, while that of the SSC region was the longest with 180,060 bp (Table S5). Thus, we inferred that the variation in size of the cp genome was contributed to by the expansion and contraction of the IR regions ( Figure 3). The junction sites between each region were denoted as: J LB (IRb/LSC), J SA (SSC/IRa), J SB (IRb/SSC), and J LA (IRa/LSC). All eight species analyzed presented functional ycf1 genes, six of which were at J SA , while the other two were located in the SSC region completely. Moreover, the ycf1Ψ (pseudo-copy) was only present in two species (B. hispida and L. siceraria) at J SB and were 3 bp and 25 bp in the SSC region, respectively. The ndhF gene was revealed in all species in J SB with the same length (2246 bp) and relatively consistent position with only a few bp in the IRb region, except C. hystrix, with 21 bp. and B. hispida (completely located in the SSC region).
The rpl2 gene was found close to J LB , while that of two species (C. moschata and C. lanatus) were in the LSC region with 11 bp and 6 bp, respectively. At the same time, the duplicate rpl2 genes were absent in the same two specific species. The rps19 gene was the most variant gene among all the genes close to the IR junction. In the four species, the rps19 genes were 2 bp in the IRb region and the remaining four were completely in LSC region.

Divergence Analysis of Chloroplast Genome
To identify the diversity in the chloroplast genomes of four Benincaseae species, we visualized the percentage of identity between the sequences and colored regions of high conservation using mVISTA program. As shown in Figure 4, the sequences varied remarkably among different regions. Firstly, most of the differences were located in the LSC and SSC regions, while the IR regions were almost identical among the four species except the rps12 gene, revealing that the IR regions were more conservative than the LSC and SSC regions. Moreover, the IGS regions revealed themselves to be remarkably more divergent than the gene regions. Notable divergent non-coding regions included: trnR-UCU-atpA, atpH-atpI, trnT-GGU-psbD, trnL-UAA-trnF-GAA, accD-pasI, and ndhF-rpl32. Genes such as ycf1, ycf2, accD, psbA, ccsA, ndhF, and matK were found to be highly divergent coding genes. The Ka/Ks ratio is an essential index to identify a mutation as neutral, purifying, or beneficial. Thus, we compared B. hispida with C. colocynthis, C. lanatus, and L. siceraria, respectively, to analyze the synonymous substitutions (Ks), the non-synonymous substitutions (Ka), and their ratio (Ka/Ks) of 73 PCGs (Table S6). In total, 18 genes could not be determined due to absent information (Ks = 0). After deleting these genes, as well as the non-substitution results, we found that the genes carrying out photosynthesis functions revealed Ka/Ks = 0 or at relatively low values, indicating that these groups of genes were fairly conserved. The Ka/Ks ratio of 26 genes was lower than 0.5 and that of 96% genes was lower than 1, with only 5 exceptions (accD, clpP, rps4, ycf1, and ycf2). We then performed a purifying/positive selection site evaluation for these five genes (Table S7). However, only two genes o, accD and rps4, presented sites potentially under positive selection, indicated by the high empirical Bayes values (Table 3).
To obtain a holistic understanding of the sequence divergence, we performed a sliding window analysis to visualize the nucleotide variability values of all the cp genomes. We found that none of the π values of the CDS genes exceeded 0.05 and that the IGSs were more divergent than the gene regions, which was consistent with the aforementioned analysis. It can be clearly seen in the figure that the SSC and LSC regions were much more divergent than the IR regions, the π value of which was remarkably low and mirror-symmetrized with SSC as the center ( Figure 5).

Phylogenetic Analysis
To locate the phylogenetic position of B. hispida precisely, we selected 26 species (Table S8) and constructed two phylogenetic trees using the complete cp genome ( Figure 6A) and 73 selected CDSs ( Figure 6B), respectively. The results all suggested that B. hispida was closely related with Cucumis, Citrullus, and Lagenaria as their sister group, with fairly high bootstrap values. The phylogenetic relationship results of the two approaches presented high consistency, with two main variations. Firstly, in general, the bootstrap values in the tree that applied the complete cp genome were higher than in the tree constructed with 73 CDSs (Figure 6B). In addition, Begoniaceae was a sister group with Coriariaceae and Corynocarpaceae, according to Figure 6A, while in Figure 6B, Coriariaceae and Corynocarpaceae were the early-diverging lineages of Begoniaceae. However, only 82 bootstrap values supported the former situation ( Figure 6A) while 94 supported the second ( Figure 6B).

Discussion
In this study, we sequenced and reported the complete chloroplast genome of Benincasa hispida and performed comparative analyses with three other, closely related species selected from the Benincaseae but distinct enough to obtain reasonable results, providing valuable genetic data for phylogeographic and population genetic investigation [50,51].
The cp genome revealed high consistency in terms of its quadruple structure, gene content, and organization not only in Benincaseae [52,53], but also in other angiosperms [54]. The genome size differed less than 400 bp, with almost identical gene numbers, signifying that the cp genomes among the four analyzed species were conservative on the whole. The GC content of B. hispida varied across different regions and functions. The rRNA sequences were considerably rich in GC bases; as a consequence, the IR regions rich in rRNA appear to have had higher GC content than the other regions. These findings agree with those of previous studies [55,56].
However, changes were found that provide valuable information for understanding the development and evolution [50,57]. The bias of the codon usage in the plant cp genome was an important evolutionary feature for the studies regarding mRNA translation, new gene discovery, and molecular biology [58]. Previous studies have confirmed that genes tend to choose preferred synonymous codons for specific amino acids rather than randomly distributions [59,60]. Our study showed that genes of B. hispida prefer codons with A/T in the third position, which was consistent with previous studies [61,62].
Microsatellites, or SSRs, are widely distributed in cp genomes that serve as molecular markers for phylogenetic relationship inference [63,64]. Moreover, SSRs are also related to different types of genome rearrangement, recombinations, and large inversions [65,66]. Similar to previous studies, we found that mononucleotide repeats were the most abundant types of repeat and that their numbers in the LSC region far surpassed those in the SSC and IR regions [67]. Furthermore, a greater number of palindromic repeats were found among four types of repeat, while previous studies revealed that the forward repeats were the most abundant repeats [61,68]. We specifically analyzed the abundance of SSRs that differed from gene regions to intronic gene regions and verified that the IGSs contained much higher SSR density than the others. Thus, we inferred that IGS regions may undergo gene rearrangement and recombination more frequently than gene regions. Moreover, our results support the hypothesis that cpSSRs are more often composed by polyA or polyT than polyG or polyC [69,70], implying that IGSs might be relatively rapidly mutating regions [71,72].
It is commonly agreed that variations in genome size in the chloroplast are the consequence of IR contraction and expansion, leading to gene duplication and deletion and the presence of pseudogenes [68,73]. We found that ycf1Ψ pseudogenes were only detected in B. hispida and L. siceraria, which were also sister groups in the ML phylogenetic trees. Furthermore, no rps19Ψ pseudogenes were observed in any of the species analyzed; their presence was thought to be responsible for the loss of function of the rps19 gene [74,75]. These results imply that gene variation at IR boundaries may contribute to the understanding of the cp genome at a molecular level and serve as an indicator for evolutionary investigation [76,77].
It is worthwhile to study the genetic diversity among the four Benincaseae species because the chloroplast genome plays a crucial role in the study of phylogeny, gene flow between species, and population genetics among different species. [64,78]. The coding regions were generally found to be more conserved than the non-coding regions. Furthermore, some of the coding genes, namely the ycf1, ycf2, matK, accD, and ndhF genes, were commonly found to be relatively divergent [79,80]. In addition, the LSC and SSC regions were further confirmed to be more divergent in comparison with the IR regions [81]. We also discovered that genes related to photosynthesis with low Ka/Ks ratios showed slow evolution rates, while functional genes, such as accD, revealed high evolutional rates, indicating that genes carrying out vital functions were conserved and vice versa [74,82].
Among the five genes that showed Ka/Ks values greater than one, two genes, accD and rps4, presented one positive selection site, respectively. These results indicate that the accD gene may have changed under evolutionary pressure [83] Currently, protein-coding genes are commonly implemented for phylogenetic tree building [84]. While the complete cp genome contains richer information but requires a longer time to perform, higher-end equipment and the population distance may be exaggerated for the highly divergent features of IGS genes [85]. In this study, we applied both methods to build the phylogenetic trees. The first tree, built with the complete cp genome, revealed higher bootstrap values in general, while the other tree, built with coding genes, showed a slightly different phylogenetic order in four species, out of twenty-six in total. In general, the phylogenetic position revealed was consistent with previous studies [86][87][88][89]. However, the phylogenetic relationships we discovered within Cucurbitaceae differed from the results of previous phylogenetic marker-based taxonomy research [90,91]. This may have been due to the different approaches used for phylogenetic tree construction, with further investigation needed.
In conclusion, our study first shed light on the structure and content of the cp genome of B. hispida, an economically important fruit crop widely distributed in several tropical countries and extensively consumed worldwide. We also offered information regarding similarities and divergence, enriching the understanding of the species of Benincaseae. Moreover, information about highly polymorphic regions was also provided regarding molecular markers and highly divergent regions, which might be useful for further studies of the taxonomy and phylogeographics of Benincaseae subfamilies.

Data Availability Statement:
The data that support the findings of this study are openly available in the GenBank of NCBI at https://www.ncbi.nlm.nih.gov (accessed on 13 January 2022), reference number (MW362306).

Conflicts of Interest:
The authors declare that they have no conflict of interest.