Identification of the Complete Chloroplast Genome of Malus zhaojiaoensis Jiang and Its Comparison and Evolutionary Analysis with Other Malus Species

The genus Malus is rich in species and many of its plastid genomes have been released. However, limited resources and few markers are not conducive to the comparison of differences among species and resource identification and evaluation. In this study, the complete chloroplast genome of Malus zhaojiaoensis was studied by NGS sequencing, with a total length of 159998 bp. It consists of four regions, LSC (88,070 bp), IRB (26,359 bp), SSC (19,210 bp) and IRA (26,359 bp). M. zhaojiaoensis cp genome contained a total of 111 genes made up of three classes: 78 coding sequences, 29 tRNA genes, and four rRNA genes. In addition, a total of 91 SSRs and 43 INEs were found in the M. zhaojiaoensis cp genome, which was slightly different from M. baccata and M. hupehensis in number. The analysis of codon usage and RNA editing showed that high-frequency codons tended to end at A/U bases and RNA editing mainly occurred at the second codon. Comparative genome analysis suggested that the cp genomes of eight Malus species had higher overall similarity, but there were more variation hotspots (rps16_trnK-UUU, trnG-UCC_atpA, atpH_atpF, trnT-GGU_psbD, etc.) in the LSC region. By building evolutionary trees, it can be clearly observed that M. zhaojiaoensis formed a large group with eight species of Malus, but was relatively independent in differentiation. In conclusion, this study provides high-quality chloroplast genome resources of M. zhaojiaoensis and discusses the genetic variation characteristics of Malus genus. The findings of this study will provide a good reference for plastid genome assembly and interspecific comparison in the future.


Introduction
The genus Malus is rich in species and widely distributed and cultivated throughout the world [1]. According to the principle of ecological geography, Malus genus can be divided into two groups: cultivated species and wild species [2]. Among them, cultivated species are greatly affected by human activities and are mainly used for fresh food and processing. Of course, some species are also used as ornamental plants [3]. Wild species exist in natural distribution areas and are often used as rootstock resources and breeding materials in production due to their good adaptability of introduction and affinity for grafting [4]. As a wild species of Malus native to China and widely distributed in northeast, northern and southwest China, M. baccata (L.) Borkh. has diverse populations and variation types [5], which is of great value in producing activity and scientific research. M. zhaojiaoensis Jiang is a new species of Malus found in Zhaojue Xian, Sichuan Province [6]. According to the report [6], M. zhaojiaoensis has oval leaves, reddish flowers, and small fruits, which are of great ornamental value. It has similar characteristics with M. baccata and M. rockii [7], but its evolutionary status and genetic relationship is still ambiguous and needs further research.
There are many methods to study plant genetic evolution, including morphological comparison, biochemical methods (isozymes, etc.) and a wide range of molecular markers (Restriction Fragment Length Polymorphism, Amplified Fragment Length Polymorphism, Random Amplified Polymorphism DNA, Simple Sequence Repeat, Single Nucleotide Polymorphism) [8]. In recent years, with the emergence of high-throughput sequencing technology and the development of bioinformatics based on multiple omics, methods such as resequencing, comparative genomics, and chloroplast genome assembly provide new and efficient solutions for species classification and variation detection [9]. Chloroplast DNA exists in plant chloroplasts in a double chain ring shape, and most plant chloroplast genomes consist of four typical parts, namely a large single copy region (LSC), a small single copy region (SSC), and two inverted repeat regions (IRA and IRB) [10]. Compared with nuclear genomes, chloroplast genomes are relatively conservative and small, so they play an irreplaceable role in species identification [11].
Up to now, there have been numerous studies on chloroplast genomes of Malus plants, such as M. toringoides [12], M. kansuensis [13], M. sieboldii [14], M. sylvestris [15], M. hupehensis [16], and M. prattii [17]. This has played a positive role in the utilization and protection of Malus resources. In this paper, the complete chloroplast genome of M. zhaojiaoensis was assembled based on second-generation sequencing data and bioinformatics methods, and its basic characteristics, sequence similarity, and evolutionary relationship were analyzed and compared. The above results can provide reference for explaining the genetic variation characteristics of M. zhaojiaoensis and Malus. At the same time, the findings of this paper will play a positive role in the conservation and development of wild species such as M. zhaojiaoensis.

Sample Collection and DNA Sequencing
Young, fresh and clean leaves of M. zhaojiaoensis Jiang tree were collected from Fruit Research Institute, Chinese Academy of Agricultural Sciences (Xingcheng city, Liaoning Province, China). Total DNA was extracted by improved CTAB (Cetyltrimethylammonium Bromide) method [18]. Construction of genomic DNA library and DNA sequencing were performed using Illumina HiSeq X Ten system [19]. 150 bp paired-end sequences were sequenced in Illumina platform, and insert size was 350 bp. Adapter sequences were removed by Trimmomatic v0.39 software and adapter-trimmed raw reads were used for subsequent assembly analysis [20].

Chloroplast Genome Assembly and Annotation
GetOrganelle v1.7.5.0 [21] and NOVOplasty4.2 [22] were used for chloroplast genome assembly. In GetOrganelle, the values of kmers were set to 21, 45, 65, 85, 105. The complete chloroplast genome sequence of M. zhaojiaoensis was obtained after the assembly results were confirmed. After that, the sequence was submitted to PGA (Plastid Genome Annotator) program [23] for gene annotation, and the annotation information of CP-GAVAS [24] and GeSeq [25] was also used for integration and merger. Chloroplast genome sequence and annotation file were manually revised and submitted to NCBI website (https://www.ncbi.nlm.nih.gov/, accessed on 20 February 2022). Its accession number was OM793283. OrganellarGenomeDRAW [26] was used to graphically display the annotations of cp genome, and different genes were distinguished by different colors.

Analysis of Codon Usage Characteristics and RNA Editing Sites
All the coding sequences were extracted based on the chloroplast genome annotation of M. zhaojiaoensis. The calculation and comparison of RSCU (relative synonymous codon usage) value was operated in CodonW v1.4.2 (http://codonw.sourceforge.net/, accessed on 12 February 2022) programme. PREP-Cp (http://prep.unl.edu/, accessed on 16 February 2022) was used to predict potential RNA editing sites in coding sequences [28].  [29]. The sequences of chloroplast genomes were aligned in VISTA (https://genome.lbl.gov/vista/index.shtml, accessed on 16 February 2022); Shuffle-LAGAN was designated as alignment program [30]. IRscope was used to visualize the junction sites of the chloroplast genomes [31]. Phylogenetic and cluster analysis was performed in HomBlocks (https://github.com/fenghen360/HomBlocks, accessed on 18 February 2022) [32] and MEGA tool (https://www.megasoftware.net/, accessed on 18 February 2022) [33], and C. hupehensis (Crataegus) was selected as outgroup. Two types of evolutionary trees (Neighbour-Joining tree and Maximum Likelihood tree) were constructed in MEGA X, both of which were examined with bootstrap method (1000 replications). In addition, ClustalW (Codons) alignment method and Jukes-Cantor substitution model were selected for NJ evolutionary trees based on single-copy genes (matK and rbcL).  Figure 1). By statistical calculation, GC content in IR regions (42.7%) was higher than that in the whole cp genome (36.6%) of M. zhaojiaoensis, indicating that the two inverted repeat regions (IRA and IRB) were relatively stable.

Interspersed Nuclear Elements
Interspersed repeats are another important repeating element in the genome and there are abundant INEs (interspersed nuclear elements) in the chloroplast genome of plants. Four types (forward_direct, reverse, complement, palindromic) of INEs were calculated in REPuter. In the cp genome of M. zhaojiaoensis, the INEs contained 21 forward repeats, 20 palindromic repeats, 1 reverse repeat, and 1 complement repeat ( Figure 2C, Table S2). The M. baccata cp genome had 27 F, 19 P, 6 R, and 1 C type repeats, while there were 29 forward, 23 palindromic, 5 reverse, and 0 complement repeats in M. hupehensis ( Figure 2C). In general, the number of INEs in M. zhaojiaoensis (43) was significantly less than that in M. baccata and M. hupehensis, and the total number of INEs in M. baccata (53) and M. hupehensis (57) was close.
In addition, from the perspective of repeated fragment size, it can be found that most of the sequence lengths were concentrated in the range of 30-50 bp, especially 30-40 bp. There was one INE repeat with a sequence length of more than 50 bp in each of the chloroplast genomes of three Malus species ( Figure 2D).

Relative Synonymous Codon Usage
The relative synonymous codon usage (RSCU) can represent the relative probability of using synonymous codons and reflect the usage bias of different codons [34]. The total number and relative frequency of all codons were obtained in CodonW software. The results showed that leucine was the most frequent amino acid in the chloroplast coding sequence of M. zhaojiaoensis, followed by isoleucine, serine, glycine and arginine. In leucine, codon UUA was widely used. In alanine, the codon GCU was the most frequently used. The arginine bias uses AGA as the codon, and the termination codon bias ends with UAA ( Figure 3). There were 30 kinds of codons whose RSCU value was greater than 1, including GCA (Ala), CCA (Pro), UCA (Ser), AGU (Ser), ACA (Thr), UUG (Leu), CUU (Leu), CGU (Arg), UUU (Phe), GGU (Gly), etc. A total of 29 of the 30 codons mentioned above ends with an A/U base, indicating that the third bases of the high-frequency codons are biased toward A/U. In addition, the usage of codons such as AGC, CUG, GAC, UAC and CUC in the cp genome of M. zhaojiaoensis was relatively low.

RNA Editing Sites
RNA editing is a condition in which post-transcribed coding sequence undergoes base conversion that results in changes in amino acids. RNA editing also occurs in plant chloroplast encoded proteins, which can participate in the expression and regulation of genes, and affect and change gene function [35]. In this study, all potential RNA editing sites in the coding sequences of M. zhaojiaoensis cp were predicted. The results indicated that a total of 63 RNA editing sites were distributed on 25 genes. These genes mainly included ndh (ndhA, ndhB, ndhD, ndhF, ndhG), matK, rpo (rpoA, rpoB, rpoC1, rpoC2), accD, atp (atpA, atpB, atpF, atpI), pet (petB, petG), rps (rps2, rps14, rps16), etc. ndhB had the most editing loci with 12, followed by ndhD (8), matK (5), ndhF (4), rpoB (4) and rpoC2 (4) (Figure 4). In terms of the location of RNA editing sites, 48 sites were located at the second base of the codon, accounting for 76.19% of all editing sites (Table S3). In conclusion, in the chloroplast coding genes of M. zhaojiaoensis, RNA editing mainly occurred at the second codon. Interestingly, in the CDS of M. zhaojiaoensis cp, RNA editing was the most likely to lead to transitions from Serine (S) to Leucine (L), followed by Proline (P) to Leucine (L), and Serine (S) to Phenylalanine (F) ( Table S3).

Alignment of Chloroplast Genomes in Malus Species
In order to compare the sequence similarities and differences of chloroplast genomes of different Malus genus, M. zhaojiaoensis and three Malus species were used for global alignment ( Figure 5). The chloroplast sequences of M. zhaojiaoensis showed good collinearity with three Malus plants. In contrast, the similarity of M. zhaojiaoensis to M. baccata and M. hupehensis was higher, but the similarity to M. yunnanensis was slightly lower. To find the variation hotspot regions of cp genomes, eight species of Malus (including M. zhaojiaoensis) were used for comparative analysis where M. hupehensis (MK020147) was a reference. The results showed that most of the differences were located in the LSC region of chloroplast genome, and the other regions were highly similar. It can be clearly observed from Figure 6, that there are large variations in rps16_trnK-UUU, trnG-UCC_atpA, atpH_atpF, trnT-GGU_psbD, psbZ_trnfM-CAU, trnV-UAC_ndhC, accD_psaI, rps3_rpl16. In general, the hotspots were more distributed in non-coding regions, while gene regions were relatively stable and conserved.

Comparison of Chloroplast Genome Boundaries and Junction Sites
JLB (LSC-IRB), JSB (SSC-IRB), JSA (SSC-IRA), JLA (LSC-IRA) are the junction sites on the boundaries of the four regions of the chloroplast genome, which are of great significance in cp genome evolution [36]. In this study, cp genome boundaries of eight Malus species were compared in IRscope. As shown in Figure 7, M. prattii has the largest cp genome with 160,239 bp, while M. zhaojiaoensis cp genome has the smallest sequence length of 159,998 bp. It can be found that the length of LSC region varies from 88,070 to 88,355 bp in eight cp genomes. In contrast, the sequence length of SSC and two IR regions varies little.
For rps19 gene at the JLB boundary, its location distribution (159 bp in LSC and 120 bp in IRB) was consistent in seven Malus species, but the length on the LSC side_rps19 (210 bp) was longer in M. yunnanensis, and sequence length on the IRB side_rps19 decreased (69 bp). At the JSB site of chloroplast genome of M. prunifolia and M. baccata, the ycf1 gene expanded to the SSC region, resulting in the increased length of ycf1 gene. The ycf1 gene of IRB/SSC locus did not cross the JSB boundary in M. toringoides, M. sieboldii, M. prattii and M. hupehensis. In addition, the displacement of trnH gene at JLA locus also occurred in different species (Figure 7). The above results prove that boundary region expansion and contraction play a role in chloroplast genome size and construction.

Phylogenetic Relationships Based on Chloroplast Genomes
Comparison of chloroplast genome sequences can provide a basis for explaining evolutionary relationships of species. In order to study the evolutionary status of M. zhaojiaoensis, chloroplast genomes of 10 species were downloaded from NCBI, including one outgroup (C. hupehensis, Crataegus). Based on the comparison of 98,748 locations in the dataset, NJ and ML trees for 11 species were constructed. Because the two types of trees were detected by bootstrap method, and their topological structures were very consistent in the results, the evolutionary branches were trustworthy. As can be seen from Figure 8, 11 species were divided into two groups according to genus, and Malus plants were grouped into one group. In the Malus genus, M. yunnanensis was an independent clade due to sequence differences, while M. zhaojiaoensis and other eight species form a large clade. The figures in the evolutionary branching shown high support for this result (Figure 8). matK and rbcL, as single-copy genes in chloroplast genome, evolved at a moderate rate and were both suitable for serving as barcodes for plant classification [37]. Hence, evolutionary relationships were characterized based on single-copy chloroplast genes matK and rbcL from 11 species (Figure 9). The results showed that M. zhaojiaoensis and eight species of Malus were grouped together, and the clades formed by the two genes were slightly different.

Discussion
With the development of high-throughput sequencing, more and more chloroplast genomes have been released [38]. Chloroplast genome is different from nuclear genome in that it has the characteristics of maternal inheritance. Because of its small sequence length and moderate base replacement, chloroplast genome has been widely used in the study of genetic variation and phylogeny [39].
In this study, the complete chloroplast genome of M. zhaojiaoensis was constructed by whole genome sequencing. Its cp size (159,998 bp) was similar to that of reported Malus species, including M. baccata (160,024 bp), M. hupehensis (160,065 bp), M. sieboldii (160,040 bp) and M. prunifolia (160,041 bp), etc. This indicates that the length of interspecific chloroplast sequence within Malus genus was relatively conserved and the variation was relatively moderate [40]. Further, by annotating the cp sequence of M. zhaojiaoensis, it was found that the ring-structured genome consists of four parts, namely LSC, SSC, IRA and IRB. Comparative analysis of different species showed that the LSC regions of the Malus cp genomes are highly variable, which contain many hot spots, for example, trnG-UCC_atpA and trnT-GGU_psbD. The proportion of SSR (75.95%) and INE (60.47%) repeat markers in the LSC region of M. zhaojiaoensis was larger than that of the other three regions, which also proved the above inference.
For chloroplast genes, 78 CDS, 29 tRNA, and 4 rRNA were annotated in M. zhaojiaoensis. But there were 111 genes in M. baccata cp genome, which included 76 protein coding genes, 31 tRNA genes and 4 rRNA genes. In addition, the number of tRNA in chloroplast of M. hupehensis was one more than that of M. zhaojiaoensis. One of the most important factors contributing to the difference in gene numbers between species may be the expansion and contraction of chloroplast genome boundaries [41]. For example, comparative analysis showed that the rps19 gene at the JLA junction site was missing in three Malus plants (M. zhaojiaoensis, M. prunifolia and M. yunnanensis).
Chloroplast coding sequences were evolutionarily conserved due to low variation. The codons of chloroplast CDS of M. zhaojiaoensis were calculated and the results suggested that the high frequency codons were mainly UUA (Leu), GCU (Ala), AGA (Arg) and so on. In Ocotea aciphylla, there were 30 high-frequency codons whose RSCU value was greater than 1, including AGA (Arg), GGA (Gly) and UCA (Ser), etc. [42]. In M. zhaojiaoensis cp, 29 of the 30 high-frequency codons ended with an A/U base, which reflects the biased use of codons. The similar phenomenon has been found in previous reports [43]. RNA editing was widely used in higher plants and was also an important means of regulating chloroplast gene expression. A total of 63 loci were identified in the coding sequences of M. zhaojiaoensis cp and distributed in 25 genes. ndh (Subunits of NADH-dehydrogenase) genes had the most potential RNA editing sites (27), followed by rpo (DNA dependent RNA polymerase) genes (10 sites). In addition, the predicted results suggested that RNA editing tends to lead to an increase in hydrophobic amino acids (Leucine, Phenylalanine), which is consistent with the study of Platanthera ussuriensis [44]. In Dipterygium glaucum and Cleome chrysantha cp, the amino acids Serine to Leucine were the majority of the conversion in RNA editing [45], which also supports the above speculation.
Chloroplast genome plays an irreplaceable role in species evolution and classification [46]. The information of chloroplast genome can be used to explain the influence of maternal inheritance [47]. The comparison of genetic evolution in Malus species has been a long-standing concern of researchers [48]. In this study, based on chloroplast genome and two single copy genes, the evolutionary relationship of Malus species was analyzed. The branches and topologies of evolutionary trees were consistent in different methods and datasets. Based on the above results, it can be clearly inferred that M. yunnanensis was located in the base group of Malus [49], and M. zhaojiaoensis formed a large group with 8 species of Malus, but was relatively independent in differentiation. In addition, there was further differentiation between M. toringoides, M. halliana and M. hupehensis, while the other five Malus species also closely influenced each other. The findings of this paper are consistent with the previous conjecture [6], indicating that M. zhaojiaoensis is relatively primitive in evolution.
Due to unique geographical conditions and climate distribution, China is rich in wild plant types and contributes significantly to the world's biodiversity. M. zhaojiaoensis is a wild Malus species of China. The exploration of biological classification and genetic relationships of M. zhaojiaoensis can speed up the process of conservation and utilization, which provides a good reference for the management and development of wild resources.

Conclusions
In this study, the complete chloroplast genome of M. zhaojiaoensis was sequenced and assembled. Based on the annotation of the cp genome, the structure and sequence characteristics of the genome were analyzed. Prediction of repeat sequence types and RNA editing sites provides potential genetic markers and regulatory loci. Hot spots, boundary features, and evolutionary relationships of M. zhaojiaoensis and Malus were described by comparative genomics and phylogenetic analysis. The release of chloroplast genome of M. zhaojiaoensis can provide valuable support for genetic variation and germplasm identification of Malus species. The research in this paper is helpful to the conservation and development of wild Malus resources.   Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All data generated during this study are included in this published article.