Sequencing and Phylogenetic Analysis of the Chloroplast Genome of Three Apricot Species

The production and quality of apricots in China are currently constrained by the limited availability of germplasm resource characterizations, including identification at the species and cultivar levels. To help address this issue, the complete chloroplast genomes of Prunus armeniaca L., P. sibirica L., and kernel consumption apricot were sequenced, characterized, and phylogenetically analyzed. The three chloroplast (cp) genomes ranged from 157,951 to 158,224 bp, and 131 genes were identified, including 86 protein-coding genes, 37 tRNAs, and 8 rRNAs. The GC content ranged from 36.70% to 36.75%. Of the 170 repetitive sequences detected, 42 were shared by all three species, and 53–57 simple sequence repeats were detected with AT base preferences. Comparative genomic analysis revealed high similarity in overall structure and gene content as well as seven variation hotspot regions: psbA-trnK-UUU, rpoC1-rpoB, rpl32-trnL-UAG, trnK-rps16, ndhG-ndhI, ccsA-ndhD, and ndhF-trnL. Phylogenetic analysis showed that the three apricot species clustered into one group, and the genetic relationship between P. armeniaca and kernel consumption apricot was the closest. The results of this study provide a theoretical basis for further research on the genetic diversity of apricots and for the development and utilization of molecular markers for the genetic engineering and breeding of apricots.


Introduction
Apricot is a deciduous tree species belonging to section Armeniaca (Lam.) Koch of the genus Prunus in the family Rosaceae (2n = 16) [1,2]. Ten species of apricot have been identified, and they are widely distributed worldwide: P. armeniaca, P. sibirica, P. mandshurica, P. zhengheensis, P. dasycarpa, P. holosericea, P. zhidanensis, P. mume, P. limeixing, and P. brigantina. Four distinct species are most commonly recognized: P. armeniaca, P. mandshurica, P. sibirica, and P. mume [3]. Apricot is of Chinese origin and has been cultivated in China for more than 3000 years [4]. This study focused on P. armeniaca, P. sibirica, and kernel consumption apricots, which are widely distributed in northern China [5]. The fruits of P. armeniaca can be eaten fresh and have a unique aroma and delicious taste; they also contain a variety of organic components, vitamins, and inorganic salts, which gives them high nutritional value. They have a wide range of uses and can be processed into dried apricots. P. armeniaca is cultivated around the world and accounts for a high proportion of global fruit production [6,7]. P. sibirica has high ecological value as a pioneer tree species for vegetation restoration because it tolerates cold, drought, and poor soil. In addition, the bitter kernel of P. sibirica has an amygdalin content of 3.5–7.6% and is rich in vitamins, selenium, calcium, phosphorus, iron, potassium, and other nutrients. Thus, this kernel is a raw material in traditional Chinese medicine [8]. Kernel consumption apricots are characterized by large, sweet kernels. The kernels have a crude fat content of ~60% and can thus be used to produce kernel oil, and they have a protein content of ~30%, including the eight amino acids essential for the human body. Thus, the kernels are a high-quality plant protein raw material [9].
Chloroplasts are unique endosymbiotic organelles found in plants and photosynthetic algae [10], serving as the primary site of photosynthesis and supplying energy for plant growth and development and carbon intermediates for a number of critical metabolic reactions. In addition, chloroplasts play an important role in plant response to light, heat, drought, salt, and other stresses [11]. The chloroplast (cp) genome is maternally inherited in most angiosperms or paternally inherited in some gymnosperms. The sequence analysis of the double-stranded circular DNA cp genome is important in various areas of study, including the development of linked molecular markers, the reconstruction of phylogenetic relationships, and the genetic engineering and breeding of plants [12]. The cp genome exhibits a highly conserved organization composed of a pair of inverted repeats (IRs), a large single-copy (LSC) region, and a small single-copy (SSC) region [13,14]. The two IR regions, which are separated by the LSC and SSC regions, are equal in length and opposite in direction; however, variations have been observed in some plants, mainly presented as IR loss, contraction, expansion, and sequence direction changes [15]. Early research studies on the cp genome focused mainly on understanding the evolutionary history of chloroplasts and safeguarding uncommon and endangered plants [16]. With improvements in sequencing technology, the complete cp genome of Nicotiana tabacum was obtained for the first time in 1986 [17]. More recently, the cp genomes of Prunus cerasus (sour cherry) [18], P. phaeosticta (dark-spotted cherry) [19], P. kansuensis (Chinese bush peach) [20], and P. japonica (Japanese bush cherry) [21] were sequenced and analyzed, and their phylogenetic positions and genetic relationships were determined.
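The statement that the two IRs are "equal in length and opposite in direction" can be illustrated with a minimal Python sketch. It assumes a linearized genome written in the order LSC, IRb, SSC, IRa; the helper names and the toy sequence are hypothetical and are not taken from the apricot data.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def split_quadripartite(genome: str, lsc_len: int, ir_len: int, ssc_len: int):
    """Split a linearized cp genome into (LSC, IRb, SSC, IRa).

    Assumes the order LSC | IRb | SSC | IRa; region lengths are supplied
    by the caller (e.g., from an annotation), not inferred here.
    """
    irb_start = lsc_len
    ssc_start = irb_start + ir_len
    ira_start = ssc_start + ssc_len
    return (
        genome[:irb_start],
        genome[irb_start:ssc_start],
        genome[ssc_start:ira_start],
        genome[ira_start:],
    )


# Toy genome (hypothetical): IRa is built as the reverse complement of IRb.
toy_lsc, toy_ir, toy_ssc = "ATGGCATTTACG", "GGATCCTA", "TTAACG"
toy_genome = toy_lsc + toy_ir + toy_ssc + reverse_complement(toy_ir)

lsc, irb, ssc, ira = split_quadripartite(
    toy_genome, len(toy_lsc), len(toy_ir), len(toy_ssc)
)
# Equal length and opposite direction: IRa equals the reverse complement of IRb.
irs_are_inverted = len(ira) == len(irb) and ira == reverse_complement(irb)
print(irs_are_inverted)
```

IR contraction or expansion would show up in such a check as a change in the IR length relative to the flanking single-copy regions.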
Morphological characteristic analyses can preliminarily reveal the morphological characteristics and genetic variations of plants [22]. However, the morphological characteristics of apricot are influenced by the environment and gene dominance, and the period required to obtain morphological characteristics is long [23]. With the rapid development of next-generation sequencing technology and phylogenetic genomics, cp genome sequencing has been widely used in molecular evolution and phylogenetic studies of many plant species [24]. More accurate classifications and phylogenetic relationships of apricot can be obtained through the combination of cp genome sequencing and phylogenetic genomics.
In this study, the cp genomes of P. armeniaca, P. sibirica, and kernel consumption apricot were sequenced, and the sequences were then assembled, annotated, and compared. A phylogenetic analysis was performed, and the evolutionary relationship between P. armeniaca, P. sibirica, and kernel consumption apricot was systematically studied at the cp genome sequence level. The results provide a reference for future taxonomic and phylogenetic analyses and molecular marker development of apricot and a molecular guide for genetic engineering and breeding.

Sample Material Collection, DNA Extraction, and Sequencing
Fresh tender leaves were collected from the cultivated P. armeniaca variety Sungold in Mentougou, Beijing; the wild P. sibirica resource F106 in Wanjiagou, Inner Mongolia; and the cultivated kernel consumption apricot variety Youyi in Wei County, Hebei. The samples were stored at −80 °C.
Total genomic DNA was extracted using a Plant Genomic DNA Kit (Tiangen, Beijing, China). DNA quality and quantity were assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and 0.8% agarose gel electrophoresis, and the DNA was further fragmented for sequence library preparation following fragment purification and end repair. After testing the library preparation, DNA sequencing was performed using the Illumina HiSeq X high-throughput platform (Illumina, San Diego, CA, USA). Library preparation and sequencing were performed by BGI Genomics (Shenzhen, China).

Chloroplast Genome Assembly and Annotation
Using SOAPdenovo (http://soap.genomics.org.cn/soapdenovo.html, accessed on 5 September 2016), the reads were mapped to the cp genome of P. persica, which was downloaded from GenBank (NC_014697.1). Contigs obtained by de novo assembly were mapped to the consensus sequence obtained using the reference genome to check for errors or ambiguities resulting from either assembly method. GapCloser (https://sourceforge.net/projects/soapdenovo2/files/GapCloser/, accessed on 7 September 2016) was used to close the gaps between long contigs to obtain a complete cp genome. The three apricot cp genome sequences were preliminarily annotated using DOGMA and CpGAVAS (http://phylocluster, accessed on 11 September 2016), and the annotation was completed by manually adjusting the start and stop codons of individual genes. Geneious was used for manual corrections. Finally, the annotated cp genomes of the three apricot species were submitted to GenomeVx (http://wolfe.gen.tcd.ie/GenomeVx/, accessed on 10 March 2023) to complete the physical mapping.

Chloroplast Genome Comparison and Analysis of Variations in IR/SC Boundaries
mVISTA (http://genome.lbl.gov/vista/index.shtml, accessed on 25 March 2023) was used to compare similarities and variations in the cp genomes among P. armeniaca, P. sibirica, kernel consumption apricot, P. mume, P. pyrifolia, and N. tabacum, with the P. persica cp genome sequence serving as the reference sequence. Based on the annotation information of the cp genome, the LSC, SSC, and IR boundary sequences in the apricot cp genomes were compared with those in the P. mume, P. persica, P. pyrifolia, and N. tabacum cp genomes. The IR-SC boundary of the cp genome was visualized using IRscope.
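As an illustration of what an IR/SC boundary comparison measures, the following Python sketch derives the four junction coordinates from region lengths and reports how many bases of a gene extend past a junction. It is only a sketch under stated assumptions: the region lengths and gene coordinates are hypothetical placeholders rather than values from the apricot genomes, the region order LSC, IRb, SSC, IRa is assumed, and the function names are invented for this example; it does not reproduce how IRscope works internally.

```python
from typing import NamedTuple


class Junctions(NamedTuple):
    """1-based coordinate of the last base before each junction."""
    jlb: int  # LSC/IRb
    jsb: int  # IRb/SSC
    jsa: int  # SSC/IRa
    jla: int  # IRa/LSC (end of the linearized circular genome)


def junction_positions(lsc_len: int, ir_len: int, ssc_len: int) -> Junctions:
    """Compute junction coordinates assuming the order LSC | IRb | SSC | IRa."""
    jlb = lsc_len
    jsb = jlb + ir_len
    jsa = jsb + ssc_len
    jla = jsa + ir_len
    return Junctions(jlb, jsb, jsa, jla)


def spans_junction(start: int, end: int, junction: int) -> bool:
    """True if a gene (1-based, inclusive) has bases on both sides of a junction."""
    return start <= junction < end


def bases_beyond(start: int, end: int, junction: int) -> int:
    """Number of gene bases located downstream of the junction (0 if none span it)."""
    return end - junction if spans_junction(start, end, junction) else 0


# Hypothetical region lengths (bp) and gene coordinates, for illustration only.
j = junction_positions(lsc_len=86_000, ir_len=26_400, ssc_len=19_000)
print(j)
print(bases_beyond(start=85_850, end=86_128, junction=j.jlb))
```

Comparing such per-gene extension lengths across species is one simple way to express IR contraction and expansion numerically.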

Simple Sequence Repeat (SSR) Analysis
The three apricot cp genomes contained three forms of SSRs: mononucleotide, dinucleotide, and compound (Figure 3a). There were 53 SSRs in P. sibirica, 56 in P. armeniaca, and 57 in kernel consumption apricot. Mononucleotide repeats (ranging from 84.91% in P. sibirica to 94.64% in P. armeniaca) occurred most frequently, followed by dinucleotide (ranging from 3.57% in P. armeniaca to 10.53% in kernel consumption apricot) and compound SSRs (ranging from 1.76% in P. armeniaca to 3.77% in P. sibirica). The number of A/T mononucleotide repeats (ranging from 78.95% in kernel consumption apricot to 81.13% in P. sibirica) was greater than that of C/G repeats (ranging from 8.77% in kernel consumption apricot to 9.43% in P. sibirica). The quantity of dinucleotide repeats, including AT/TA repeats, ranged from 5.66% in P. sibirica to 8.93% in P. armeniaca (Supplemental Table S4). We further analyzed SSR distribution and found that most were distributed in the LSC region (83.93–85.96%); far fewer were in the SSC (10.53–12.50%) and IR (3.51–3.77%) regions (Figure 3b and Supplemental Table S4).
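To make the SSR categories concrete, here is a minimal Python sketch that scans a sequence for mononucleotide and dinucleotide repeats with regular expressions. The minimum repeat counts (10 for mononucleotides, 5 for dinucleotides) are assumptions chosen for illustration, since the detection thresholds and software are not stated in this section; compound SSRs are not handled, and the toy sequence is hypothetical.

```python
import re

# Assumed minimum number of motif copies per motif length (not from the source).
MIN_REPEATS = {1: 10, 2: 5}


def find_ssrs(seq: str):
    """Return sorted (start, end, motif, copies) tuples, 0-based half-open."""
    seq = seq.upper()
    hits = []
    for motif_len, min_copies in MIN_REPEATS.items():
        # One captured motif followed by at least (min_copies - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_copies - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip e.g. "AA" repeats, which are really mononucleotide runs.
            if motif_len > 1 and len(set(motif)) == 1:
                continue
            copies = (match.end() - match.start()) // motif_len
            hits.append((match.start(), match.end(), motif, copies))
    return sorted(hits)


# Toy sequence (hypothetical): poly-A, an AT dinucleotide repeat, and poly-T.
toy = "GGC" + "A" * 12 + "CGC" + "AT" * 6 + "GCC" + "T" * 10 + "G"
ssrs = find_ssrs(toy)
for start, end, motif, copies in ssrs:
    print(f"{motif} x{copies} at {start}-{end}")

# Share of A/T mononucleotide SSRs among all detected SSRs, as a percentage.
at_mono = sum(1 for _, _, motif, _ in ssrs if motif in ("A", "T"))
print(round(100 * at_mono / len(ssrs), 2))
```

Percentages such as those reported above are then simply the count in each class divided by the total number of SSRs in a genome.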

Comparative Analysis of Apricot Chloroplast Genomes
A comparison of the three apricot cp genomes indicated that the coding region is relatively conserved, with variations mainly occurring in the intergenic and intron regions. Intergenic spacer regions involving the psbA-trnK-UUU, rpoC1-rpoB, rpl32-trnL-UAG, trnK-rps16, ndhG-ndhI, ccsA-ndhD, and ndhF-trnL genes are hotspots for apricot cp genome variation (Figure 4). These hotspots can provide vital sequence information for the design of screening DNA barcodes and phylogenetic analyses of apricot species.

Analysis of Variations in the IR/SC Boundaries
We compared the cp genome IR regions of P. armeniaca, P. sibirica, and kernel consumption apricot with those of P. mume, P. persica, P. pyrifolia, and N. tabacum (Figure 5). The rps19 gene was detected at the IRb/LSC boundary in P. armeniaca, P. sibirica, kernel consumption apricot, P. mume, P. persica, and P. pyrifolia. The fragment size in the IRb region was 120–197 bp. In contrast, there was no pseudogene of rps19 in the IRb/LSC boundary region in N. tabacum. The ycf1 gene was found at the IRa/SSC boundary of P. armeniaca, P. sibirica, kernel consumption apricot, P. mume, P. persica, P. pyrifolia, and N. tabacum. The fragment size of Ψycf1 in the IRa region was 996–1073 bp. The Ψycf1 pseudogene in the IRb/SSC region of P. armeniaca, P. sibirica, kernel consumption apricot, P. mume, P. persica, and P. pyrifolia overlapped with ndhF to different extents, whereas ycf1 of N. tabacum did not overlap with ndhF.

Phylogenetic Analysis
According to the constructed phylogenetic tree, the support rate for each branch was high (>80%), and the support rate for 21 of the 27 nodes was >90% (Figure 6). The 29 species analyzed were divided into seven groups: EUROSIDS I, EUROSIDS II, EUASTERIDS II, EUASTERIDS I, basal angiosperms, monocots, and gymnosperms. Rosales, Cucurbitales, Fabales, and Malpighiales were clustered together to form EUROSIDS I, and other plants were clustered together in turn. Gymnosperms and basal angiosperms were clearly clustered into one branch. In the phylogenetic tree, P. armeniaca, P. sibirica, and kernel consumption apricot clustered in the Rosales order within EUROSIDS I. The analysis shows that the relationship between P. armeniaca and kernel consumption apricot is the closest.

Discussion
In this study, the cp genomes of three apricot species were successfully sequenced, assembled, analyzed, and compared. The three cp genomes had similar structural characteristics and exhibited the typical quadripartite (tetrad) structure [25,26]. The total length of the cp genomes ranged from 157,951 to 158,224 bp. The number of genes was consistent, with 131 genes each, including 86 protein-coding genes, 37 tRNAs, and 8 rRNAs, which is consistent with the previously reported cp genomes of P. mume [27], P. persica [26], and P. pyrifolia [28]. The GC content of the three apricot cp genomes was also similar (36.70–36.75%), and the GC content in the IR region was the highest (42.57–42.59%) owing to the GC-rich rRNA and tRNA genes in the IR region.
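The GC content figures quoted here are simple base-composition ratios. A minimal Python sketch of the calculation follows; the function name and the toy region sequences are hypothetical, and ambiguous bases such as N are excluded from the denominator as an assumption of this example.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


# Toy regions (hypothetical sequences, not apricot data).
regions = {
    "LSC": "ATATTTAGCAATTA",
    "SSC": "TTAATAGCATAT",
    "IR": "GGCGATCCGCTA",
}
for name, seq in regions.items():
    print(f"{name}: {gc_content(seq):.2f}% GC")

# Whole-genome value is computed on the concatenated sequence,
# not as the mean of the per-region percentages.
print(f"{gc_content(''.join(regions.values())):.2f}% GC overall")
```

Because the regions differ greatly in length, the genome-wide GC content is dominated by the LSC region even when the IRs are markedly more GC-rich.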
Repeat sequences play an important evolutionary role in the cp genome by promoting cp genome rearrangement, inducing genomic structural changes, and increasing population genetic diversity [29,30]. There were 170 repeats in the cp genomes, among which F, R, and C repeats were the main repetitive sequences (97.65%). Different abundances of palindromic repeats in the cp genomes may provide additional evolutionary information, as the presence and abundance of repeats in the cp genome may contain phylogenetic signals [31]. The total number and proportion of repeat types in the three apricot species showed a similar pattern, suggesting a similar evolutionary history and closer affinities among these species. SSRs related to genome rearrangement and recombination are widely distributed in the cp genome, and they are prone to slippage during DNA replication, which leads to rich polymorphisms that provide information for marker development in population genetics and evolutionary research [24]. SSRs in plant chloroplast genomes are mainly mononucleotide and dinucleotide repeats, and trinucleotide to hexanucleotide repeats are relatively less common [32]. The SSRs found in the three apricot species were dominated by mononucleotide repeats, especially poly (A/T), which is similar to the pattern in other Rosaceae species [33]. The repeat sequences and SSRs detected in this study can provide useful information for future research on the evolution of apricot species.
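The repeat classes mentioned above can be made explicit with a small Python sketch. It assumes the conventional meanings of the abbreviations (F = forward, R = reverse, C = complement, P = palindromic, i.e., reverse complement), which the text does not spell out, and it only classifies a given pair of equal-length sequences; it is not a repeat finder, and the function name and example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def classify_repeat_pair(first: str, second: str) -> list:
    """Return the repeat classes (F, R, C, P) relating two sequence copies."""
    first, second = first.upper(), second.upper()
    complement = first.translate(COMPLEMENT)
    classes = []
    if second == first:
        classes.append("F")  # forward: identical copy in the same orientation
    if second == first[::-1]:
        classes.append("R")  # reverse: copy read backwards
    if second == complement:
        classes.append("C")  # complement: complementary bases, same order
    if second == complement[::-1]:
        classes.append("P")  # palindromic: reverse complement
    return classes


unit = "AAGCT"
for other in ("AAGCT", "TCGAA", "TTCGA", "AGCTT"):
    print(unit, other, classify_repeat_pair(unit, other))
```

Tallying such classes per genome gives repeat-type frequencies of the kind summarized in Figure 2a.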
Previous studies have shown that changes in genome size mainly occur in the SSC and LSC regions, whereas the IR regions are highly conserved [34,35]. Variation in the non-coding regions was significantly higher than that in the coding regions, which are under stronger selection pressure [36]. In particular, the intergenic regions psbA-trnK-UUU, rpoC1-rpoB, rpl32-trnL-UAG, trnK-rps16, ndhG-ndhI, ccsA-ndhD, and ndhF-trnL are highly variable and can be used as DNA barcodes for future phylogenetic analyses of apricots.
Contraction and expansion of the IR regions are the main causes of cp genome evolution, mainly occurring at the rps19, ycf1, trnH-GUG, and ndhF positions [37,38]. Although the gene distribution at the four main regional boundaries in the three apricot cp genomes showed the same pattern, there were differences in the microstructure, especially in the locations of rps19, ycf1, ndhF, and trnH-GUG. The rps19 gene crosses the boundary between the LSC region and the IR region, which is similar to previous observations in P. mume, P. armeniaca, and P. salicina [39]. The differences in the lengths of these four genes in the IR/SC boundary region can be used to identify P. armeniaca, P. sibirica, and kernel consumption apricot.
The cp genome of angiosperms is maternally inherited and constitutes an independent evolutionary system with a moderate evolutionary rate, so it can be used for phylogenetic analyses at each classification level [40]. Especially in the study of the phylogenetic relationships among angiosperms and some controversial species, cp genome analysis is the preferred research method [41]. The results of the phylogenetic analysis showed that the 29 species included in the phylogenetic tree could be divided into 7 groups: EUROSIDS I, EUROSIDS II, EUASTERIDS II, EUASTERIDS I, basal angiosperms, monocots, and gymnosperms. The support rate for each branch was high (>80%), and Rosales, Cucurbitales, Fabales, and Malpighiales were clustered together to form EUROSIDS I, which is consistent with the results of APG III [42]. Phylogenetic tree analysis in our study showed that P. armeniaca, P. sibirica, and kernel consumption apricot all clustered together, which was consistent with the results of traditional morphological analysis [43] and genetic diversity analysis [44]. This study has certain limitations, and further analysis is needed to more clearly reveal the taxonomic status and origin of apricot.

Conclusions
In this study, we sequenced and analyzed the complete cp genomes of three apricots. The cp genomes showed a typical quadripartite (tetrad) structure. Comparative analysis of the cp genomes revealed that the organization and gene order were highly conserved. The cp genome size, GC content, gene number, and gene arrangement order were similar in the three apricot species. We detected abundant long-repeat sequences and SSR loci in the three apricot species. The IR/SC boundary regions were similar but also exhibited some microstructural differences among the three species. Seven highly variable regions were identified in the non-coding regions of the three cp genomes, which can be exploited in the DNA barcoding of apricot. Finally, phylogenetic tree analysis supported a close relationship among the three apricot species. The results are of great significance for studies on the internal structure of the cp genome of apricots and on the breeding, environmental adaptation, and hybrid breeding of apricots.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/genes14101959/s1, Supplemental Table S1: Species included in the phylogenetic analysis; Supplemental Table S2: Intron lengths in the chloroplast genomes of three apricots; Supplemental Table S3: Types of long repetitive sequences in three apricots; Supplemental Table S4: SSR information for three apricots.

Figure 1. Combined gene map of the chloroplast genome of the three apricot species.

Figure 2. Analysis of repeat sequences in the chloroplast genomes of the three apricot species. (a) Frequency of the repeat type. (b) Frequency of repeat sequences by length. (c) Number of common and unique chloroplast genome repeat sequences.

Figure 3. Frequency of SSRs in three apricot species. (a) Number of SSRs by type. (b) Number of SSRs by genome region.

Figure 6. Phylogenetic relationship of the three apricot species reconstructed using the maximum likelihood (ML) method. ★ The star marks the three apricots examined in this study.

Table 1. Chloroplast genome characteristics of three apricot species.

Table 2. Genes present in the chloroplast genomes of the three apricot species.