New Insight into the Phylogeny and Taxonomy of Cultivated and Related Species of Crataegus in China, Based on Complete Chloroplast Genome Sequencing

: Hawthorns ( Crataegus L.) are one of the most important processing and table fruits in China, due to their medicinal properties and health beneﬁts. However, the interspeciﬁc relationships and evolution history of cultivated Crataegus in China remain unclear. Our previously published data showed C. bretschneideri may be derived from the hybridization of C. pinnatiﬁda with C. maximowiczii , and that introgression occurs between C. hupehensis , C. pinnatiﬁda , and C. pinnatiﬁda var. major . In the present study, chloroplast sequences were used to further elucidate the phylogenetic relationships of cultivated Crataegus native to China. The chloroplast genomes of three cultivated species and one related species of Crataegus were sequenced for comparative and phylogenetic analyses. The four chloroplast genomes of Crataegus exhibited typical quadripartite structures and ranged from 159,607 bp ( C. bretschneideri ) to 159,875 bp ( C. maximowiczii ) in length. The plastomes of the four species contained 113 genes consisting of 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes. Six hypervariable regions ( ndhC-trnV(UAC)-trnM(CAU) , ndhA , atpH-atpI , ndhF , trnR(UCU)-atpA , and ndhF-rpl32 ), 196 repeats, and a total of 386 simple sequence repeats were detected as potential variability makers for species identiﬁcation and population genetic studies. In the phylogenomic analyses, we also compared the entire chloroplast genomes of three published Crataegus species: C. hupehensis (MW201730.1), C. pinnatiﬁda (MN102356.1), and C. marshallii (MK920293.1). Our phylogenetic analyses grouped the seven Crataegus taxa into two main clusters. One cluster included C. bretschneideri , C. maximowiczii , and C. marshallii , whereas the other included C. hupehensis , C. pinnatiﬁda , and C. pinnatiﬁda var. major . Taken together, our ﬁndings indicate that C. maximowiczii is the maternal origin of C. bretschneideri . This work provides further evidence of introgression between C. hupehensis , C. pinnatiﬁda , and C. pinnatiﬁda var. major , and suggests that C. pinnatiﬁda var. major might have been artiﬁcially selected and domesticated from hybrid populations, rather than evolved from C. pinnatiﬁda . analyze the whole chloroplast genomes of C. C. C. C. phylogenetic C. microsatellites


Introduction
The plants from genus Crataegus L. (hawthorn), a member of the Rosaceae family, are widely distributed in Eurasia and North America [1]. Hawthorns are one of the most widely consumed horticultural crops in China, either in fresh or processed form, due to their pleasant flavor, attractive color, and rich nutrition [2,3]. In addition, hawthorn is an important raw material for functional foods and has been used as herbal medicines in the

Chloroplast Genome Sequencing, Assembly, Annotation, and Visualization
The genomic DNAs was purified and end-repaired. The PCR products were used to build a 300 bp insert size library using Illumina Nextera XT. The complete chloroplast genomes of the four Crataegus species were sequenced using Illumina high-throughput sequencing platform (HiSeq 4000). After sequencing, the raw reads were assembled into whole chloroplast genomes in a multi-step approach employing a pipeline that involved a combination of both reference guided and de novo assembly approaches. First, pairedend sequence reads were trimmed to remove adaptors and low-quality sequences using Trimmomatic 0.39 [33] with the following parameters: LEADING = 20, TRAILING = 20, SLIDINGWINDOW = 4:15, MINLEN = 36, and AVGQUAL = 20. Second, contigs were assembled from the high quality paired-end reads by using SPAdes software version 3.6.1 [34] (Kmer = 95). Third, the relative order and orientation of the chloroplast genome contigs were determined by BLAST searches against the chloroplast genome of C. hupehensis (MW201730) [35]. Subsequently, the selected chloroplast-like contigs were assembled with Sequencher 4.10. (https://www.genecodes.com/, accessed on 2 June 2021). Then, the Crataegus chloroplast genomes were manually edited and annotated with the commandline Perl script Plann [36]. Comparison with C. hupehensis (MW201730)'s homologous genes determined intron positions, putative starts and stops, and initial annotation. Finally, a circular map of the chloroplast genome was illustrated using Genome Vx [37]. Default parameters were used for SPAdes, BLASTing, Sequencher 4.10, Plann, and Genome Vx.

Analysis of Microsatellites and Repeat Sequences
The MIcroSAtellite program (MISA) (http://pgrc.ipk-gatersleben.de/misa/misa.html, accessed on 15 June 2021) was applied to identified SSRs in the Crataegus chloroplast genomes. The repeat number thresholds were set as follows: Three repeat units for tetra-, penta-, and hexa-nucleotid; four repeat units for trinucleotide; five repeat units for dinucleotide; and 10 repeat units for mononucleotide SSR motifs, respectively. REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputer, accessed on 15 June 2021) software was used to find and analyze the size and the positions of repeats (forward, complement, palindromic, and reverse) within the Crataegus chloroplast genomes [38].

Variation Hotspots Detection and Sequence Divergence Analysis
The four Crataegus chloroplast genomes were aligned by MAFFT software version 7 [39] with default parameters and adjusted manually where necessary. Sliding window analysis was conducted in DnaSP software version 6.0 [40] to evaluate plastomic nucleotide variability. The window length was set to 600 bp, while the step size was set to 100 bp. MEGA 7.0 software [41] was used to determine the variable and parsimony-informative base sites in the LSC, SSC, and IR regions of the four Crataegus chloroplast genomes, as well as the whole chloroplast genomes.The p-distances among four Crataegus chloroplast genomes was calculated to assess divergence of Crataegus species with MEGA software.

Comparative Genome Analysis
mVISTA software [42] was used to compare the complete chloroplast genomes of the four Crataegus species. C. bretschneideri was regarded as the reference sequence. On the basis of annotations, the chloroplast genome borders between the IR, LSC and SSC, and IR regions of the four Crataegus species were also compared and illustrated.

Phylogenomic Analysis
The Bayesian inference (BI) and maximum likelihood (ML) methods wereemployed to plot phylogenetic trees with the complete chloroplast genomes. The entire chloroplast genomes sequences of seven species of Crataegus and another 25 plastomes of species of Maloideae were used in the phylogenetic analyses, with four plastomes of the sister group Rosoideae as an outgroup. The best-fitting substitution model GTR+F+I+G4 was employed using Modelfinder [43] on the basis of Bayesian information criterion. IQ-TREE software was used to perform ML calculations [44]. Bootstrap analysis was conducted with 1000 replicates. BI analysis was conducted by MrBayes software. Bayesian analysis was run for 10,000,000 generations with sampling trees every 1000 generations and the first 25% were removed as burn-in. The average standard deviation of the split frequencies was >0.01.

Genome Organization and Features
The complete chloroplast genome sequences of the four Crataegus species were similar in size, ranging from 159,607 bp for C. bretschneideri to 159,875 bp for C. maximowiczii ( Figure 1; Table 2). All four chloroplast genomes contained a typical quadripartite structure and is composed of a pair of IRs (26,384 bp), which are separated by LSC (87,601-87,874 bp) and SSC (19,312 bp) regions. The fully annotated genome sequences have been submitted to the GenBank database (accession numbers, MW963339 for C. bretschneideri, MZ494512 for C. maximowiczii, MZ494514 for C. pinnatifida, and MZ494513 for C. pinnatifida var. major).
Results of the genome annotation revealed a total of 113 genes in the four chloroplast genomes of Crataegus, consisting of 79 coded proteins, fourribosomal RNA, and 30 tRNA genes ( Figure 1; Table 3 Among the pair of inverted repeats, four rRNA genes, seven tRNA genes, and eightprotein genes are presented in the IRb repeat in SSC region ( Figure 1; Table 3). Fourteen genes have one single intron while twogenes have two introns (clpP and ycf3) ( Figure 1; Table 3). The rps12 gene was found to be a trans-spliced gene. Its 5 -end exon is located in the LSC region and its 3 -end is duplicated in the IR region ( Figure 2). The trnK-UUU gene has the longest intron, and ycf1 has the shortest intron.

Divergence Analysis of Sequence and High Variation Region
Next, genome-wide comparative analyses of the four Crataegus chloroplast genomes were performed using mVISTA to evaluate the level of sequence divergence ( Figure 3). The chloroplast genomes exhibit strong sequence similarity, indicating that the plastomes are highly conserved. Compared to the non-coding regions and single-copy, the coding regions and IR are more conserved, with low variation among Crataegus. Moreover, the coding regions of ndhA and ycf1 are more variable compared with those of other genes.

IR Expansion and Shrinkage
Expansion and shrinkage of the inverted repeats region is a crucial aspect of the plastomes, which is the significant reason for the different sizes of the plastomes and can be used for phylogenetic study in plants. Additionally, the expansion and shrinkage of IR boundaries are evolutionary events that result in variation of chloroplast genome size.
Comparison details of the IR-LSC and IR-SSC borders of the four Crataegus plastomes are shown in Figure 2. The LSC-IRa boundary is indicated by the presence of the rps19 gene. In C. pinnatifida, C. maximowiczii, and C. pinnatifida var. major, 159 bp of the rps19 gene can be found in the LSC region. However, in C. bretschneideri, 165 bp of rps19 are located within the LSC region. The SSC-IRa boundary is characterized by two genes-ycf1 (fragmented) and ndhF-which exhibit overlapping coding regions ( Figure 2). The overlap of these two genes is conserved (20 bp) in each of the four Crataegus chloroplast genomes ( Figure 2). ndhF is situated in the SSC region and 2244-2253 bp in length. In the four Crataegus chloroplast genomes, the SSC-IRb junction contains the full-length ycf1 gene, whereas the SSC-IRb junction contains the rps19 (fragmented) and trnH genes ( Figure 2).

Divergence Analysis of Sequence and High Variation Region
Next, genome-wide comparative analyses of the four Crataegus chloroplast genomes were performed using mVISTA to evaluate the level of sequence divergence ( Figure 3). The chloroplast genomes exhibit strong sequence similarity, indicating that the plastomes are highly conserved. Compared to the non-coding regions and single-copy, the coding regions and IR are more conserved, with low variation among Crataegus. Moreover, the coding regions of ndhA and ycf1 are more variable compared with those of other genes.

Divergence Analysis of Sequence and High Variation Region
Next, genome-wide comparative analyses of the four Crataegus chloroplast genomes were performed using mVISTA to evaluate the level of sequence divergence ( Figure 3). The chloroplast genomes exhibit strong sequence similarity, indicating that the plastomes are highly conserved. Compared to the non-coding regions and single-copy, the coding regions and IR are more conserved, with low variation among Crataegus. Moreover, the coding regions of ndhA and ycf1 are more variable compared with those of other genes.  Additionally, single nucleotide substitutions ( Figure 4) and nucleotide diversity were compared (Table 4). Four-hundred-and-forty-fivevariable sites (0.28%) and 331 parsimony-informative sites (0.21%) were detected in the four Crataegus plastomes. There were 228 and 87 parsimony-informative sites in the LSC and SSC regions, while only 16 parsimonyinformative sites were detected in IR regions. The SSC region exhibited the highest nucleotide diversity (0.0036), followed by the LSC region (0.0022) and the IR region (0.0003). The average nucleotide diversity value was 0.0017.

Repeat Structure and SSR Analysis
Next, repeat sequences were examined in the four Crataegus plastomes ( Figure 6; Table S1). A total of 196 repeat sequences, containing forward, reverse, complement, and palindromic repeats, were observed among the four Crataegus plastomes. Among all the detected repeats, forward (48.5%) and palindromic repeats (43.4%) are relatively common, whereas reverse (7.1%) and complement repeats (1%) are comparatively rare ( Figure 6; Table S1). One pair of complement repeats is only present in C. pinnatifida var. major and C. pinnatifida. The sizes of the repeats among the four plastomes vary from 30 to 63 bp. The majority of repeats (68.9%) are limited to 30-34 bp in size ( Figure 6; Table S1).
We found 386 SSRs repeat motifs in the plastomes of the four Crataegus species using MISA software (Figure 7; Table S2). The number of SSRs detected ranged from 94 (C. bretschneideri) to 98 (C. pinnatifida var. major) (Figure 7; Table S2). Among these SSRs, most are located in the LSC regions (313 SSRs), followed by SSC regions (41 SSRs) and IR regions (32 SSRs). We also observed that a majority of the SSRs are situated in the spacers (288 SSRs), while 54 SSRs and 44 SSRs are situated in the introns and exons, respectively. The majority of SSRs are mononucleotide repeats, which account for the total number of SSRs at 70.98%, followed by dinucleotide repeats at 20.98 and tetranucleotide repeats at 54.4% ( Figure 7D; Table S2). The A/T mononucleotide repeats are the most abundant SSRs in the four Crataegus species ( Figure 7C; Table S2). The TA/AT dinucleotide repeats are the second most common SSRs, followed by mononucleotide C and tetranucleotide TTTA ( Figure 7C; Table S2).

Repeat Structure and SSR Analysis
Next, repeat sequences were examined in the four Crataegus plastomes (Figure 6; Table S1). A total of 196 repeat sequences, containing forward, reverse, complement, and palindromic repeats, were observed among the four Crataegus plastomes. Among all the detected repeats, forward (48.5%) and palindromic repeats (43.4%) are relatively common, whereas reverse (7.1%) and complement repeats (1%) are comparatively rare ( Figure 6; Table S1). One pair of complement repeats is only present in C. pinnatifida var. major and C. pinnatifida. The sizes of the repeats among the four plastomes vary from 30 to 63 bp. The majority of repeats (68.9%) are limited to 30-34 bp in size ( Figure 6; Table S1). We found 386 SSRs repeat motifs in the plastomes of the four Crataegus species using MISA software (Figure 7; Table S2). The number of SSRs detected ranged from 94 (C. bretschneideri) to 98 (C. pinnatifida var. major) (Figure 7; Table S2). Among these SSRs, most are located in the LSC regions (313 SSRs), followed by SSC regions (41 SSRs) and IR regions (32 SSRs). We also observed that a majority of the SSRs are situated in the spacers (288 SSRs), while 54 SSRs and 44 SSRs are situated in the introns and exons, respectively. The majority of SSRs are mononucleotide repeats, which account for the total number of SSRs at 70.98%, followed by dinucleotide repeats at 20.98 and tetranucleotide repeats at

Repeat Structure and SSR Analysis
Next, repeat sequences were examined in the four Crataegus plastomes (Figure 6; Table S1). A total of 196 repeat sequences, containing forward, reverse, complement, and palindromic repeats, were observed among the four Crataegus plastomes. Among all the detected repeats, forward (48.5%) and palindromic repeats (43.4%) are relatively common, whereas reverse (7.1%) and complement repeats (1%) are comparatively rare ( Figure 6; Table S1). One pair of complement repeats is only present in C. pinnatifida var. major and C. pinnatifida. The sizes of the repeats among the four plastomes vary from 30 to 63 bp. The majority of repeats (68.9%) are limited to 30-34 bp in size ( Figure 6; Table S1). We found 386 SSRs repeat motifs in the plastomes of the four Crataegus species using MISA software (Figure 7; Table S2). The number of SSRs detected ranged from 94 (C. bretschneideri) to 98 (C. pinnatifida var. major) (Figure 7; Table S2). Among these SSRs, most are located in the LSC regions (313 SSRs), followed by SSC regions (41 SSRs) and IR regions (32 SSRs). We also observed that a majority of the SSRs are situated in the spacers (288 SSRs), while 54 SSRs and 44 SSRs are situated in the introns and exons, respectively. The majority of SSRs are mononucleotide repeats, which account for the total number of SSRs at 70.98%, followed by dinucleotide repeats at 20.98 and tetranucleotide repeats at 54.4% ( Figure 7D; Table S2). The A/T mononucleotide repeats are the most abundant SSRs in the four Crataegus species ( Figure 7C; Table S2). The TA/AT dinucleotide repeats are the second most common SSRs, followed by mononucleotide C and tetranucleotide TTTA ( Figure 7C; Table S2).

Phylogenetic Analysis
Phylogenetic analysis was conducted using 36 entire chloroplast genomes, including those from sevenCrataegus species, 25 other Maloideae species, and four Rosoideae chloroplast genomes as the outgroup. The phylogenetic trees obtained from the ML and BI had similar topologies (Figure 8), showing two major branches. Maloideae genera (Amelanchier, Crataegus, Pyrus, Sorbus, Cotoneaster, Photinia, Eriobotrys, Osteomeles, Malus, and Cydonis) formed a monophyletic group. Within the Maloideae clade, Crataeguswas placed as closely related to the genus Amelanchier. Crataegus species constituted a monophyletic group with 100% support.

Phylogenetic Analysis
Phylogenetic analysis was conducted using 36 entire chloroplast genomes, including those from seven Crataegus species, 25 other Maloideae species, and four Rosoideae chloroplast genomes as the outgroup. The phylogenetic trees obtained from the ML and BI had similar topologies (Figure 8), showing two major branches. Maloideae genera (Amelanchier, Crataegus, Pyrus, Sorbus, Cotoneaster, Photinia, Eriobotrys, Osteomeles, Malus, and Cydonis) formed a monophyletic group. Within the Maloideae clade, Crataegus was placed as closely related to the genus Amelanchier. Crataegus species constituted a monophyletic group with 100% support.

Genome Features and Sequence Divergence among CrataegusSpecies
In present study, we sequenced the whole plastid genomes of four Crataegus species using Illumina HiSeq 4000 platform. The sizes of the four plastomesranged from 159,607 bp (C. bretschneideri) to 159,875 bp (C. maximowiczii). The chloroplast genomes of the four Crataegusspeciesc contain 113 genes, consisting of79 protein-coding genes, fourrRNA genes, and 30 tRNAgenes. The ycf15 and ycf68 genes were not annotated because these two genes were identified as pseudogenes comprising several internal stop codons [24]. In some species, ycf2, rpl23, and accD are missing from the chloroplast genome [22,45,46]; however, these genes are indeed present in Crataegus.Similar to most plants, the plastomes of the four Crataegus species are conserved, and no rearrangement events were found. The results of mVISTA and nucleotide diversity analyses revealed high levels of similarity among the plastomes, indicating that divergence of the four Crataegus plastomes is lower

Genome Features and Sequence Divergence among Crataegus Species
In present study, we sequenced the whole plastid genomes of four Crataegus species using Illumina HiSeq 4000 platform. The sizes of the four plastomesranged from 159,607 bp (C. bretschneideri) to 159,875 bp (C. maximowiczii). The chloroplast genomes of the four Crataegus speciesc contain 113 genes, consisting of 79 protein-coding genes, fourrRNA genes, and 30 tRNAgenes. The ycf15 and ycf68 genes were not annotated because these two genes were identified as pseudogenes comprising several internal stop codons [24]. In some species, ycf2, rpl23, and accD are missing from the chloroplast genome [22,45,46]; however, these genes are indeed present in Crataegus.Similar to most plants, the plastomes of the four Crataegus species are conserved, and no rearrangement events were found. The results of mVISTA and nucleotide diversity analyses revealed high levels of similarity among the plastomes, indicating that divergence of the four Crataegus plastomes is lower than that in other species [27,47,48]. Furthermore, lower sequence divergence in the IR region was detected compared to SSC and LSC regions, which has been previously reported [49,50]. One possible reason is that in the chloroplast genome, which has multiple copies per cell, gene conversion with a slight bias against new mutations would decrease the mutation load in the two IR regions much more efficiently than in the single-copy regions due to the duplicative characteristic of the IRs [51][52][53][54]. The expansion and shrinkage of the IR and single-copy junction regions is considered the leading mechanism driving variation in the size of angiosperm plastoms, thus playing a vital role in their evolution [27,55,56]. Although the overall genomic structure, such as gene order and gene number, is highly conserved, the chloroplast genome of C. bretschneideri exhibits significant differences at the IR/single-copy junction regions (Figure 2). We found that the contraction of IR regions caused the chloroplast genomes to decrease in size, as has been previously reported in other plants [57][58][59]. However, the size of the whole chloroplast genome does not always vary with expansion or contraction of IRs [31,60].

Repeat Structure and SSR Analysis of the Plastomes of Crataegus Species
REPuter software was used to determine repeat sequences in the four Crataegus chloroplast genomes with a copy size of 30 bp or longer and sequence identity >90% as criteria. A total of 196 repeats, comprising forward, complement, palindromic, and reverse repeats, were detected ( Figure 6; Table S1). LSC and spacer regions harbored a majority of repeats; SSC and intronic regions contained a minority of repeats ( Figure 6; Table S1). No rearrangements were identified in the four species examined, possibly due to a lack of large, complex repeating sequences (>100 bp). Repeat sequences, which play a key role in re-configuration of the genome, are useful for phylogenetic analysis [61,62]. Owing to improper recombination and slipped-strand mispairing of repeats sequences, genome rearrangement and sequence variation happen [26,56]. The occurrence of these repeats shows that these loci are crucial hotspots for genome reconfiguration [23,61]. Additionally, these repeats enable the development of molecular markers for phylogenetic and population genetics studies [63], with potential applications in Crataegus species.
SSRs, also known as microsatellites, have broad applications in population genetics and plant breeding programs [64,65]. The high polymorphism rates of SSRs are owing to slipped-strand mispairing on single DNA strands during DNA replication [66]. However, only a few chloroplast microsatellite loci have been identified in the genus Crataegus [67,68], which has hindered the identification, conservation, utilization, and breeding of Crataegus species in the context of population genetics and phylogeographic studies. A total of 386 SSRs were detected among the chloroplasts of the Crataegus species examined in this study (Figure 7; Table S2). Among the 386 SSRs, most loci are located within spacers (74.6%), followed by introns (14.0%) and exons (11.4%) ( Figure 7A; Table S2), which is congruent with the results of similar researches of other plant taxa [24,69]. This may be due to the higher mutation rates in the spacer regions compared to the coding regions. Further analyses showed that a majority of the SSRs were located in the LSC region, whereas 8.3% and 10.6% were located in the IR and SSC regions ( Figure 7B; Table S2), respectively. These results correspond to those of previous studies [43,70], in which SSRs were found to be unevenly presented in plastomes. Our results may facilitate the selection of valuable genetic markers for examining intra-and interspecific polymorphisms. Furthermore, most mononucleotide and dinucleotide repeats are composed of A and T ( Figure 7C,D; Table S2), which may contribute to a bias in base composition, congruent with other plastomes [71]. This indicates that the Crataegus chloroplast genome contains polyA and polyT repeats with irregular G and C repeats, similar to various other species [25,57]. In general, the SSRs in the four Crataegus plastomes examined in this study exhibit high levels of variation and can serve as potential molecular markers for future population genetic studies of Crataegus species.

Potential Highly Variable Chloroplast Barcodes
Comparison analysis of the chloroplast genome sequences indicated several regions of sequence polymorphisms (Figure 3). In accord with recent research [23,24], most of the sequence variations are distributed in the LSC and SSC regions, whereas the IR regions exhibit comparatively less sequence variation. The lowersequence divergence of the IR region compared to the single-copy regions in Crataegus species and other plants may be due to copy correction between IR sequences during gene conversion [24,72].
Increasingly more studies have indicated that universal DNA markers have low sequence divergence and poor discriminatory power [37,73]. Previously, several chloroplast and nuclear DNA markers have been used for phylogenetic analysis of Crataegus and to resolve intraspecific and interspecific relationships [13,67,68]. Because Crataegus is widely distributed in Eurasia and North America [1], it is challenging to carry out DNA barcoding and taxonomic assessments for this genus. Therefore, the development of novel markers and broader taxonomic sampling is necessary to provide greater phylogenetic resolution at low taxonomic levels. Additionally, hawthornis one of the most important processing and table fruits in China [3]; the study of its taxonomy, genetics, conservation, and identification is hindered by a lack of genomic resources for Crataegus. Chloroplast genome sequences present an important clue for investigating genome evolution and produce valuable genetic resources for further studies in future.
The ndhC-trnV(UAC)-trnM(CAU) region is composed of two intergenic spaces (ndhC-trnV(UAC) and trnV(UAC)-trnM(CAU)) and an intron (trnV) with an average length of 1416 bp; this region is the most variableamong the four Crataegus plastomes ( Figure 5). ndhC-trnV, trnV, and trnV-trnM were suggested by the authors of [28,60] to be high-variability markers that can be used for DNA barcoding and molecular phylogenetic studies. The trnR-atpA is part of the trnG-atpA intergenic marker, which is split into two intergenic regions: trnG-trnR and trnR-atpA. The trnG-atpA region suggested by the authors of [27] was found to be a high-variability marker in Corylus. The ndhF and ndhF-rpl32 regions have been extensively applied in phylogenetic analysis [31,56,74,75]. Two rarely reported highly variable regions, ndhA and atpH-atpI, are distributed in the four Crataegus chloroplast genomes and were detected in the present study.

Phylogenetic Relationships
As one of the original cultivation centers, China has a long history of cultivating and collecting hawthorns [15]. A total of 18 species and sixvarieties of Crataegus have been identified and confirmed in China [10], though valuable cultivated varieties aremainly derivedfrom four species: C. pinnatifida, C. scabrifolia, C. hupehensis, and C. bretschneideri. However, the interspecific relationships of cultivated Crataegus in China remain unclear. Our previous study revealed that C. bretschneideri might have arisen through hybridization between C. maximowiczii and C. pinnatifida, and that introgression happened between C. hupehensis, C. pinnatifida, and C. pinnatifida var. major [17]. In this study, the whole chloroplast genome sequences of C. bretschneideri, C. pinnatifida, C. maximowiczii, and C. pinnatifida var. major were used to further assess their phylogenetic relationships. We sequenced the plastomesof the four Crataegus species, thus reporting the first comprehensive analysis of Crataegus chloroplast genomes. For the phylogenomic analysis, we also examined three published Crataegus whole chloroplast genomes obtained from the GenBank database: those of C. hupehensis (MW201730.1), C. pinnatifida (MN102356.1), and C. marshallii (MK920293.1).
We then performedphylogenetic analysis of seven Chinese Crataegus species on the basis of their entire chloroplast genomes. The ML phylogenetic tree revealed the presence of two major clusters (Figure 8). One cluster included C. bretschneideri, C. marshallii, and C. maximowiczii, in which C. bretschneideri and C. maximowiczii clustered into a subclade and formed a sister relationship with the C. marshallii subclade. This suggests that C. bretschneideri is a distinct Crataegus species, rather than a variant of C. pinnatifida, which is in agreement with earlier studies [15,16]. However, the phylogenetic tree indicated C. bretschneideri is more closely related to C. maximowiczii than to C. pinnatifida, which differs from the findings reported in [16]. Our previous results indicated that C. bretschneideri might have arisen through hybridization between C. maximowiczii and C. pinnatifida [17]; given the maternal inheritance of chloroplasts, the present results suggest that C. maximowiczii is the maternal origin of C. bretschneideri. The other cluster included C. hupehensis, C. pinnatifida, and C. pinnatifida var. major; C. pinnatifida (MZ494514) and C. pinnatifida var. major were grouped together, with C. hupehensis and C. pinnatifida (MN102356.1) clustering as a sister group. The variation among the chloroplast sequences matched the differences in the geographical distribution of each species, suggest that repeated chloroplast DNA introgression led to this pattern [76]. Crataegus pinnatifida (MN102356.1) from the southwestern region of China and C. pinnatifida (MZ494514) from the northern region of China did not cluster together into a subclade, indicating that C. pinnatifida may hybridize with other species to accomplish chloroplast DNA introgression and interspecific transfer. In our previous study, specific locus amplified fragment sequencing showed thatintrogression occurredbetween C. pinnatifida, C. pinnatifida var. major, and C. hupehensis [17]. Chloroplast capture is an important process of plant evolution [76]. Due to hybridization and repeated backcross, the cytoplasm of one species may be replaced by the cytoplasm of the other species through gene flow infiltration. Therefore, the genetic components of one species not only have nuclear genome components inherited from parents, but also capture new chloroplast gene components. Increasingly more studies have proved the phenomenon of organelle DNA introgression [27,77]. The result presented here also supports our previous conclusion that introgression happened between C. hupehensis, C. pinnatifida, and C. pinnatifida var. major [17]. Based on the present study and our previously published data, we hypothesize that partial C. pinnatifida germplasms arose via the hybridization of C. hupehensis and C. pinnatifida, and that C. pinnatifida var. major might have been artificially selected and domesticated from hybrid populations.

Conclusions
In the present study, the complete chloroplast genomes of three cultivated species and one related Crataegus species were sequenced and assembled. We provided valuable genomic resources for Crataegus. Comparative analyses of the plastomes identified variable regions with potential application as species-specific DNA barcodes. The six hypervariable hotspots, 196 repeats, and 386 SSRs detected should facilitate phylogenetic analyses and the development of molecular markers. Our whole-chloroplast phylogenomic analysis provided valuable information that partially uncovered the phylogenetic relationships of cultivated Crataegus in China. Furthermore, our findings suggest that C. bretschneideri is a distinct Crataegus species, rather than a variant of C. pinnatifida. Combined with our previous study, the present work indicates that C. maximowiczii is the maternal origin of C. bretschneideri. Our data also suggest that introgression happened between C. hupehensis, C. pinnatifida, and C. pinnatifida var. major. Furthermore, we hypothesize that C. pinnatifida var. major might have been artificially selected and domesticated from hybrid populations, rather than evolved from C. pinnatifida. The genetic resources obtained in this study will facilitate future research into the population genetics, species identification and conservation of Crataegus.