Reassessment of Annamocarya sinesis (Carya sinensis) Taxonomy through Concatenation and Coalescence Phylogenetic Analysis

Due to its peculiar morphological characteristics, there is dispute as to whether the genus of Annamocarya sinensis, a species of Juglandaceae, is Annamocarya or Carya. Most morphologists believe it should be distinguished from the Carya genus while genomicists suggest that A. sinensis belongs to the Carya genus. To explore the taxonomic status of A. sinensis using chloroplast genes, we collected chloroplast genomes of 16 plant species and assembled chloroplast genomes of 10 unpublished Carya species. We analyzed all 26 species’ chloroplast genomes through two analytical approaches (concatenation and coalescence), using the entire and unique chloroplast coding sequence (CDS) and entire and protein sequences. Our results indicate that the analysis of the CDS and protein sequences or unique CDS and unique protein sequence of chloroplast genomes shows that A. sinensis indeed belongs to the Carya genus. In addition, our analysis shows that, compared to single chloroplast genes, the phylogeny trees constructed using numerous genes showed higher consistency. Moreover, the phylogenetic analysis calculated with the coalescence method and unique gene sequences was more robust than that done with the concatenation method, particularly for analyzing phylogenetically controversial species. Through the analysis, our results concluded that A. sinensis should be called C. sinensis.


Introduction
Annamocarya sinensis, controversially called Carya sinensis, is widely distributed in southern China and northern Vietnam. As a deciduous tree, it is generally about 30 m tall, has a 125-cm diameter trunk, and has grayish bark with alternate leaves (Figure 1) [1]. As the individual number of A. sinensis has reduced and its distribution narrowed, it is now a Class II endangered species in China [2]. It belongs to the Juglandaceae family, but its genus is still controversial; while most plant scientists classify it as Annamocarya, some classify it as Carya. This dispute has been caused by different taxonomic approaches, especially in morphology and evolutionary genomics.
Dode (1912) first described A. sinensis as C. sinensis based on limited herbarium samples, seed specimens without leaves, and flower samples [3]. Hence, his classification was controversial [4]. Chevalier (1941) [5] described the morphological characteristics of the leaves and nuts and concluded that it belonged to a new, single genus, Annamocarya. Nevertheless, Manning and Hjelmqvist (1951) [4] questioned whether, given that Carya and it shared plenty of similar morphological characteristics, it was necessary to describe this uncertain species as a single genus. Most morphologists believe that it is distinguished from Carya based on several morphological characteristics. For example, the leaves of A. sinensis are smooth, while Carya has serrated leaves; A. sinensis has five to eight catkins per bundle, whereas Carya has only three. The staminate bract of A. sinensis have three or more vascular strands, but only one in Carya [6]. And Leroy (1955) [7] proposed that the characters of the shell might be an important reason to distinguish it from Carya. Grauke et al. (1991) [3] compared the differences between Carya and Annamocarya in terms of tree shape, leaves, flowers, fruit shape, and other samples collected from Cuc Phuong National Forest in Vietnam. They concluded that the separate genus Annamocarya was necessary. As more morphological pieces of evidence became available, the genus Annamocarya was adopted over a long time. In the 21st century, with the rapid development of sequencing technology, researchers have increasingly used gene sequences to study the evolutionary history of species with a phylogenomic approach. Manos and Stone (2001) [6] analyzed the phylogeny of Juglandaceae by using the internal transcribed spacer (ITS) and a few chloroplast genes. Their results suggest that A. sinensis is closely related to the Carya genus. In contrast, Li et al. (2004) [8] separated A. sinensis and Carya on different branches of infrafamilial relationships of the order Fagales based on six genes (atpB, matK, matR, rbcL, trnL-F and 18S rDNA). Manos et al. (2007) [9] investigated the evolution of Juglandaceae by integrating fossils and nucleic sequence evidence. Several results in their study suggest that A. sinensis should still be grouped within the Carya genus.   [10] used six plastid fragments (matK, rbcL-atpB, rpoC1, rps16, trnH-psbA, and trnL-F) and nuclear markers (ITS and phyA) in their study; the results suggest that A. sinensis should be clustered within the Carya genus. It seems, based on these examples, that sequence analysis is likely to classify A. sinensis in the Carya genus.
Recently, nuclear and organelle genes have commonly been used to analyze the phylogeny of species. Among them, chloroplast genome data has been utilized due to its low evolution rate and stable in variation [11]. It has been widely used to study phylogeny, chloroplast inheritance, domestication history, and adaptative evolution [12,13]. For example, Wu et al. (2020) [14] constructed a phylogenetic analysis of Chrysosplenium based on the whole chloroplast genome. They conclude that Chrysosplenium could be divided into two subgroups: those having alternate leaves or opposite leaves. Summarizing empirical studies that studied the phylogeny of Carya based on chloroplast genes, the similarity among them is that only a few chloroplast genes or fragments were utilized [6,9,10]. Researchers have found that phylogeny trees built from individual genes may show discordance on the species tree [15,16]. Furthermore, as the number of gene trees increases, they may converge in probability to the true species tree [17]. Hence, phylogenetic trees constructed entirely of chloroplast genome genes may be reliable.
Since the development of next-generation sequencing and computational phylogenomics, the reconstruction of angiosperm phylogenies from multiple genes has relied upon concatenation methods [18]. The concatenation method concatenates multiple gene sequences of each species and treats them as one alignment to generate a phylogenetic tree ( Figure 2A). As the number of subsampled nuclear genes increases, the result of concatenation analyses becomes dependable [18]. However, a simulation study by Kubatko and Degnan (2007) [19] shows that such approaches could lead to misleading phylogenetic calculations. They demonstrated that using concatenated data, species tree estimation performed deficiently and was statistically inconsistent even under stable population size without selection or population stratification. In addition to concatenation method, recent studies have also taken advantage of coalescence method for phylogenetic analysis [20,21].This method first computes gene trees by the individual gene sequences and then aggregates all gene trees to obtain the final species tree ( Figure 2B) [15]. Goncalves and colleagues [22] conducted a phylogenetic analysis of 78 plastid genes, and their results showed that the phylogenetic tree inferred by the coalescence method was reliable. Xi et al. (2013) [23] propose that the coalescence method might reduce the potential deleterious effect of elevated substitution rates in phylogenomic analyses. To test this, Xi et al. (2014) [18] used the two methods to analyze the phylogeny of Amborella within 45 seed plants and found that fast-evolving sites likely disrupt the concatenation method, while the coalescence method appears more robust in response to elevated substitution rates. In this work, we first assembled 10 chloroplast genomes of the Carya species for observing the phylogenetic relationship between A. sinensis and Carya. We determined the true phylogenetic location of A. sinensis by comparing and evaluating the concatenation and coalescence methods with the chloroplast genome of 26 species. Our results indicate that A. sinensis was clustered within the Carya genus. Hence, we conclude that Annamocarya sinesis should be referred to as Carya sinensis. Moreover, the phylogeny results constructed by the coalescence method are more robust than those by the concatenation method.

Results
We analyze the phylogenetic position of A. sinensis using the genes of the chloroplast genome. A total of 14 species belonging to the Carya genus of Eastern Asia and North America were included in this work (four species of Eastern Asia include C. cathayensis, C. dabieshanensis, C. hunanensis, and C. kweichowensis; 10 species of North America include C. aquatica, C. cordiformis, C. glabra, C. illinoinensis, C. laciniosa, C. myristiciformis, C. ovata, C. palmeri, C. texana, and C. tomentosa) [10]. The CDS and protein sequences of all 26 species of six genera (Betula, Carya, Cyclocarya, Juglans, Platycarya, and Pterocarya) were used for the analysis.

Assemblies and Annotations of Chloroplast Genomes
We first obtained the raw sequencing data of 10 species of the Carya genus (Table S1). Then, we assembled the 10 complete closed-loop chloroplast genomes of these species ( Figure 3 and Figure S1) for the analysis below. From the assemblies, we learned that the average size of the 10 species was about 160 kb ( Table 1). The chloroplast genome of C. cordiformis was the longest at 160,796 bp, and the shortest one was C. dabieshanensis at 160,037 bp; the difference between the two was 759 bp. The GC content of the assembled 10 chloroplast genomes was similar, around 36%.  Figure S1. We also obtained the chloroplast genomes of 16 species of six genera from other database, used GeSeq [24] to reannotate the chloroplast genomes of all 26 species to standardize the annotations previously generated by different annotation tools. The chloroplast genomes of the Carya genus contained an average of 79 coding genes, suggesting that they were highly conserved ( Table 2). The Cyclocarya and Pterocarya genera contained the most coding genes. The CDS GC content of all species was approximately 37.22%, with little difference. The uniformly reannotated chloroplast genomes were used for downstream analysis.

Unique Genes of the Chloroplast Genomes
After uniformly reannotating the chloroplast genomes, we used Orthofinder to identify 50 unique genes (Table 3) based on these chloroplast protein sequences. All 50 unique genes were single-copy genes in the chloroplast genome of all 26 species. Most of them were related to photoreaction and ribosome function. And according to these unique genes, we then extracted 50 unique CDS sequences and 50 unique protein sequences for downstream analysis. After extracting 50 unique genes, we constructed the phylogenetic trees using each individual gene. However, most of the 50 chloroplast gene trees ( Figure S2) showed chaotic phylogenetic relationships. For example, matK, ndhF, rbcL and rpoC1 genes (Figure 4), which are commonly used to construct phylogenetic trees, indicated that species of the genus Juglans failed to group together. And even if we rerooted Betula platyphylla as an outgroup, it cannot be separated from Platycarya strobilaceaat some point ( Figure 4A,B). The rpoC1 gene phylogenetic tree was the only tree correctly classified all genera and clustered pecans into East Asia and North America. Most phylogenetic trees showed the analyzed species were confusingly classified into the wrong genera. The evolutionary relationship of these results was unstable and inconsistent. Hence, we utilize the following two methods to achieve a certain degree of reliability.

Phylogenetic Analysis Based on the Concatenation Method
We collected whole CDS sequences and protein sequences (79-89 genes in each) from the reannotated chloroplast genomes of all 26 species. Then, we extracted the unique CDS sequences and unique protein sequences (50 genes in each) according to the analysis result of Orthofinder. The four phylogenetic trees of entire CDS sequences, entire protein sequences, unique CDS sequences, and unique protein sequences constructed by the concatenation method are shown in Figure 5A-D. Impressively, we noticed that the outof-group branches (genera of Betula, Cyclocarya, Juglans, Platycarya, and Ptercocarya) in these four trees were clustered in the same clades. Species in genera such as Juglans and Pterocarya were always clustered together. In all the results, as we hypothesized, A. sinensis was classified into the Carya genus branch. The results ( Figure 5B-D) clustered all Carya into Eastern Asian and North American subgroups well except for the phylogenetic tree using the entire CDS sequences ( Figure 5A), which showed low bootstrap credibility. Interestingly, compared to the results of entire gene sequences, the two results based on unique genes ( Figure 5C,D) showed more similarity, indicating that the phylogeny analysis using unique genes was stable. Moreover, A. sinensis was close with C. kweichowensis in these two result trees ( Figure 5C,D). The results are in accordance with the geographical distribution of these two species [2,25].

Phylogenetic Analysis Based on Coalescence Method
Subsequently, we again analyzed the phylogenetics of A. sinensis based on the coalescence method with entire CDS sequences, entire protein sequences, unique CDS sequences, and unique protein sequences. The resulting four phylogenetic trees are shown in Figure 6. It should be noted that these coalescence results were computed by ASTRAL, which used a quartet score to score species trees instead of bootstrap. The closer the score is to 1, the more credible the tree is. Scores of the four results were all greater than 0.8, denoting credibility. As was conjectured, the out-of-group branches (the five genera mentioned before) of the trees were clustered in correct phylogenetic positions with high confidence support (blue pie charts, Figure 6). Similarly, A. sinensis in all results was consistently classified into the Eastern Asian Carya genus branch, especially close with C. kweichowensis. The conflict analysis of CDSs ( Figure 6A,C) showed more concordant with gene trees than protein sequences ( Figure 6B,D). The clades of Northern American Carya, on the other hand, showed much common conflicts (green pie charts, Figure 6B,D), perhaps because of the degeneracy among these close species. Surprisingly, all leaves of phylogenetic tree constructed by the coalescence method, either using entire-gene or unique-gene sequences, were almost identical ( Figure 6).

Discussion
The taxonomic status of A. sinensis has long been debated. Morphologists distinguished it from the Carya genus by morphological features such as leaves, bunches, bracts, and nut shells. However, the accuracy of sequencing technologies nowadays makes molecular phylogeny being more common and credible [26]. Following the release of raw sequencing data (Table S1) for some species of Carya [27], we have assembled and annotated the chloroplast genomes of 10 species of Carya that have not yet been published. Among them, nine North American species were assembled using C. illinoinensis (MH909600) as a reference chloroplast genome, and only C. dabieshanensis (Eastern Asian) was assembled using C. cathayensis (NC_046572) as a reference chloroplast genome (Table S1). The chloroplast genomes we assembled (Table 1) using GetOrganelle were of high quality; both the size of the LSC/SSC/IR region and the GC content were similar to the published chloroplast genomes of other Carya species [28]. All these characteristics of Juglandaceae chloroplast genomes show stability in variation and low evolutionary rate [11].
Juglans has been divided into a four-section classification (Cardiocaryon, Dioscaryon, Rhysocaryon, Trachycaryon) [29]. Six species of these four sections were selected as the reference taxa in our evolutionary trees. As with the eight phylogenetic trees mentioned in Figures 5 and 6, we observed that all outgroup taxa were clustered into proper phylogenetic positions. In particular, the Carya genus were categorized correctly into Eastern Asia and North America in both methods, which means that the analytical strategies we employed were reliable. Our results show that A. sinensis does have a close phylogenetic relationship with the Carya species, rather than separating itself into an independent clade. Our results also show that A. sinensis close to C. kweichowensis in the Carya clade. Both of these two species are scattered in southwestern China with an endemic distribution area [2,25]. The phylogenetic relationship of our results was highly consistent with geographical distribution, indicating that A. sinensis is reliably clustered in the Carya genus, in the Eastern Asian branch, whether based on molecular or geographical evidence. We conclude that A. sinensis should belong in Carya rather than divided into a single genus. Hence, it should be referred to as Carya sinensis.
The phylogenetic trees created by a single gene indicate that most of these single-gene trees cannot classify all species into the correct genera. It reveals that the evolutionary tree created by this single chloroplast gene may be unreliable. We also analyzed the phylogeny of the entire chloroplast genome of 26 species ( Figure S3). However, the resulting taxa did not conform to the geographical distribution of these species. Therefore, it is necessary to adopt appropriate methods to make the evolutionary results stable and reliable. The two main approaches we put used in this study were concatenation and coalescence. Compared with the traditional concatenation method, the coalescence method, a new computer method, was more effective and robust when applied to phylogeny analysis [20]. The phylogenetic trees created by ASTRAL includes the conflicts of each node (displayed as pie charts), indicating that coalescence results were concordant with highly supported gene trees (blue pie charts, Figure 6). However, some nodes still showed conflicts (green or red pie charts), indicating that the gene trees which were inconsonant with species trees were still not negligible, especially in the American Carya genus branch. Conflicts caused by individual genes trees may be related to several reasons. Some plastid genes are multiple in one species while haploid in another species or the inverted repeats of genes were confused when extracting them and constructing phylogenetic trees. The correction of such conflicts between genes trees and species trees probably should be investigated in further study. The phylogenetic relationship between Carya species computed with the coalescence method ( Figure 6) showed high similarity. In contrast, these results analyzed by the concatenation method showed some heterogeneity ( Figure 5). In addition to the entire chloroplast genes, we also combined unique genes for analysis. The utilization of combined unique genes may provide a complete selection of independent characters for plant phylogenetic analysis [30]. Our results showed that A. sinensis was always next to C. kweichowensis in the Carya clade when using the combined unique gene sequences for the analysis ( Figure 5C,D and Figure 6C,D), indicating that the combined unique gene sequences were stable and consistent for the phylogenetic analysis. The very low non-conservative percentage of the sequence alignment ( Figure S4) of the chloroplast genes of A. sinensis and C. kweichowensis suggested that this approach is reasonable. All these results conclude that the coalescence method using unique genes is recommended when computing phylogenetic species trees, especially for the analysis of phylogenetic controversial or close related species.
Even if molecular analysis evidence seems more accurate in classification, we cannot ignore the practicality and reliability of morphological evidence. Compared with molecular analysis, morphological analysis is serviceable to identify plants in field studies. With permitted conditions, by comparing the molecular and morphological analysis of closely related species, it may be possible to find out the reasons for the differences between them and perhaps prove the correct classification of these species better.

Data Sets
In this work, a total of 26 chloroplast genomes in five genera of Juglandaceae (Carya, Cyclocarya, Juglans, Platycarya, and Pterocarya) and Betula platyphylla were used for phylogeny analysis. Among them, 15 complete chloroplast genomes of Juglandaceae and Betula platyphylla were obtained from National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/, accessed on 1 March 2021; Table S2). The raw sequencing data of another 10 species of Carya (Table S1) [27] whose complete chloroplast genome had not been disclosed were downloaded and assembled. The assembled chloroplast genome of these 10 species was then uploaded to NCBI (Table S3). The chloroplast genome of C. illinoinensis (cultivar Pawnee) was sequenced and assembled by our lab and can be downloaded by the accession GWHBGBH00000000 at National Genomics Data Center (https://ngdc.cncb.ac.cn/, accessed on 1 November 2021) [31,32].

Chloroplast Genomes Assembly and Gene Annotation
The quality of raw sequencing data (raw reads) mentioned above was assessed using fastp v0.20.1 [33] with default parameters. After the quality assessment, the reads were mapped against the chloroplast genome of C. illinoinensis cultivar 87MX3-2.11 (MH909600) using BWA-MEM 0.7.17 [34] to obtain clean chloroplast reads. The SAM mapping results were then transformed to BAM format by SAMtools v1.9 [35,36] for assembly. The complete closed-loop chloroplast genome of each species was assembled using GetOrganelle v1.7.137 [37] with particular reference chloroplast genomes (Table S1). These assembled chloroplast genomes were annotated by the GeSeq tool [24] with C. illinoinensis (MH909600) as the reference genome.

Analysis of Unique Chloroplast Genes
We first used GeSeq to uniformly reannotate the chloroplast genomes of all 26 species to standardize the annotations previously generated by different annotation tools. Then the genomes' protein sequences were subjected to Orthofinder [38,39] analysis with default parameters to obtain the single-copy genes (unique genes). Then both the protein and nucleotide sequences of these unique genes were extracted and prepared for subsequent phylogeny analysis.
The alignment of unique protein and nucleotide sequence files was created with L-INS-I parameters using MAFFT v7.271 [40,41]. Next, these alignment results were then used to construct maximum likelihood phylogenetic trees using IQtree [42] with K3Pu+F as the best fit model and 500 bootstraps.

Phylogeny Analysis Based on Concatenation Method
The entire chloroplast coding sequence (CDS) and protein sequences (entire-genes) of each species (Table 2) were concatenated to make super sequences (Figure 2A) using in-house Linux scripts. The super-entire-CDSs and super-entire-protein sequences of all the chloroplast genomes were acquired. Then, the unique CDS and protein sequences (unique-genes) of all species were concatenated into super-unique-CDSs and super-uniqueprotein sequences. The multiple alignment of all entire-genes and unique-genes of the super sequences were individually constructed using MAFFT v7.271 with FFT-NS-2 parameters. The phylogenetic tree of the concatenation method was obtained using FastTree [43,44] with the Jukes-Cantor model. Finally, these phylogenetic trees were visualized using ITOL [45].

Phylogeny Analysis Based on Coalescence Method
All coding gene sequences of chloroplast genomes of 26 species were extracted using Seqkit [46]. After gathering common coding gene sequences into common files, a total of 73 specific gene sequence files were acquired (Table S4). These 73 CDS and protein sequences were used to contract the parallel-entire-genes phylogeny trees ( Figure 2B). Similarly, the alignment of each specific sequence file was also created using MAFFT v7.271 and constructed phylogenetic trees using IQtree with the same parameters as described in unique chloroplast genes above. All parallel-genes phylogeny trees generated by IQtree were first merged into one file using ASTRAL with the "-b" parameter. And the output (102 trees in default) was then computed to the final species trees using ASTRAL again. The conflicts of nodes of species trees were analyzed using Phy-Parts [47]. The visualization of trees with pie charts were drawn using PhypartsPieChart (https://github.com/mossmatters/phyloscripts/tree/master/phypartspiecharts, accessed on 7 November 2021).

Conclusions
The phylogenetic trees created using a single unique gene are unreliable. The concatenation and coalescence methods possibly are appropriate approaches to study the phylogeny of close species. Compared to the concatenation method, the phylogenetic trees constructed using the coalescence method shows stability and reliability. We collected 16 chloroplast genomes and assembled 10 chloroplast genomes of Carya species to analyze the phylogeny of A. sinensis. Our analysis of the chloroplast genomes with concatenation and coalescence methods showed that A. sinensis is clustered into the Carya genus. Our results concluded that A. sinensis, therefore, should be called C. sinensis.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/plants11010052/s1, Figure S1: The circular chloroplast genome map of 10 assembled chloroplast genome; Figure S2: 50 phylogenetic results constructed by unique chloroplast genes; Figure S3: Phylogeny analysis of the entire chloroplast genome of 26 species; Figure S4: The alignment of chloroplast genomes of A. sinensis, C. cathayensis and C. kweichowensis; Table S1: Assembled chloroplast genome of 10 species;