Conflicts in Mitochondrial Phylogenomics of Branchiopoda, with the First Complete Mitogenome of Laevicaudata (Crustacea: Branchiopoda)

Conflicting phylogenetic signals are pervasive across genomes. The potential impact of such systematic biases may be reduced by phylogenetic approaches accommodating for heterogeneity or by the exclusive use of homoplastic sites in the datasets. Here, we present the complete mitogenome of Lynceus grossipedia as the first representative of the suborder Laevicaudata. We employed a phylogenomic approach on the mitogenomic datasets representing all major branchiopod groups to identify the presence of conflicts and concordance across the phylogeny. We found pervasive phylogenetic conflicts at the base of Diplostraca. The homogeneity of the substitution pattern tests and posterior predictive tests revealed a high degree of compositional heterogeneity among branchiopod mitogenomes at both the nucleotide and amino acid levels, which biased the phylogenetic inference. Our results suggest that Laevicaudata as the basal clade of Phyllopoda was most likely an artifact caused by compositional heterogeneity and conflicting phylogenetic signal. We demonstrated that the exclusive use of homoplastic site methods combining the application of site-heterogeneous models produced correct phylogenetic estimates of the higher-level relationships among branchiopods.


Introduction
With an increasing number of mitogenomes being sequenced and various methodological advances, mitogenomic data have been successfully utilized to improve phylogenetic reconstructions across a wide range of taxa [1][2][3]. The large amount of available mitogenomic data has reduced the stochastic error (sampling error) on phylogenetic inference. Nevertheless, deep relationships between Arthropoda at the interordinal or intraordinal level have not been fully resolved, resulting in topologies with high support frequently conflicting with morphological and nuclear phylogenies [4][5][6]. Such strong support but incorrect phylogenies represent systematic errors which can be traced back to homoplastic characteristics in datasets and model violations [7,8]. Substitutional saturation was the most frequently discussed cause of homoplasy in nucleotide gene data [9]. Most substitution models assume compositional homogeneity (stationary), but nucleotide and protein sequences might also exhibit nonstationarity, which strongly violates the assumptions of the stationary models [10,11]. The most common source of model violations are compositional heterogeneity and rate heterogeneity among lineages [12][13][14].
Suborder Laevicaudata Linder, 1945 (Branchiopoda: Diplostraca), or smooth clam shrimp, is a unique group of essentially benthic micro-crustaceans. Among branchiopods, Laevicaudata can be recognized by a usual body length less than 7 mm, a bivalved carapce, a proportionally large head, bearing a row of large teeth on the mandibular molar surface and having laminae abdominalis supporting egg clusters [15][16][17][18][19]. Laevicaudata currently BioEdit 7.0.9.0 [40]. Sequence annotation and accurate boundary determination of PCGs and rRNA genes were first performed by the NCBI's web interface for BLAST [41] and then by alignment with the homologous genes from other released sequences of Branchiopoda. Miotochondrial tRNA genes and their secondary structures were identified by a combination of MITOS online software [42] and tRNAscan-SE 1.2.1 online software [43].
We estimated saturation for each PCG, and for the three codon positions using DAMBE 6 [45], which determined an "index of substitution saturation" (I ss ) based on the notion of entropy in information theory. We excluded partitions or genes which showed significant nucleotide saturation from phylogenetic analyses. The non-synonymous substitution rate (K a ) for each taxon was calculated in comparison with the outgroup using DAMBE 6 [45].

Analyses of Sequence Heterogeneity and Phylogenetic Signal Dissection
We calculated the base composition of each taxon for each PCG and compared the AT% for each gene among the branchiopod species included in this study. The compositional diversity of amino acids of 13 PCGs across branchiopod suborders was obtained by calculating the frequency of four amino acids which were encoded by GC-rich codons (glycine, alanine, arginine and proline; GARP). The homogeneity of substitution pattern (I D test) for each gene was estimated using a Monte-Carlo method with 1000 replicates implemented in MEGA X [44]. The null hypothesis that sequences have evolved with the same pattern of substitution was rejected at α < 0.01. We also evaluated the compositional heterogeneity in each of the 13 mitochondrial proteins separately by performing posterior predictive analysis (PPA) with the global test statistic as implemented in PhyloBayes 4.1c [46]. A significant conflict between the branchiopod phylogenies is the placement of Laevicaudata. To resolve the deep relationship between Laevicaudata and Onychocaudata and to address the sources of deep phylogenetic conflict, we divided taxa into four clades: (1) Anostraca; (2) Notostraca; (3) Laevicaudata; and (4) Onychocaudata.
Three methods for reducing the nonphylogenetic signals were conducted: (1) exclusion of the genes with the most strongly deviating composition according to the AT% of L. grossipedia; (2) exclusion of the proteins with a significant model violation according to posterior predictive analysis (PPA); and (3) removal of fast-evolving sites. To evaluate the key phylogenetic splits and visualize the phylogenetic content of datasets, we conducted four-cluster likelihood mapping analyses (FcLM) using both nucleotide and amino acid datasets as implemented in TreePuzzle v5.3 [47]. We preferred the topology of the currently accepted relationships within Branchiopoda: (((Laevicaudata, Onychocaudata), Notostraca), Anostraca).

Phylogenetic Analyses under Site-Homogeneous Models
In order to compare the results of phylogenetic inference from different evolutionary models, phylogenetic analyses of four datasets (PCG12, PAA, Pnuc6 and Paa7) were first carried out under site-homogeneous models implemented in RAxML 2.2.3 [48] for maximum likelihood inference (ML) and Bayesian inference using MrBayes 3.2 [49]. Because highly heterogeneous sequence divergence was present across branchiopod clades, and applying standard homogenous models might prompt inaccurate inferences, we did not apply this method to the Paas7 matrix. The best-fit model for the nucleotide dataset (Pnuc6) according to the Akaike Information Criterion (AIC) was determined using jModelTest version 0.1.1 [50], and ProtTest 3 [51] was used for the amino acid dataset (Paa7). The best selected partition schemes and models of two datasets were listed in Supplementary Table  S3. For the ML analyses, branch support of two datasets was estimated using the rapid bootstrap method in RAxML with 1000 replicates. Analyses with the software MrBayes were conducted in two simultaneous runs, each with four chains, for 10 million generations, and trees being sampled every 1000 generations. The first 25% were discarded as burn-in, and the remaining trees were used to calculate Bayesian posterior probability (BPP) values. Values of the Potential Scale Reduction Factor (PSRF) approaching 1.0 suggested that the runs reached convergence.

Phylogenetic Analyses under Site-Heterogeneous Models
Substitution saturation is recognized as one of the primary obstacles for deep phylogenetic inference, and removing sites that have experienced multiple substitutions would make for erratic phylogenetic estimates [52]. In order to correctly retrieve the phylogenetic signals of pattern-heterogeneity from mitogenomic sequence data, we performed Bayesian inference analyses under the site-heterogeneous model CAT + GTR for three datasets (Pnuc6, Paa7, and Paas7), as implemented in PhyloBayes 4.1c [46]. Two independent chains of 5000 cycles were run for each analysis, with one point every five samples. The initial 1000 trees sampled in each MCMC run were discarded as burn-in after checking for convergence using bpcomp (max_diff < 0.3). The 50% majority-rule consensus tree and the associated posterior probabilities (PPs) were then computed using all chains.
Bayesian cross-validation [53] was used to compare the fit of site-homogeneous (LG and GTR) and site-heterogeneous (CAT-mtREV and CAT-GTR) models as implemented in PhyloBayes 4.1c [46]. Ten replicates were conducted, 1100 sampling cycles were run and the first 100 samples were discarded as burn-in. Fast-evolving sites for Paa7 matrix were identified using the discrete gamma rate category to which they belong using TreePuzzle v5.3 [47], and the sites belonging to the most rapidly evolving gamma category were removed.

Characteristics of L. grossipedia Mitogenome
The complete mitogenome of L. grossipedia is 15,023 bp in length (GenBank accession number: OP746066). This is the first completely sequenced mitogenome in the order Laevicaudata. All of the 37 typical animal mitochondrial genes were identified, consisting of 13 PCGs, 22 tRNAs, two mitochondrial ribosomal RNAs (rrnS and rrnL) and a putative control region (Table 1). Twenty-three genes were encoded by the majority strand (J-strand) and fourteen by the minority strand (N-strand). Gene arrangement of the branchiopod mitogenomes was considered to be rather well-conserved, although several events of translocation, inversion, tandem duplication and random loss have occurred [33,34]. We found two gene rearrangement phenomena in the mitogenome of L. grossipedia: (1) the local inversion of trnI, and (2) the remote inversion of trnL 1 . The latter observed at the nad1-rrnL junction was the dominant gene rearrangement event in Branchiopoda. In addition to the control region, the mitogenome of L. grossipedia had 181 bp of intergenic nucleotides in 13 different locations, with intergenic spacer lengths ranging from 1 to 63 bp. The longest intergenic spacer was located between trnG and nad3 (Table 1). In the L. grossipedia mitogenome, ATN codons initiated all PCGs. Six PCGs used TAA/TAG as the termination codons, while truncated termination codons (T) was observed in the other seven genes (Table 1). For L. grossipedia, the AT content of the complete genome, PCGs, rRNA, tRNA and control region were greater than those of Leptestheria brevirostris Barnard, 1924, 75%, 73.6%, 76.8%, 75.8% and 81%, respectively ( Table 2). The highest AT content occurs in the third codon position of PCGs (84.3%). Atp8 had a very high AT content (83.7%), while the lowest AT content was found in cox1 (66.8%). The AT contents of L. grossipedia were the highest when compared with other species of Branchiopoda, showing an obvious AT mutation bias. In most branchiopods, the mitogenome has a positive AT-skew and negative GCskew [37,39]. L. grossipedia and Leptestheria brevirostris exhibited a similar pattern ( Table 2). The nucleotide skew statistics for the mitochondrial genomes of L. grossipedia analysed in the present study also indicated the following: (1) the AT-skew of all PCGs was negative (−0.32~−0.07); (2) the GC-skew was positive and the AT-skew of each PCG on the minority strand was negative, whereas both the GC-skew and AT-skew of each PCG on the majority strand were negative (except cox1) ( Table 2). Table 2. Nucleotide composition and skewness levels of L. grossipedia (Laevicaudata)/Leptestheria brevirostris (Spinicaudata).

Regions
Nucleotide

Levels of Substitutional Saturation and Heterogeneous Sequence Divergence within Branchiopod Mitogenomes
The third codon positions were saturated for all genes, and about half of the first and second codon position also showed significant levels of saturation (Table S4). Therefore, they were not considered for further phylogenetic analyses.
The value of K a was low for Notostraca (0.24~0.25), Spinicaudata (0.26~0.27) and Cladocera (0.25~0.27), but generally higher for Anostraca (0.32~0.38) and Laevicaudata (0.32 ± 0.01), which suggested that Anostraca and Laevicaudata had relatively higher evolutionary rates among Branchiopoda. We analysed the compositional heterogeneity of both the nucleotides and amino acids of 13 PCGs across branchiopod suborders. There was considerable variation in the AT content of mitogenomes within branchiopods (47.8%~83.7%), and Laevicaudata had the highest AT% by a high margin (Table 3). There was considerable variation in GC-encoding GARP amino acids of the mitochondrial genome within Branchiopoda (range: 14.37%~18.78%; mean: 17.66%; standard deviation: 1.24), and Laevicaudata had the lowest GARP%. Our observation showed a high degree of compositional heterogeneity among branchiopod mitogenomes in both nucleotide and amino acid level, which led to systematic error in phylogenetic analyses [11,[54][55][56]. To test the homogeneity of substitution pattern, we made 903 pairwise comparisons to calculate the I D . When we compared the concatenated 13 PCGs, 719 comparisons had a statistically significant heterogeneous substitution pattern, suggesting that the substitution pattern evolved multiple times. The I D test on each 13 PCGs also suggested a high level of variation in the substitution patterns among different genes ( Table 3). The null hypothesis that sequences have evolved with the same pattern of substitution was rejected (α < 0.01), although a correlation was observed between the level of variation in the substitution patterns and the gene lengths.

Phylogenetic Analyses Using Standard Homogeneous Models
Homogeneous analyses of either nt or aa data yielded maximal support for the monophyly of Anostraca and Phyllopoda (ML nt&aa -BS = 100%; Figure 1). Phyllopoda included four major groups (Cladocera, Spinicaudata, Laevicaudata and Notostraca), and Laevicaudata was resolved as a sister to the remaining phyllopods. However, the relationships among these four groups differed based on different datasets: monophyletic Notostraca was resolved as a sister to Onychocaudata when inferences were drawn from nucleotide data (ML nt&aa -BS = 95%; Figure 2a), whereas Notostraca occupied a sister position to Spinicaudata based on the amino acid data (ML nt&aa -BS = 62%; Figure 2b), which were consistent with the four-cluster likelihood mapping analyses (nt: 57.1% and aa: 72.1%, Figure 2c,d). These results, based on site-homogeneous analyses, were congruent with previous mitogenomes analyses [37], but not consistent with the currently accepted sister group relationship between Laevicaudata and Onychocaudata [26,29,33]. The conflict was not resolved by the Bayesian and ML analyses under site-homogenous models with a partitioning scheme for both Pnuc6 and Paa7 datasets, each matrix supporting similar trees ( Figure S1) to those presented in Figure 1.

Reducing Compositional Heterogeneity in Sequence Data
Phylogenetic analyses of the individual mitochondrial genes and proteins and PPA test demonstrated that both the nucleotide composition of all 13 PCGs and the amino acid composition of six among the 13 mitochondrial proteins violated the assumptions of the CAT model (Table 4), indicating that compositional bias was usually a genome-wide phenomenon [54]. ure 2c,d). These results, based on site-homogeneous analyses, were congruent with previous mitogenomes analyses [37], but not consistent with the currently accepted sister group relationship between Laevicaudata and Onychocaudata [26,29,33]. The conflict was not resolved by the Bayesian and ML analyses under site-homogenous models with a partitioning scheme for both Pnuc6 and Paa7 datasets, each matrix supporting similar trees ( Figure S1) to those presented in Figure 1.  amino acid datasets exhibit incongruence by the monophyly of Onychocaudata. (c) and (d) are results based on the PCG12 and PAA datasets respectively using the four-cluster likelihood mapping analyses. For each matrix, the sequences are divided into four clades: (a) Anostraca; (b) Notostraca; (c) Laevicaudata; (d) Onychocaudata. The corners of the triangles represent the three alternative topologies.

Reducing Compositional Heterogeneity in Sequence Data
Phylogenetic analyses of the individual mitochondrial genes and proteins and PPA test demonstrated that both the nucleotide composition of all 13 PCGs and the amino acid composition of six among the 13 mitochondrial proteins violated the assumptions of the CAT model (Table 4), indicating that compositional bias was usually a genome-wide phenomenon [54].  The sequence heterogeneity analysis showed that Laevicaudata exhibited significantly higher heterogeneity than the other branchiopods. Laevicaudata being resolved as the basal clade of Phyllopoda with high support was most likely due to artifactual phylogenetic inferences, probably resulting from the high degree of heterogeneity. If we excluded Laevicaudata, no inferences can be made about the relationships of this taxon. Accordingly, we applied three exclusive uses of homoplastic sites methods to reduce the potential impact of compositional heterogeneity on phylogenetic inference. Using the concatenated sequences of the 13 PCGs or 13 mitochondrial proteins for phylogenetic inference was proven to be not impactful in reconciling model misspecification (Table 4). When only seven mitochondrial proteins or six PCGs with the lowest Z scores were used for the phylogenetic analyses, the Z scores were reduced, but the compositional heterogeneity was still significant (Z = 7.27 for 7 proteins; Z = 5.00 for 6 PCGs). However, when only the Paas7 matrix was used for phylogenetic analyses, the CAT model was no longer violated (p = 0.09, Z = −1.48). Therefore, the compositional heterogeneity in the concatenated sequences of mitochondrial proteins could be reduced to a degree that the CAT model was no longer violated.
Bayesian inference from the Pnuc6 dataset under a site-heterogeneous model recovered the monophyly of Diplostraca, but with low probability (BI nt -PP = 0.71; Figure 2a). The result indicated that the high support for Laevicaudata as the earliest branch of Phyllopoda under site-homogeneous models was partly due to among-lineage compositional bias. In contrast, the analysis of the Paa7 matrix using the CAT-mtREV+Γ 4 mixture model model did not support the monophyly of Diplostraca, and the result supported Laevicaudata as a sister to the rest of Phyllopoda (Figure 2b), as did site-composition homogeneous models. When we removed fast-evolving sites (46.45%) from Paa7 matrix, the monophyly of Diplostraca was recovered with high support (BI nt -PP = 0.96; Figure 2c), confirming that the removal of the fast-evolving positions increased the ratio of phylogenetic to nonphylogenetic signal [58]. This observation also implied that compositional heterogeneity and fast-evolving positions in the amino acid datasets were two sources of phylogenetic artifacts and nonstationary heterogeneous composition model showed significant improvements over site-homogenous models in the phylogenetic reconstruction.

Pervasiveness of Phylogenetic Conflicts
In this study, several datasets were utilized to test basal relationships of Diplostraca and to compare the results of phylogenetic inference from different evolutionary models. Nevertheless, the standard phylogenetic methods consistently failed to uncover the correct phylogeny (Figure 1 and Figure S1). High nodal supports concealed the pervasiveness of phylogenetic conflicts. The non-monophyly of Diplostraca supported by these analyses was an artifactual result overturning a key relationship supported by morphological cladistic studies [26][27][28]31] and phylogenomic analyses [29,30,32,59]. The monophyly of Diplostraca was supported based on the list of supporting synapomorphies, such as bivalved carapaces in adults, larvae with small and budlike first antennae, highly modified first male thoracopod pair for clasping females and trunk limb exopods in adults with long dorsal lobes [31]. Furthermore, the recovery of the monophyly of Diplostraca through the exclusive use of homoplastic sites and the application of site-heterogeneous models (Figure 2a) confirmed that the non-monophyly of Diplostraca was an artifact. In FcLM analyses based on nucleotide sequences, the majority of quartets supported Notostraca as the closest relatives of Laevicaudata ( Figure 3). This is congruent with part of the current results (Figure 2a and Figure  S1a). In FcLM analyses based on amino acid sequences, the majority of quartets supported Notostraca as the closest relatives of Onychocaudata ( Figure 4). This is again congruent with part of our results (Figure 2b and Figure S1b). Using amino acid sequences or the removal of fast-evolving sites was considered an efficient approach to reduce systematic errors and to resolve deep relationships [60][61][62]. However, the quartet puzzling analysis plotted the probability of the preferred: (((Laevicaudata, Onychocaudata), Notostraca) topology, Anostraca) and the probabilities only ranged from 1.1% to 26.4% (Figures 3 and 4). Measurement of phyologenetic signal showed 0.8% of unresolved quartets and 13.1% of partly resolved quartets presented in the Paas7 matrix (excluding 46.45% sites), and quartet support for preferred topology was still low (23.2%). These results suggested that the phylogenetic signal for a deep relationship between Laevicaudata and Onychocaudata was always weak and differed among amino acid datasets ( Figure 3). These findings could be explained by the decay of the phylogenetic signal or a limited signal in the mitogenomic sequences. The limitations of mitogenomes applied in deep phylogeny of Arthropod have already been pointed out [63] and emphasized [1,[4][5][6]64]. When the non-phylogenetic signal was higher than the phylogenetic signal due to mutational saturation, high AT-content, parasitic life-styles or explosive radiation events, considerable systematically erroneous relationships were recovered [6]. Our analyses confirmed these conclusions. ues Mol. Biol. 2023, 3, FOR PEER REVIEW 12

Heterogeneity and Tree Topology
The comparisons of AT% between Laevicaudata and Spinicaudata (Table 2), the ID test (Table 3) and the PPA test (Table 4) demonstrated a high degree of compositional heterogeneity among branchiopod mitogenomes at both the nucleotide and amino acid

Heterogeneity and Tree Topology
The comparisons of AT% between Laevicaudata and Spinicaudata (Table 2), the I D test (Table 3) and the PPA test (Table 4) demonstrated a high degree of compositional heterogeneity among branchiopod mitogenomes at both the nucleotide and amino acid levels, which could bias the phylogenetic inference. The CAT+GTR model used in Bayesian inference analyses, implemented in PhyloBayes 4.1c [46], was chosen for its superiority in accommodating site-heterogeneous patterns of molecular evolution [11,12]. However, the Bayesian analysis under the site-heterogeneous model for the amino acid dataset (Paa7 matrix) recovered a topology almost identical to the phylogenetic analysis under the sitehomologous model for the amino acid dataset of the PAA matrix (Figures 1b and 2b). This suggested that the homoplasy in amino acid datasets was not only due to compositional heterogeneity considered in site-heterogeneous model. When we applied amino acid recoding to our datasets (removing fast-evolving sites from Paa7 matrix; the Paas7 matrix) combined with the site-heterogeneous model, the monophyly of Diplostraca was recovered correctly (Figure 2c). Our study demonstrated that removing fast-evolving sites could be an effective method to overcome the among-site rate heterogeneity from nonstationarity [65,66]. Although phylogenetic resolution to the monophyly of Diplostraca was improved when we applied a variety of strategies to reduce the effects of saturation and heterogeneity, the deep relationships within Diplostraca were not fully resolved.
The sum of these analyses suggested that the phylogenetic resolution of Diplostraca using mitogenomes was trapped by conflicting phylogenetic signals existing across different genes, which in turn was aggravated by compositional heterogeneity and among-site rate heterogeneity. The phylogenetic signal and the potential influence of non-phylogenetic signal should be independently evaluated when mitogenomic datasets were applied in deep phylogeny.

Conclusions
In this study, we extensively dissected the potential sources of non-phylogenetic signal that resulted in high support but incorrect phylogenies when mitogenomes were applied in deep phylogeny. We identified significant compositional heterogeneity in both the nucleotide and amino acid datasets. Phylogenetic analyses under site-homogeneous models suggested that topological conflict at the base of Phyllopoda were retained across all datasets, even with the exclusion of the genes with the most strongly deviating compositions. Bayesian inference under the site-heterogeneous CAT-GTR+Γ 4 mixture model using the nucleotide dataset (Pnuc6) recovered the monophyly of Diplostraca. However, it is limited for the amino acid dataset, regardless of minimization of model violation. Although slow-evolving sites of the amino acid dataset (Paas7) under the site-heterogeneous model revealed the monophyly of Diplostraca with high support, the deep relationships among Laevicaudata, Spinicaudata and Cladocera were not fully resolved, which demonstrated systematic conflicts in phylogenetic signal. The results of FcLM analyses confirmed the systematic conflicts and revealed that the phylogenetic signal for deep relationship between Laevicaudata and Onychocaudata was significantly weaker than the nonphylogenetic signal across all datasets. Future analyses including the mitogenomes of the other laevicauatan species are needed to achieve a more complete understanding of the evolutionary history of Diplostraca by identifying more basal branches.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cimb45020054/s1, Figure S1: Phylogenetic trees of Branchiopoda (colour-coded) obtained with RAxML and MrBayes inferred from the datasets of Pnuc6 (a) and Paa7 (b). Supports at nodes are ML bootstrap values and Bayesian posterior probabilities; Table S1: List of primer combinations used to amplify the mitochondrial genome of L. grossipedia; Table S2: Details of species and mitogenomes of Branchiopoda used in this study; Table S3: Partition schemes and best-fitting models for phylogenetic analyses; Table S4: Results of the test for substitution saturation. The references  are cited in the Supplementary Materials.
Author Contributions: Conceptualization, X.S. and J.C.; writing-original draft preparation, X.S.; writing-review and editing, J.C. All authors have read and agreed to the published version of the manuscript.