Protein Structure, Models of Sequence Evolution, and Data Type Effects in Phylogenetic Analyses of Mitochondrial Data: A Case Study in Birds

: Phylogenomic analyses have revolutionized the study of biodiversity, but they have revealed that estimated tree topologies can depend, at least in part, on the subset of the genome that is analyzed. For example, estimates of trees for avian orders differ if protein-coding or noncoding data are analyzed. The bird tree is a good study system because the historical signal for relationships among orders is very weak, which should permit subtle non-historical signals to be identiﬁed, while monophyly of orders is strongly corroborated, allowing identiﬁcation of strong non-historical signals. Hydrophobic amino acids in mitochondrially-encoded proteins, which are expected to be found in transmembrane helices, have been hypothesized to be associated with non-historical signals. We tested this hypothesis by comparing the evolution of transmembrane helices and extramembrane segments of mitochondrial proteins from 420 bird species, sampled from most avian orders. We estimated amino acid exchangeabilities for both structural environments and assessed the performance of phylogenetic analysis using each data type. We compared those relative exchangeabilities with values calculated using a substitution matrix for transmembrane helices estimated using a variety of nuclear- and mitochondrially-encoded proteins, allowing us to compare the bird-speciﬁc mitochondrial models with a general model of transmembrane protein evolution. To complement our amino acid analyses, we examined the impact of protein structure on patterns of nucleotide evolution. Models of transmembrane and extramembrane sequence evolution for amino acids and nucleotides exhibited striking differences, but there was no evidence for strong topological data type effects. However, incorporating protein structure into analyses of mitochondrially-encoded proteins improved model ﬁt. Thus, we believe that considering protein structure will improve analyses of mitogenomic data, both in birds and in other taxa.


Introduction
The accumulation of molecular data has revolutionized our ability to understand biodiversity, especially since the dawn of the phylogenomic era approximately 20 years ago [1,2]. However, phylogenomics has also revealed that many conflicting signals can emerge when different parts of the genome are analyzed [3]. It has long been appreciated that there are a variety of processes that can create genuine discordance among gene trees [4,5] and the ability to collect large amounts of data that can capture the variation among gene trees has led to a paradigm shift in systematics [6]. In fact, mathematical models that describe discordance due to the multispecies coalescent, arguably the most prominent source of genuine conflicts among gene trees, are now quite mature [7,8]. However, efforts to estimate species trees and to understand the amount of genuine discordance among gene trees are complicated by two sources of error: stochastic and systematic error [3]. Stochastic error is a simple consequence of the fact that all results The mitochondrially-encoded subset of the animal proteome might be a useful "model system" for the study of protein structure data type effects. A classic study by Naylor and Brown [33,34] (hereafter NB) showed that different topological signals are associated with distinct subsets of amino acids in mitochondrially-encoded proteins. More specifically, NB found that sites dominated by hydrophobic residues had a poor fit to a number of strongly corroborated relationships in the vertebrate species tree based on the maximum parsimony (MP) criterion. This suggests that mitochondrially-encoded proteins will exhibit a structural data type effect because all proteins encoded by vertebrate mitogenomes are transmembrane proteins [35,36] and hydrophobic amino acids are concentrated in transmembrane (TM) helices. Thus, we expect exhibit distinct topological signals to be evident if we define TM helices and extra-membrane (ExM) loops as the two data types to consider. The central question is how to detect that data type effect, if it exists, in other taxonomic groups. The "known phylogeny" approach, used by NB, suffers from the fact that any phylogeny that can be viewed as "known" is likely to be characterized by a strong historical signal (i.e., it will have many site patterns that support bipartitions in the true tree). After all, it is the existence of a strong historical signal that provides the corroboration of relationships that causes systematists to view the phylogeny as known. Unless the non-historical signal(s) (site patterns that support bipartitions that are not present in the true tree) are equally strong they are likely to be overwhelmed by strong historical signals, rendering weak non-historical signals essentially undetectable. Thus, the ideal datasets to examine for data type effects are those for which the historical signal is very weak; the relationships among avian orders (Figure 1) represent such a phylogeny.
Takezaki and Gojobori [37] challenged the broader implications of the NB results by showing that using models of evolution that incorporate among-sites rate variation ameliorates the poor fit of the hydrophobic residues to vertebrate phylogeny. Virtually all of the programs currently used in modern phylogenetic analyses, such as the fast maximum likelihood (ML) program IQ-TREE [38], implement models that incorporate among-sites rate heterogeneity. Although this suggests that relatively simple model improvements might eliminate the data type effect implied by the NB results, they do not necessarily indicate that adding among-sites rate heterogeneity to analytical models in the most straightforward manner (the discrete approximation to the Γ distribution [39]) will be a panacea for topological errors in analyses of mitogenomic data. Indeed, more recent studies indicate that the details of the rate-heterogeneity model can have an impact on estimates of phylogeny for mitogenomic data [40,41]. Moreover, many phylogenetic analyses of metazoan mitogenomes have revealed evidence of systematic biases [42][43][44][45][46][47][48][49] and the sources of those errors is far from clear.
In addition to their potential to improve phylogenetic estimation, models of sequence evolution can provide insights into the underlying processes of molecular evolution [31]. Examining the evolution of TM and ExM sites in a broadly sampled set of mitogenomes (in this study, sampled from birds) has the potential to yield a number of insights. When the results of Jones et al. [50] and Liò and Goldman [51] (which largely reflect nuclear-encoded TM proteins) are considered in light of the support for different relative exchangeabilities of amino acids in distinct structural environments [17,[29][30][31], it seems likely that analyses focused on mitochondrially-encoded proteins will yield evidence of model differences between data types. If those model differences result in model misspecification for at least one of those data types, we might find evidence for strong data type effects (strong support for clades that conflict with the monophyly of the strongly corroborated avian orders), weak data type effects (strong topological conflicts for the weakly-supported relationships among orders), or both.

Figure 1.
Consensus phylogeny of birds based on phylogenomic data. This cladogram reflects a recent phylogenomic supertree analysis [52] modified based on the results of two more recent phylogenomic studies [14,53]; relationships that are highly uncertain are presented as polytomies. Most terminal taxa correspond to orders as defined in the IOC World Bird List v. 6.1, with the exception of the IOC Caprimulgiformes (clade V) where we used the ordinal definitions of Chen et al. [54,55]. These ordinal definitions are strongly corroborated so we view their monophyly as "known." Roman numerals indicate the "magnificent seven" superordinal clades defined by Reddy et al. [10]; the historical signal uniting the magnificent seven is weak, but they are relatively well corroborated. The dashed line highlights an exception; support for the position of Musophagiformes is especially weak [14], this is not relevant to the present study given our taxon sample. Three additional clades are indicated using letters: "N" (within Palaeognathae) indicates Notopalaeognathae (non-ostrich paleognaths [56]); "D" (within clade V) indicates Daedalornithes (owlet-nightjars, swifts, and hummingbirds [57]); and "E" (within Passeriformes) indicates Eupasseres (all passerines except the New Zealand wrens [58]). Relationships within two selected orders are also shown; they were chosen because they highlight relationships where the positions of taxa in published mitochondrial phylogenies differed from the position in nuclear phylogenies [42,59,60]. Orders and families without a complete (or nearly complete) mitogenome sequence included in this analysis are presented in gray.
Here, we conducted a study motivated by the classic NB studies and previous work on models of TM protein evolution [50,51]. We generated an aligned data matrix comprising the 12 proteins encoded by the heavy strand of the avian mitogenome sampled from 420 bird species, annotated the alignment with structural information, and used those data to examine three predictions that emerge when the NB studies are considered. First, we predicted that if we use the 20-state general time-reversible (GTR20) model to estimate the relative exchangeabilities of amino acids in TM versus ExM environments we would Consensus phylogeny of birds based on phylogenomic data. This cladogram reflects a recent phylogenomic supertree analysis [52] modified based on the results of two more recent phylogenomic studies [14,53]; relationships that are highly uncertain are presented as polytomies. Most terminal taxa correspond to orders as defined in the IOC World Bird List v. 6.1, with the exception of the IOC Caprimulgiformes (clade V) where we used the ordinal definitions of Chen et al. [54,55]. These ordinal definitions are strongly corroborated so we view their monophyly as "known." Roman numerals indicate the "magnificent seven" superordinal clades defined by Reddy et al. [10]; the historical signal uniting the magnificent seven is weak, but they are relatively well corroborated. The dashed line highlights an exception; support for the position of Musophagiformes is especially weak [14], this is not relevant to the present study given our taxon sample. Three additional clades are indicated using letters: "N" (within Palaeognathae) indicates Notopalaeognathae (non-ostrich paleognaths [56]); "D" (within clade V) indicates Daedalornithes (owlet-nightjars, swifts, and hummingbirds [57]); and "E" (within Passeriformes) indicates Eupasseres (all passerines except the New Zealand wrens [58]). Relationships within two selected orders are also shown; they were chosen because they highlight relationships where the positions of taxa in published mitochondrial phylogenies differed from the position in nuclear phylogenies [42,59,60]. Orders and families without a complete (or nearly complete) mitogenome sequence included in this analysis are presented in gray.
Here, we conducted a study motivated by the classic NB studies and previous work on models of TM protein evolution [50,51]. We generated an aligned data matrix comprising the 12 proteins encoded by the heavy strand of the avian mitogenome sampled from 420 bird species, annotated the alignment with structural information, and used those data to examine three predictions that emerge when the NB studies are considered. First, we predicted that if we use the 20-state general time-reversible (GTR 20 ) model to estimate the relative exchangeabilities of amino acids in TM versus ExM environments we would find evidence for very different parameter values. This prediction is already corroborated by other studies focused on transmembrane protein evolution [50,51], so it is very likely to be true. However, we can make a more specific prediction regarding the patterns we are likely to see in our estimated rate matrices: we predicted that relative exchangeabilities for pairs of amino acids that are rare in a particular structural environment would be elevated in mitochondrially-encoded proteins because this has already been shown for globular proteins [31]. Second, we expected phylogenetic analyses of the ExM loops to perform better than analyses of TM helices. Since the relationships among avian orders are highly uncertain (Figure 1) we tested this prediction by examining the monophyly of orders (monophyly of avian orders as they are currently circumscribed is strongly corroborated; reviewed by Braun et al. [61]). Third, we expected different topological signals to emerge in phylogenetic analyses of each data type. Even if there were no strong non-historical signals, it seems likely that even very weak biases might perturb the highly uncertain portions of the bird tree ( Figure 1). We then used a mixture model framework to determine whether there were model violations that remained after estimating GTR 20 rate matrices for each data type. To complement our analyses of amino acid data, we analyzed the nucleotide sequences for each data type (including analyses conducted after RY-coding, in which the data are encoded as purines or pyrimidines). These analyses provided insights into the processes of molecular evolution for mitochondrially-encoded proteins and they have the potential to improve phylogenetic analyses of mitochondrial sequences, a major tool in the study of biodiversity.

Data Matrix Construction
We started with the alignment used by Nabholz et al. [62], which includes 92 taxa, identified gene boundaries and began adding annotated coding regions for each of the 12 proteins encoded on the heavy strand of the avian mitogenome. We added sequences from taxa with complete or nearly complete mitogenome sequences and the coding regions from one study [63] where the sequences for each gene were obtained separately from the same specimen. Sequences were aligned by eye because avian mitochondrial coding regions have few indels and they are easy to align. We did not construct chimeric sequences from multiple individuals. Ultimately, this resulted in a data matrix with 420 species. After translating the sequences we used the TM helix boundaries annotated for the chicken (Gallus gallus) in UniProt [64] to create a NEXUS charset [65] for the TM helices. Although the lengths of TM helices can vary depending on the tilt angle of the helix [66], their lengths are highly constrained by the width of the lipid bilayer. Thus, we believed that it was reasonable to assume that the sites were either associated with TM helices or ExM segments across all birds. These datasets are available as Supplementary File S1.

Analyses of Molecular Evolution and Phylogeny
We used IQ-TREE version 2.0.6 [38] for all tree estimation and we assessed support using the ultrafast bootstrap [67], with 1000 replicates. We used the Bayesian information criterion (BIC) [68] values calculated by IQ-TREE to identify the best-fitting model.
We analyzed three amino acid datasets (TM sites, ExM sites, and all sites) using the GTR 20 and mtVer [69] models. We accommodated among sites rate heterogeneity using a combination of invariant sites and Γ-distributed rates across sites. We used empirical amino acid frequencies (+F) for the mtVer. For the partitioned analysis, we fixed R matrix parameters at the values estimated using the separate TM and ExM alignments, which we call the bird mtTM model and bird mtExM model (hereafter, TM and ExM will be used as abbreviations for transmembrane and extramembrane sites while mtTM and mtExM will be used for the R matrices). The mixture model (bird mtMIX) was constructed using the bird mtTM and bird mtExM R matrices as the two mixture components with the rate Diversity 2021, 13, 555 6 of 26 of each mixture component set to a value proportional to the tree lengths (the sum of all ML branch length estimates) for each separate analysis; the relative rates (rounded to three decimal places) were mtTM = 0.918 and mtExM = 1.082. We assumed Γ-distributed rates to accommodate rate heterogeneity beyond that of the mixture component rates. We estimated mixture weights by ML and calculated the relative contributions to the site likelihoods using the -wslm option. We generated a generalized TM helix model to compare with the bird mtTM model; we generated this model (JTTtm) by using the DCMut method [70] method to convert the data in Jones et al. [50] into an R matrix. All R matrices (bird mtTM, bird mtExM, and JTTtm) are available in PAML format in Supplementary File S2 and https://github.com/ebraun68/protmodels (accessed on 26 September 2021). The bird mtMIX model is also available as a NEXUS models block, which can be read by IQ-TREE (this file includes unrounded values for the mixture component rates).
We conducted four analyses of nucleotide data, all of which were partitioned by codon position. As with the amino acid datasets, we analyzed three nucleotide datasets: (1) TM sites; (2) ExM sites; and (3) all sites. We conducted two analyses of the all-sites data, one using three partitions (the codon positions) and a second with six partitions (the three codon positions for TM sites and the three codon positions in the ExM sites). The same four analyses were conducted using binary (RY) versions of the three datasets. Since the IQ-TREE binary model uses 0 and 1 as character states, we actually coded the data as purines = 0 and pyrimidines = 1; we generated the binary data matrix using recodeRY.pl, available from https://github.com/ebraun68/RYcode (accessed on 26 September 2021).
We assessed the topological distances among trees using matching distances [71,72], calculated in PAUP* 4.0a169 [73]. We used the Kimball et al. [52] supertree (specifically, the matrix representation of the parsimony supertree from that paper) as our estimate of the avian species tree. To facilitate comparisons between estimates of the mitogenomic tree and the Kimball supertree, we reduced the trees to a set of 51 taxa, each of which represent major lineages that were monophyletic in the mitogenomic tree. All trees are included in Supplementary File S3. Taxa used for the comparison with the Kimball supertree are included in that file as a taxset. We visualized distances among trees by clustering the matching distances using neighbor joining [74]. The matrix of matching distances is available in Supplementary File S4.
We used a simple dataset subdivision similar to the Farris et al. [75,76] incongruence length difference (ILD) test to assess the differences between the TM and ExM data types. Briefly, we generated 100 randomly subdivided dataset pairs, where one data subset had the same number of sites as the TM sites and the other had the same number of sites as the ExM sites. The ILD test uses the sum of the MP treelengths for the optimal trees for each data subset as the test statistic; we eschewed the use of MP treelengths because they can confound topology and model. Instead, we used three different test statistics: (1) Euclidean distances between vectors of normalized R matrix parameters; (2) Euclidean distances between vectors of amino acid frequencies; and (3) topological distances (matching distances). This separates model differences (captured by two Euclidean distances) from topological differences. Euclidean distances were calculated using a program written by E.L.B. and available from https://github.com/ebraun68/protmodels (accessed on 26 September 2021). The use of dataset subdivision and model distances might be seen as yielding results similar to the BIC, but we believe it might have more power when the number of free parameters is large, many parameters are relatively constrained, and the set of parameters that differ is difficult to predict. This is the case for the GTR 20 model.

Do the mtTM (Transmembrane) and mtExM (Extramembrane) Models Differ?
We estimated relative exchangeability (R matrix) and amino acid frequency parameters for the TM and ExM sites using the GTR 20 and mtVer models (+I + Γ rate heterogeneity, see Methods); GTR 20 had a better fit to both datasets (∆BIC for TM = 677.0478 and ∆BIC for ExM = 861.4969). This suggests the relative exchangeability parameters for the two Diversity 2021, 13, 555 7 of 26 data types exhibit significant differences. We used a random dataset subdivision to determine whether that was true; we asked whether the distances between model parameters estimated using TM versus ExM sites exceeded our null expectation. Our null hypothesis was that the two data types are best described by very similar models (i.e., the model distances will be low). The observed distances between models for the TM and ExM sites fell outside the null distribution for the R matrices and for amino acid frequencies ( Figure 2). These results corroborated our first prediction (that the distances between estimated model parameters for TM and ExM sites were greater than expected by chance). ΔBIC for ExM = 861.4969). This suggests the relative exchangeability parameters for the two data types exhibit significant differences. We used a random dataset subdivision to determine whether that was true; we asked whether the distances between model parameters estimated using TM versus ExM sites exceeded our null expectation. Our null hypothesis was that the two data types are best described by very similar models (i.e., the model distances will be low). The observed distances between models for the TM and ExM sites fell outside the null distribution for the R matrices and for amino acid frequencies ( Figure 2). These results corroborated our first prediction (that the distances between estimated model parameters for TM and ExM sites were greater than expected by chance). Comparing our novel mtTM and mtExM models to other TM and mitochondrial models can provide insights into the patterns of molecular evolution for each data type. The parameters that are most obviously expected to differ between TM and ExM models are the amino acid frequency parameters and the existence of this difference is strongly corroborated by our random subdivision test ( Figure 2). As stated in the introduction, TM helices are expected to be enriched for hydrophobic residues whereas ExM segments will be enriched for polar residues. This is exactly what we observed when the bird mtTM and mtExM matrices were compared (the blue boxes in Figure 3 indicate cases where the two TM matrices have a higher amino acid frequency parameter than the bird mtExM matrix). All nine of the amino acids with an elevated amino acid frequency in bird mtTM that was elevated relative to mtExM had very low to moderate Grantham [77] polarity values; seven of those nine amino acids (L, I, F, W, C, M, and V) form a group at the very lowest end of the Grantham polarity scale (Supplementary File S2). Jones et al. [50] reported data for a TM helix mutation data matrix based on nuclear-and mitochondrially-encoded transmembrane proteins from a variety of taxa; we derived the JTTtm matrix ( Figure 3a) using their data. There were a few differences in the set of amino acids enriched in JTTtm versus those enriched in bird mtTM, but the set of amino acid frequencies in JTTtm that were elevated relative to bird mtExM (L, I, F, C, V, Y, A, and G) was quite similar to the set enriched in bird mtTM. Comparing our novel mtTM and mtExM models to other TM and mitochondrial models can provide insights into the patterns of molecular evolution for each data type. The parameters that are most obviously expected to differ between TM and ExM models are the amino acid frequency parameters and the existence of this difference is strongly corroborated by our random subdivision test ( Figure 2). As stated in the introduction, TM helices are expected to be enriched for hydrophobic residues whereas ExM segments will be enriched for polar residues. This is exactly what we observed when the bird mtTM and mtExM matrices were compared (the blue boxes in Figure 3 indicate cases where the two TM matrices have a higher amino acid frequency parameter than the bird mtExM matrix). All nine of the amino acids with an elevated amino acid frequency in bird mtTM that was elevated relative to mtExM had very low to moderate Grantham [77] polarity values; seven of those nine amino acids (L, I, F, W, C, M, and V) form a group at the very lowest end of the Grantham polarity scale (Supplementary File S2). Jones et al. [50] reported data for a TM helix mutation data matrix based on nuclear-and mitochondrially-encoded transmembrane proteins from a variety of taxa; we derived the JTTtm matrix ( Figure 3a) using their data. There were a few differences in the set of amino acids enriched in JTTtm versus those enriched in bird mtTM, but the set of amino acid frequencies in JTTtm that were elevated relative to bird mtExM (L, I, F, C, V, Y, A, and G) was quite similar to the set enriched in bird mtTM.
Differences in amino acid exchangeability (R matrix) parameters were also evident ( Figure 3). Polar-polar exchangeabilities (e.g., N-K, D-E, and N-D) were elevated relative to bird mtExM in both TM matrices whereas hydrophobic-hydrophobic exchangeabilities (e.g., I-V, M-V, and F-Y) were elevated in mtExM (Supplementary File S2). However, the largest relative exchangeability parameters in the mtTM matrix in absolute terms were not polar-polar; they were I-V and H-Y instead. The largest exchangeability in JTTtm was a polar-polar exchange (R-K), which also has a relatively high value in the bird mtTM matrix, albeit not to the same degree ( Figure 3). Regardless, it is clear that there are substantial  [69], which wa trained using all sites in mitochondrially-encoded proteins from diverse vertebrates. The TM mod els are inside the blue box and the mitochondrial models are inside the red box. All matrices wer normalized to have a maximum exchangeability of 100. Progressively darker shades of red are use for larger relative exchangeability values. Amino acid frequency parameters highlighted in blue i the TM models have values that are higher than the bird ExM amino acid frequency. These R ma trices available from https://github.com/ebraun68/protmodels (accessed on 26 September 2021) an in Supplementary File S2.
Differences in amino acid exchangeability (R matrix) parameters were also eviden ( Figure 3). Polar-polar exchangeabilities (e.g., N-K, D-E, and N-D) were elevated relativ to bird mtExM in both TM matrices whereas hydrophobic-hydrophobic exchangeabilitie (e.g., I-V, M-V, and F-Y) were elevated in mtExM (Supplementary File S2). However, th largest relative exchangeability parameters in the mtTM matrix in absolute terms wer

TM Helix and ExM Loops Tree Topologies: Stochastic Error, Not Data Type Effects
ML analyses of amino acid alignments of both data types yielded trees with similar treelengths but a large number of differences for the relationships among orders (Figure 4). The TM tree and the ExM tree both exhibited substantial conflict with the best available estimates of the bird tree ( Figure 1). Although this is consistent with the results of published broadly sampled mitogenomic trees of birds [62,78], it emphasized the fact that the additional taxon sampling in this study did not result in increased support.
Diversity 2021, 13, x FOR PEER REVIEW 9 of 27 substantial differences between models of TM helix versus ExM loop evolution, as expected based on our first prediction.

TM Helix and ExM Loops Tree Topologies: Stochastic Error, Not Data Type Effects
ML analyses of amino acid alignments of both data types yielded trees with similar treelengths but a large number of differences for the relationships among orders ( Figure  4). The TM tree and the ExM tree both exhibited substantial conflict with the best available estimates of the bird tree ( Figure 1). Although this is consistent with the results of published broadly sampled mitogenomic trees of birds [62,78], it emphasized the fact that the additional taxon sampling in this study did not result in increased support. Most tips reflect multiple taxa, with orders collapsed to yield a single tip whenever they were monophyletic. Cases where taxa in the same order were not recovered as monophyletic in at least one of the analyses (e.g., Accipitriformes, Suliformes, and Gruiformes) are presented as two or more tips with information regarding the subset of the order that the tip represents in parentheses. Boxes to the right of each tree indicate clades highlighted in the results. Complete trees with branch lengths and ultrafast bootstrap support for all branches are available as a Nexus format treefile in Supplementary File S3.
In contrast to our second prediction, neither data type appeared to perform substantially better based on the "known clade" criterion. Analysis of TM sites recovered Notopalaeognathae, Phasianidae + Odontophoridae, and Eupasseres whereas analysis of ExM sites recovered monophyly of the order Gruiformes and two magnificent seven clades: V (Strisores [55]) and VII (Mirandornithes [79]). Although there were cases where analyses of both data types yielded 100% support for specific clades, support for orders and other strongly corroborated clades was often surprisingly low (Table 1). Conducting a combined analysis of all sites often increased support relative to analyses of the individual Most tips reflect multiple taxa, with orders collapsed to yield a single tip whenever they were monophyletic. Cases where taxa in the same order were not recovered as monophyletic in at least one of the analyses (e.g., Accipitriformes, Suliformes, and Gruiformes) are presented as two or more tips with information regarding the subset of the order that the tip represents in parentheses. Boxes to the right of each tree indicate clades highlighted in the results. Complete trees with branch lengths and ultrafast bootstrap support for all branches are available as a Nexus format treefile in Supplementary File S3.
In contrast to our second prediction, neither data type appeared to perform substantially better based on the "known clade" criterion. Analysis of TM sites recovered Notopalaeognathae, Phasianidae + Odontophoridae, and Eupasseres whereas analysis of ExM sites recovered monophyly of the order Gruiformes and two magnificent seven clades: V (Strisores [55]) and VII (Mirandornithes [79]). Although there were cases where analyses of both data types yielded 100% support for specific clades, support for orders and other strongly corroborated clades was often surprisingly low (Table 1). Conducting a combined analysis of all sites often increased support relative to analyses of the individual data types, as expected if the primary reason for differences between the analyses of TM and ExM sites was increased stochastic error due to the smaller size of the data subsets. When there were conflicts between the analyses of the TM and ExM site, the combined analyses did not appear to agree with one subset more than the other (Table 1). Results were similar when analyses were conducted using the mtVer model (Supplementary File S3), although the fit of this model was not as good as the fit of the GTR 20 + I + Γ model (see above, Section 3.1).  1 We present ultrafast bootstrap support for clades present in the optimal tree and we have shaded support values when analyses of the data subsets disagree. In those cases, we shaded cells light gray if they agree with our best estimate of the avian species tree and we shaded cells black with white text if they conflict with our best estimate of the avian species tree. 2 Clades were included if they met one of these three criteria: (1) they were members of the "magnificent seven"; (2) they had <100% support in at least one analysis; or (3) they included a subclade that met the second criterion. 3 We have highlighted a small number of groups that are unlikely to be present in the avian species tree. The putative clades that are unlikely to be correct begin with (-) and are underlined. 4 Although some studies [12,80] have supported a Charadriiformes+Gruiformes clade we do not view that clade to be sufficiently corroborated to be scored in this table. Therefore, we designate these orders as "orphans" to indicate that they are not members of the "magnificent seven" superordinal clades.
There are two clades that could reflect data type effects based on the support values in Table 1: Notopalaeognathae and the Odontophoridae + Phasianidae clade. In both cases, there is conflict between the TM and ExM trees and support is higher in the TM tree than it is in the all-sites tree. This suggests that the topological signal in each data type actually conflicts. This pattern contrasts with Mirandornithes and Strisores; both of those clades are present in the ExM tree and absent in the TM tree but the all-sites tree had substantially higher support than the ExM tree. This suggests that there is hidden support [81,82] for both Mirandornithes and Strisores in the TM data). In all of the cases we highlighted, the TM tree includes a signal congruent with the likely topology (albeit mixed in the case of Mirandornithes and Strisores) of the true mitogenomic tree. This suggests that ExM sites might perform slightly worse than TM sites, which is the opposite of our prediction.
Our third prediction was that phylogenetic analyses of TM and ExM sites will yield significantly different tree topologies. It is possible to exclude the existence of strongly misleading data type effects because we did not recover strong support for any backbone relationships (Table 1 and Supplementary File S3). Despite the obvious differences between the tree topologies we recovered (Figure 4), the low support along the backbone and for many orders (Table 1 and Supplementary File S3) led us to postulate that the topological differences simply reflect the stochastic error associated with dividing the complete mitochondrial protein alignment into smaller sub-alignments for the TM and ExM sites. We calculated topological distances for the 100 randomly subdivided datasets used above. Unlike the case for model distances, the topological distance between the TM and ExM trees fell within the null distribution ( Figure 5), with analyses of nine of the 100 randomly subdivided dataset pairs yielding trees with higher matching distances. Although we acknowledge that the topological distance between the TM and ExM trees fell at the upper end of the null distribution and that much of the topological similarity between the TM and ExM trees appears to reflect nodes closer to the tips (284 out of 417 possible internal branches were present in a strict consensus of the TM and ExM trees), we believe that these results are best interpreted as evidence for strong stochastic error due to the reduced size of the TM and ExM data matrices. Thus, we were unable to corroborate our third prediction (that topological distances between trees estimated using TM versus ExM sites would be greater than expected by chance. lighted, the TM tree includes a signal congruent with the likely topology (albeit m the case of Mirandornithes and Strisores) of the true mitogenomic tree. This sugg ExM sites might perform slightly worse than TM sites, which is the opposite of diction.
Our third prediction was that phylogenetic analyses of TM and ExM sites w significantly different tree topologies. It is possible to exclude the existence of misleading data type effects because we did not recover strong support for any b relationships (Table 1 and Supplementary File S3). Despite the obvious differe tween the tree topologies we recovered (Figure 4), the low support along the b and for many orders (Table 1 and Supplementary File S3) led us to postulate that ological differences simply reflect the stochastic error associated with dividing t plete mitochondrial protein alignment into smaller sub-alignments for the TM a sites. We calculated topological distances for the 100 randomly subdivided datas above. Unlike the case for model distances, the topological distance between the ExM trees fell within the null distribution ( Figure 5), with analyses of nine of the domly subdivided dataset pairs yielding trees with higher matching distances. A we acknowledge that the topological distance between the TM and ExM trees fe upper end of the null distribution and that much of the topological similarity betw TM and ExM trees appears to reflect nodes closer to the tips (284 out of 417 possib nal branches were present in a strict consensus of the TM and ExM trees), we beli these results are best interpreted as evidence for strong stochastic error due to the size of the TM and ExM data matrices. Thus, we were unable to corroborate o prediction (that topological distances between trees estimated using TM versus E would be greater than expected by chance.

Is There Evidence for Heterogeneity within TM and ExM Sites?
One type of model misspecification might be the assumption of homogeneity within each data type implicit in our analyses. If the bird mtTM and bird mtExM matrices are good approximating models for each data type we would expect them to exhibit a better fit to the vast majority of sites within the appropriate data type (i.e., bird mtTM would fit TM sites better than bird mtExM and vice versa). It is straightforward to test this by fitting a two-component mixture model, with one component corresponding to bird mtTM R matrix and the second component corresponding to bird mtExM R matrix. Since there are clear differences between the models for TM and ExM sites (Figures 2 and 3) we expect the mixture model that combines bird mtTM and bird mtExM, which we call bird mtMIX, to fit the data better than a single matrix. This is precisely what we found (∆BIC for bird mtMIX relative to all-sites GTR 20 + I + Γ = 5045.4351).
It is possible to make two predictions about the behavior of the bird mtMIX model if the patterns of sequence evolution for TM and ExM sites differ between data types but are relatively homogeneous within each data type: 1) estimates of the mixture weights for each component will be close to the proportions of sites in each data type; and 2) the contributions of each mixture component to each site likelihood are expected to differ for the two data types and be largely non-overlapping (see Pagel and Meade [83] for an illustration of the second prediction). This is not what we found (Table 2 and Supplementary File S5); the ML estimate of the mixture weight for the bird mtTM model component was higher than the proportion of TM sites and the weight of the bird mtExM mixture component was lower than the proportion of ExM sites. The contributions of each mixture component to the site likelihoods (the lnL difference in Table 2) are in broad agreement with our second prediction. The median contribution of the bird mtTM mixture component to the likelihoods of TM sites was higher than the median contribution of the bird mtTM mixture component to ExM sites. However, there was a wider range of contributions of each model component to the site likelihoods than we expected. The bird mtTM mixture component made a surprisingly large contribution to the likelihood of many ExM sites. In fact, the median contribution of the bird mtTM mixture component was very close to zero, indicating that the bird mtTM mixture component actually makes the larger contribution to the likelihood of half of the ExM sites. Overall, these results indicate that the patterns of sequence evolution in heavy strand-encoded mitochondrial proteins is more complex than one might predict based on the straightforward assumption that sites have evolved under two models, one for TM sites and one for ExM sites. Estimates of phylogeny generated using partitioned analysis and mixture models were generally similar to the unpartitioned tree (Table 3). Unsurprisingly, both the partitioned analysis and use of the bird mtMIX model resulted in a better fit to the complete data matrix than the GTR 20 + I + Γ model with parameters estimated using all sites (∆BIC for partitioned analysis = 2480.0035; ∆BIC for bird mtMIX = 5045.4351). A strict consensus of the unpartitioned and partitioned trees had 377 resolved branches (90.4% of the potential branches) and a strict consensus of the unpartitioned and bird mtMIX tree had 383 resolved branches (91.8% of the potential branches). Support for various clades in the partitioned and bird mtMIX trees was generally similar to support in the unpartitioned all-sites tree (compare the values in Table 3 to the all-sites column in Table 1). Both partitioned analysis and use of the mixture model had an impact on branch length estimates; relative to tree resulting from the all-sites unpartitioned analysis, the bird mtMIX treelength was 1.148 and the partitioned analysis treelength was 1.235. The ratio of the sum of the internal branch lengths to the total treelength was virtually identical across analyses (31.98% for the bird mtMIX model, 32.019% for partitioned analysis, and 32.03% for the unpartitioned analysis). However, it will be necessary to conduct simulations to understand whether the branch length estimates based on analyses using mtMIX or partitioned analyses are closer to the true branch lengths.

Protein Structure Has an Impact on Analyses of Nucleotide and Purine-Pyrimidine Data
Arguably, mitochondrial sequence data have the greatest potential as sources of information for biodiversity studies near the tips of the vertebrate tree of life [84][85][86]. Thus, it would be desirable to assess the impact of protein structure on analyses of nucleotide data. For our partitioned analyses of the TM and ExM codons (three partitions, one for each codon position), the TM and ExM nucleotide trees exhibit a number of differences from the trees based on amino acid data (Table 4 and Supplementary File S3). We did not observe a simple pattern of either increased or decreased congruence with the likely species tree.  1 We have shaded support values when analyses of the data subsets disagree. Cells were shaded light gray if they agree with our best estimate of the avian species tree and black with white text if they conflict with our best estimate of the avian species tree. 2 We have added two groups that are unlikely to be correct because they appeared in nucleotide analyses and they relate to the topology for Palaeognathe (see discussion for additional information). 3 All Charadriiformes except Turnix sylvaticus form a clade with 100% support in the three-partition nucleotide analysis of all sites. The family Turnicidae (hemipodes) has a long branch in many analyses of molecular data [56,61,63].
The six-partition analysis of all sites (partitioning by structure and codon position) improved the fit to the data (∆BIC favoring the six-partition analysis = 1811.5226) relative to three partitions (partitioning by codon position alone). The six-partition tree exhibited a number of differences from the trees based on separate analyses of TM and ExM sites and the three partition all-sites tree. The most notable difference between the three partition and six partition trees was the non-monophyly of Charadriiformes and Gruiformes in the former and the strongly supported monophyly of those orders in the six-partition analysis (Table 4). That result was surprising because separate nucleotide analysis of TM and ExM data yielded trees with monophyly of Charadriiformes and Gruiformes. The estimated nucleotide frequencies for TM sites and ExM sites were very different ( Table 5), suggesting that the three-partition analysis resulted in model misspecification that, based on the topological results, had a meaningful impact on phylogenetic estimation. Table 5. ML estimates of base frequencies and relative partition rates in the analyses of nucleotide sequences. Recoding nucleotide data as two states (purines and pyrimidines; typically called RY-coding) has been used in a number of studies, especially those using mitochondrial data. In fact, RY-coding has resulted in very clear improvements to estimates of avian phylogeny when limited taxon samples are used [43]. When we used RY-coding for the four analyses conducted using nucleotide data (Table 6), we did observe several differences. One notable shift relative to four-state data was the support in Notopalaegnathae in the TM sites; however, other analyses (ExM sites, all sites/three partitions, and all sites/six partitions) all placed Rheiformes sister to other Palaeognathae, similar to some of the nucleotide analyses. Although this shift represented greater congruence with the species tree, we noticed that analyses of TM sites after RY-coding also resulted in the loss of Strigiformes monophyly (Table 6). This was surprising given the high support for Strigiformes in other analyses (Tables 1, 3, 4 and 6). Similar to the analyses using nucleotide data, the six-partition model had a better fit to the data than the three-partition model (∆BIC favoring the six-partition RY analysis = 303.0233). This is likely to represent the fact that the hydrophobic amino acids I, L, M, F, and V, which are enriched in the TM helices (Figure 3), have codons with T in their second position. These results emphasize that researchers should consider protein structure when conducting analyses of mitochondrial nucleotide sequences, regardless of whether or not they employed RY-coding.

Multiple Factors Shape the Tree Space for Analyses of Mitochondrial Proteins
We assessed the topological distances among our estimates of the mitogenomic tree for birds and between those trees and the likely species tree (represented by the Kimball et al. [52] supertree). It was necessary to reduce the taxon sample to compare our mitogenomic trees to the Kimball supertree. This limited the comparisons major clades, although we did capture all of the topological variation highlighted in the tables along with all relationships among orders. The matching distances between the mitogenomic trees and the Kimball supertree ranged from 131 to 200 ( Figure 5), much lower than expected for matches among random trees (the median matching distances for a sample of 1000 random trees was 428; 95% of comparisons fell in the range of 369-496). Thus, the topological distances between the Kimball supertree and the mitogenomic trees was between 31% and 47% of the expected distance for pairs of random trees. The ExM amino acid data clustered with the Kimball supertree, but the distance to the TM amino acid trees was only slightly higher in absolute terms ( Figure 6 and Supplementary File S4). The nucleotide trees clustered in tree space and the trees estimated using the same datasets (TM, ExM, and all sites) clustered regardless of whether they were two-state (RY) or four-state (unaltered nucleotide data) trees. The most striking pattern was the large distances among all estimates of the mitogenomic tree and between estimates of the mitogenomic tree and the Kimball supertree.  1 We have shaded support values when analyses of the data subsets disagree. Cells were shaded light gray if they agree with our best estimate of the avian species tree and black with white text if they conflict with our best estimate of the avian species tree.
Diversity 2021, 13, x FOR PEER REVIEW 17 of 27 clustered in tree space and the trees estimated using the same datasets (TM, ExM, and all sites) clustered regardless of whether they were two-state (RY) or four-state (unaltered nucleotide data) trees. The most striking pattern was the large distances among all estimates of the mitogenomic tree and between estimates of the mitogenomic tree and the Kimball supertree. Figure 6. Dendrogram generated by clustering topological distances for the major lineages. We viewed the Kimball et al. [52] supertree, which is a summary of phylogenomic studies, as an estimate of the species tree and included for comparison to the mitogenomic trees. The parenthetical number that follows each mitochondrial tree is the matching distance to the Kimball supertree. To facilitate visualization, the root of the tree has been placed at the midpoint. The complete distance matrix is available in Supplementary File S4.

Discussion
We addressed three hypotheses related to the potential relationship between the structure of mitochondrially-encoded proteins and the behavior of phylogenetic analyses, using a dataset comprising all 12 heavy strand-encoded proteins from 420 bird species. First, we corroborated our hypothesis that the relative exchangeabilities and equilibrium frequencies of amino acids would differ between TM and ExM environments. We also Figure 6. Dendrogram generated by clustering topological distances for the major lineages. We viewed the Kimball et al. [52] supertree, which is a summary of phylogenomic studies, as an estimate of the species tree and included for comparison to the mitogenomic trees. The parenthetical number that follows each mitochondrial tree is the matching distance to the Kimball supertree. To facilitate visualization, the root of the tree has been placed at the midpoint. The complete distance matrix is available in Supplementary File S4.

Discussion
We addressed three hypotheses related to the potential relationship between the structure of mitochondrially-encoded proteins and the behavior of phylogenetic analyses, using a dataset comprising all 12 heavy strand-encoded proteins from 420 bird species. First, we corroborated our hypothesis that the relative exchangeabilities and equilibrium frequencies of amino acids would differ between TM and ExM environments. We also found evidence that the bird mtTM model exhibited similarities to a general model of TM helix evolution (JTTtm). Moreover, the observed similarities between the bird mtTM and JTTtm models conformed to our expectations based on the analyses of buried versus solvent-exposed residues in globular proteins [31]. We did not corroborate our second hypothesis, that phylogenetic analyses of ExM loops would exhibit better performance in terms of topological estimation than analyses of TM helices (based on the NB observations). We found that some a priori expected clades emerged only in analyses of ExM sites and that others emerged only in analyses of TM sites. The overall support for many clades was also quite low. Third, we hypothesized that distinct topological signals would emerge in phylogenetic analyses of each data type. Although the trees based on each data type differed (Figure 4), it seems reasonable to postulate that stochastic error can explain the observed incongruence between the data types. The broad distribution of topological distances in the data subdivision test ( Figure 5) suggests that stochastic error played a large role in shaping the differences among trees. Overall, we concluded that the best models for TM and ExM sites were very different but found little or no evidence for topological data type effects in the mitochondrially-encoded proteins of birds.

Data Type Effects and Process Partitions
It has long been appreciated that patterns of evolution are heterogeneous, with distinct subsets of the genome and suites of morphological characters having the potential to exhibit different patterns of evolution. Bull et al. [11] defined process partitions as subsets of characters in a larger phylogenetic data matrix that evolved according to rules that differ from the other subset(s) in some demonstrable way. They provided a number of examples, such as (1) codon positions; (2) coding versus non-coding regions; (3) different genes and different regions within genes (including regions defined by the three-dimensional protein structure); (4) stems versus loops in ribosomal RNAs; and (5) nuclear versus organellar genes. As described in the introduction, the data type effects idea modifies this in two ways. The first is that data type effects exclude cases where discordance among gene trees can provide a simple explanation for any observed incongruence. Reddy et al. [10] explicitly excluded sex chromosome versus autosome comparisons from data type effects because different gene tree spectra are expected in such a comparison. Thus, it would certainly be inappropriate to view differences between a tree based on analyses of multiple nuclear loci and a tree based on a large non-recombining region such as the avian mitogenome [87,88] as a data type effect. On the other hand, it is reasonable to describe incongruence among estimates of phylogeny obtained using different subsets of sites in organelle genomes (or sex chromosomes) as data type effects.
The second criterion for data type effects is that multiple independent samples of each data type converge on trees in similar parts of tree space. It difficult to test this criterion in the same way as Reddy et al. [10] given the size of vertebrate mitogenomes, although observing similar topologies in jackknifed subsets of the TM and ExM sites would corroborate the hypothesis that different signals are associated with TM versus ExM sites. However, a prerequisite for such a test would be finding that the trees estimated using the TM and ExM sites are different enough to define two distinct parts of tree space. For that to be true the distance between the TM and ExM trees should exceed the expected distances between pairs of trees estimated using random subsets of the complete data matrix identical in size to the TM and ExM subsets; we did not meet that criterion.
A similar criterion can be used to judge distances between models, although such a test might not appear to yield information beyond the information available from standard model selection criteria such as the BIC. However, a random subdivision test might have an advantage relative to criteria such as the BIC for protein models. Most models of protein sequence evolution, such as the Dayhoff/PAM [89], JTT [90], LG [91], and mtVer [69] models, are fixed R matrices estimated based large training dataset, so they have no free R matrix parameters. In contrast, the GTR 20 model is very parameter-rich (it has 189 free R matrix parameters). However, many R matrix parameters are highly constrained (e.g., amino acid substitutions that require multiple nucleotide changes will have R matrix parameters equal to or close to zero). However, fixing those parameters at a value of zero is not a good solution; Kosiol et al. [92] and Pandey and Braun [31] used very different analytical frameworks but both showed that some amino acid substitutions that require multiple nucleotide changes are associated with values much larger than zero. Thus, when faced with the question of whether a potentially heterogeneous protein dataset is best described by a single R matrix or multiple R matrices, one may find conditions where optimizing all GTR 20 model parameters, including those constrained to be close to zero, cannot be justified using the BIC. However, a few important parameters might have very different values; dataset subdivision provides a simple method to determine whether this is the case. In this study, both the BIC and random subdivision corroborated the hypothesis that the best models for TM and ExM sites were significantly different whereas random subdivision revealed that topological data type effects are either very weak or non-existent.

Models of Transmembrane Protein Evolution and the NB Hypothesis
The new models of mitochondrial protein evolution we developed exhibit patterns consistent with the "rule of opposites," described by Pandey and Braun [31] for buried versus solvent-exposed residues in globular proteins. The rule of opposites is a statement that the most exchangeable amino acids in a specific structural environment are the less common amino acids in that environment. In this study, polar-polar exchanges were associated with the most elevated relative exchangeabilities in both of our new TM models (bird mtTM and JTTtm) and hydrophobic-hydrophobic exchangeabilities were the most elevated in mtExM. The rule of opposites applies to relative exchangeabilities (R matrix parameters) and not to instantaneous rates (Q matrix parameters). Therefore, the rule of opposites could reflect, at least in part, the time reversibility constraint. Pairs of rare amino acids require large exchangeabilities to explain even modest instantaneous rates of change between those amino acids. Pandey and Braun [31] proposed two mutually-exclusive verbal models regarding protein evolution relevant to the rule of opposites: (1) amino acids that are rare in a specific environment would not be exchangeable because they are necessary for specific functions; and (2) exchanges between pairs of amino acids that are rare in a specific environment are actually common (relative to their frequency) as long as the physicochemical nature of the amino acid is conserved. If the first model was correct it is necessary to invoke the high variance of R matrix parameters for pairs of low frequency amino acids. However, the first hypothesis also predicts that models based on some training datasets would not show evidence of the rule of opposites, at least for some exchangeabilities. Pandey and Braun [31] argued that their results for globular proteins, where they estimated parameters from seven different training datasets, favored the second model. This study shows that a similar pattern emerges in the bird mtTM model and in the more general JTTtm models, further corroborating the second verbal model.
Although we were able to improve the fit of evolutionary models to mitochondriallyencoded proteins by considering TM and ExM sites separately, we interpret the results of the mixture model analyses as evidence that there is substantial heterogeneity within each data type. All of the mitochondrially-encoded proteins of vertebrates are subunits within large multiprotein complexes that include nuclear-encoded, mitochondrially-localized proteins. This could cause some sites in the ExM loops to evolve under rules similar to those for buried sites in globular proteins, reflecting their contacts with other subunits. Since buried sites in globular proteins are enriched for hydrophobic amino acids [17,93] the existence of these sites could explain both the elevated estimate of the mtTM component mixture weight and the large contribution of the bird mtTM mixture component to the site likelihoods for some ExM sites (Table 2). Regardless, that heterogeneity suggests further improvements to models of sequence evolution have the potential to be useful, both for efforts to understand patterns of molecular evolution and for improving estimates of phylogenetic trees.
The topological distances between our estimates of the avian mitogenomic tree and the likely species tree (represented by the Kimball supertree; see Figure 6) were very high, ranging from 31% to 47% of the expected distance between pairs of random trees. The largest distance among estimated mitochondrial trees was even higher (the distance between the ExM amino acid tree and the three-partition RY tree for all sites was 250, 58% of the median distance between pairs of random trees). The simplest interpretation of these results is that they further emphasize the role of stochastic error in our estimates of the mitogenomic tree of birds. The clustering of nucleotide trees in tree space is likely due to the influence of information from synonymous substitutions on topology. The observation that there were three clusters (TM, ExM, and all sites) within the nucleotide trees suggests that deviations from stationarity in base composition did not lead to a strong topological signal. The hypothesis that the shifts base composition led to a strong topological signal would predict two clusters, one for four state data and one for two state6, because RYcoding reduces deviations from stationarity [45,94,95] under most conditions. Although we believe that our "tree-of-trees" is useful, we emphasize that it is simply a tool to reduce high-dimensional data (topological distances among trees) to facilitate visualization. Thus, the clustering of the ExM trees with the Kimball supertree in the dendrogram should not be overinterpreted. Although it could indicate that analyses of ExM sites perform better than those of TM sites (which would be consistent with the prediction based on NB), the long terminal branches provide evidence that any such effect is weak. Overall, the structure of the tree-of-trees supports two conclusions: (1) analyses of all datasets are very sensitive to the details of analytical methods (compare the distances between trees estimated using mtVer versus the optimized mtTM and mtExM models in Figure 6); and (2) stochastic error plays a large role in our estimates of the mitogenomic tree.
Using non-historical topological signals to study molecular evolution has one potential advantage over methods that focus on parameter estimation (e.g., the rate matrices in this study). If we assume model misspecification leads to non-historical signals, then it becomes possible to identify sites with a poor fit to the model used for the analysis finding sites associated with the unexpected topological signal. It is challenging to identify sites with poor fit to a model of sequence evolution in absolute (rather than relative) terms [96,97], making approaches that are able to highlight sites characterized by model violations useful. This was our goal when we searched for a topological data type effect associated with structural environments in mitochondrially-encoded proteins.
There are also challenges associated with using non-historical topological signals to study molecular evolution. First, there is always some uncertainty in empirical phylogenies, making it difficult to identify the true historical signal. Discordance among gene trees further complicates this issue because a tree that conflicts with a "known" species tree can also be explained by gene tree-species tree discordance. Second, the best method to reveal non-historical signals is unclear. NB used the MP criterion, a computationally efficient method [98] with a simple biological interpretation (MP treelength is the minimum number of changes for a character given a specific tree topology). However, the model implicit in MP has troubling mathematical properties when it is used with molecular data; (Holder et al. [99] describes the problems associated with branch lengths in MP-equivalent models). This motivated us to use a standard ML framework. Third, misleading signals, such as those identified by NB, might emerge only at certain depths in the tree of life. Our taxon sampling largely limited our topological assessments to clades that diversified in the lower Paleogene or upper Cretaceous (see Field et al. [100]); biases that might appear at other depths in the tree would not be detectable. Finally, substitutions responsible for misleading signals accumulate in a stochastic manner, just like substitutions responsible for historical signals. Even if one knew that a specific tree topology and evolutionary model is misleading in expectation (given a specific analytical method), analysis of a finite sample of sites generated under that model might not show evidence of the misleading topological signal (Kim [101] presents an example of a topology + model that exhibits this behavior in Figure 13 of that publication). However, that scenario would be expected to yield a dataset with very weak non-historical signals. Since it is impossible to distinguish a weak misleading signal from simple stochastic error, this could be the case for the avian mitogenomic tree, although that scenario requires more assumptions than a scenario involving stochastic error alone.

Implications for Avian Systematics and Evolution
Two additional questions emerge if we shift our focus from molecular evolution to the study of avian biodiversity: (1) What information about avian evolution does an accurate estimate of the mitogenomic tree of birds provide? (2) Have our analyses generated an accurate estimate of the avian mitogenomic tree? The first question is especially important in the era of genomics; whole-genome sequences for birds are now accumulating at an ever accelerating pace [102,103] and those data are being used to revolutionize phylogenetics [61]. Ultimately, the mitogenomic tree is a single gene tree and discordance among gene trees is known to be ubiquitous [6]. This is especially true for the diversification of avian orders at the base of Neoaves [12,104,105]. However, the mitogenomic tree is also an unusual gene tree because it is expected to be more congruent with the species tree than the average nuclear gene tree [106] and, when it differs from the species tree, that discordance can have important biological implications.
The unusual nature of the mitochondrial gene tree is likely to reflect, in large part, the maternal inheritance of the mitogenome. Maternal inheritance is expected to reduce the effective population size of the mitogenome relative to nuclear genes under many circumstances [107]. The ZW sex chromosome system of birds probably has an additional impact on avian mitochondrial population biology; Berlin et al. [108] suggested that Hill-Robertson interference [109] between the W chromosome and mitogenome can explain the low intraspecific variation observed in avian mitogenomes. Hickey [110] and Lane [111] suggested the opposite pattern (that selection on mitogenome leads to low W chromosome variation) is more likely; however, the locus of selection is actually irrelevant for birds because both scenarios reduce the effective population size of the mitogenome and therefore increase the likelihood of congruence with the species tree. In fact, both scenarios could be true in that selection on either the mitogenome or the W chromosome is expected to lead to selective sweeps that reduce variation on both genetic elements (obviously, the locus of selection would be relevant in taxa with different sex determination systems). On the other hand, some analyses support instances of genuine discordance between the mitogenomic tree and the species tree [112][113][114]. Although incomplete lineage sorting is likely to explain some instances of mitochondrial incongruence, mitochondrial capture probably explains many instances of discordance between the mitogenomic tree and the species tree [115]. Mitochondrial capture is likely to have a functional basis; introgression of a mitochondrial genotype is favored if it is better adapted to the local environment than the genotype of the recipient and/or when the mitochondrial genotype of the recipient taxon has a high mutational load. This creates two situations: (1) the true mitogenomic tree typically matches the species tree more closely than the true gene tree for a typical nuclear locus; and (2) genuine discordance between the species tree and mitogenomic tree can indicate interesting biological processes. Thus, an accurate estimate of the mitogenomic tree is likely to provide interesting information.
The second question was whether the evidence suggested that we were able to generate an accurate estimate of the mitogenomic tree; obviously, we were unable to do so. Although the incongruence among our estimates of mitochondrial phylogeny speak volumes, an even bigger problem is that concordance with the likely avian species tree is approximately evenly split between TM and ExM sites. This evaluation of the performance of analyses using a specific data type is predicated on the assumption that congruence between the estimated mitochondrial tree and the likely species tree indicates better performance. Although it is possible that any specific example of congruence is coincidental it is a virtual certainty that this will be true on average, especially in cases where the branch uniting a group is long in terms of the multispecies coalescent. For example, the branch uniting Notopalaeognathae is known to be long based on retroelement insertion data [116] (also note the high estimate of the concordance factor in Smith et al. [117]). Thus, one can place a very high prior probability that the true mitogenomic tree includes that clade and further leads us to the conclusion that analyses of the TM sites are correct (and those of ExM sites are incorrect) in this case. On the other hand, Mirandornithes is also recovered in a large number of individual gene trees [12,118], indicating that the branch uniting it is long in coalescent units. However, in this case, it is the ExM sites that yield a tree with Mirandornithes and TM amino that fail to do so (although analyses of TM nucleotides do yield the clade). We chose these examples because both are clades that appear in many nuclear gene trees and therefore it is very likely that the relevant clades are in the true mitogenomic tree. Indeed, it is likely that many of the clades that are present in at least some estimates of the mitogenomic tree as well as the likely species tree are present in the true mitogenomic tree. However, there is no clear pattern relating the topological signal in different structural environments to clades that are present in the species tree. This makes it impossible to assess the evidence for clades without reference to the species tree; this makes it impossible to identify genuine discordance between the true mitogenomic tree and the species tree.

Conclusions
The central conclusion of this study is that the best-fitting models of sequence evolution for TM versus ExM sites differ substantially but the tree topologies for those two data types exhibit few, if any, significant differences. Thus, we did not corroborate the NB hypothesis for the bird tree, at least for the parts of the tree we could examine given our taxon sample. Nevertheless, the conclusion that there are significant differences between the best-fitting models for TM and ExM sites suggests that it would be wise to incorporate information about these model differences into analyses of mitochondrial data. In general, better-fitting models will yield more accurate estimates of phylogeny and it seems reasonable to assert that better models of mitochondrial protein evolution will be useful in at least some parts of the tree. Even if the direct improvements to estimates of the topology for the mitogenomic tree are limited, incorporating differences related to protein structure into models of mitochondrial sequence evolution is likely to improve studies focused on shifts in the strength of purifying selection [62] or those focused on positive selection and convergence in mitochondrial proteins [119][120][121].
Despite our focus on a single gene tree, we believe our results have implications for the theory and practice of phylogenomics. Most modern phylogenomic studies combine gene trees estimated using many different loci to generate the species tree using summary coalescent methods, such as the method in the program ASTRAL [122] (although most studies also present trees estimated using other methods). Summary coalescent methods are unbiased when two conditions are met: (1) conflicts among gene trees reflect the multispecies coalescent; and (2) true gene trees are used as input [123]. However, they can be sensitive to gene tree estimation error [124][125][126][127]. Even when the species tree generated by a summary coalescent method has the correct topology, low-quality input gene trees lead to the underestimation of coalescent branch lengths [128]. This raises a profound question: how accurate are our estimates of gene trees? Although simulations provide some guidance, it seems likely that trees estimated from empirical data are often less accurate than trees based on simulated data. This study suggests that most estimates of the avian mitogenomic tree are inaccurate but many of these conflicts were uncovered because we incorporated information about mitochondrial protein structure. However, we have much less information about most loci used to generate gene trees. This study suggests that it would be valuable to incorporate more detailed information to better assess the accuracy of typical nuclear gene trees; structure is one such source of information for gene trees estimated using protein-coding regions.