# Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Overview

#### 2.2. Methods

#### 2.2.1. Maximum Likelihood Codes

#### 2.2.2. Disjoint Tree Mergers Pipelines

- Step 1: compute a starting tree
- Step 2: use the starting tree to decompose the set of sequences into disjoint sets
- Step 3: construct the trees on the different sets using a selected phylogeny estimation method, and
- Step 4: merge the trees using a DTM method along with some auxiliary information (e.g., guide tree or distance matrix)
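Step 2 of the pipeline above can be illustrated with a centroid-edge decomposition, the strategy named in the table captions and in Appendix C. The following is a minimal sketch and not the `decompose.py` script used in the study: it assumes the starting tree is given as an undirected adjacency dictionary, and the names `build_adjacency` and `centroid_decompose` are hypothetical. It repeatedly cuts the edge that splits the taxa most evenly until every subset has at most `max_size` taxa.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency dict from a list of (u, v) edges."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return dict(adj)


def centroid_decompose(adj, taxa, max_size):
    """Split `taxa` into disjoint subsets of at most `max_size` taxa by
    recursively removing the edge that balances the taxa most evenly."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    taxa = set(taxa)
    if len(taxa) <= max_size:
        return [sorted(taxa)]

    # Root the tree at an arbitrary node; record parents in DFS preorder.
    root = next(iter(adj))
    parent = {root: None}
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                stack.append(w)

    # Count taxa below each node (children are processed before parents).
    count = {v: (1 if v in taxa else 0) for v in order}
    for v in reversed(order):
        if parent[v] is not None:
            count[parent[v]] += count[v]

    # The centroid edge (parent[best], best) maximizes the smaller side.
    total = len(taxa)
    best = max((v for v in order if parent[v] is not None),
               key=lambda v: min(count[v], total - count[v]))

    # Collect the nodes below the centroid edge; the rest form the other side.
    below = set()
    stack = [best]
    while stack:
        v = stack.pop()
        below.add(v)
        stack.extend(w for w in adj[v] if parent[w] == v)
    above = set(adj) - below

    def induced(nodes):
        return {v: [w for w in adj[v] if w in nodes] for v in nodes}

    return (centroid_decompose(induced(below), taxa & below, max_size)
            + centroid_decompose(induced(above), taxa & above, max_size))


# Toy caterpillar tree with six taxa (a-f) and internal nodes u1-u4.
edges = [("a", "u1"), ("b", "u1"), ("u1", "u2"), ("c", "u2"),
         ("u2", "u3"), ("d", "u3"), ("u3", "u4"), ("e", "u4"), ("f", "u4")]
adj = build_adjacency(edges)
taxa = {"a", "b", "c", "d", "e", "f"}
subsets = centroid_decompose(adj, taxa, max_size=3)
print(sorted(subsets))
```

Each returned subset then defines a sub-alignment on which a constraint tree is estimated in Step 3.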

#### 2.2.3. TreeMerge

#### 2.2.4. Constrained-INC

#### 2.2.5. Guide Tree Merger (GTM)

#### 2.2.6. Steps 1–3 Pipeline Details

#### 2.3. Datasets

- 1000M1-HF. These 1000-sequence datasets are from a prior study [16], in which half of the sequences were made fragmentary. We used only the first five replicates.
- RNASim1000. We sampled five 1000-sequence subsets of the single million-sequence replicate from the RNASim million-sequence dataset studied in [28].
- Cox1-HET. This is a dataset developed explicitly for this study containing 2341 sequences, and described in detail below. We created 10 replicates.
- RNASim10K. We sampled 10,000-sequence subsets of the RNASim million-sequence datasets (ten replicates).
- RNASim50K. We used the same strategy as for the RNASim10K analysis but sampled 50,000-sequence subsets (ten replicates).

#### 2.3.1. 1000M1-HF

#### 2.3.2. RNASim1000

#### 2.3.3. Cox1-HET

#### 2.3.4. RNASim10K

#### 2.3.5. RNASim50K

#### 2.4. Computational Platform

#### 2.5. Evaluation Criteria

## 3. Results

#### 3.1. Experiment 1: Evaluating the Impact of Distance Matrix Calculation and Subset Size on DTM Pipelines

#### 3.2. Experiment 2: Evaluating the Impact of the Starting Tree on DTM Pipelines

- A comparison of the three ML heuristics shows that RAxML generally produces the most accurate trees, followed by IQ-TREE, and then by FastTree; the exception is the RNASim1000 condition where all three methods obtain about the same accuracy and FastTree is slightly more accurate than the other methods (Table 3). In the model conditions without fragmentary sequences, IQ-TREE and RAxML were very close in accuracy (Table 3 and Table 4), but RAxML was clearly more accurate than IQ-TREE on the fragmentary model condition (Table 5). FastTree had particularly poor accuracy on fragmentary sequences, but it was also clearly less accurate than the other methods on the model condition with heterotachy (Table 4). RAxML is the slowest and FastTree is the fastest.
- For each DTM method, using IQ-TREE instead of FastTree for the starting tree nearly always improved the accuracy. The one exception was the Cox1-HET condition, where Constrained-INC with the FastTree starting tree was slightly more accurate than with the IQ-TREE starting tree (Table 4).
- The DTM methods produced trees with similar accuracy, with a small disadvantage to Constrained-INC when using the IQ-TREE starting tree.
- When using an IQ-TREE starting tree, most of the running time of a DTM pipeline was spent constructing the starting tree. Furthermore, the DTM methods are very fast when using FastTree starting trees and slower when using IQ-TREE starting trees (and, by design, the DTM pipelines must be slower than their starting trees). Finally, TreeMerge was slower than the other DTM methods. (These observations are based on comparing running times for DTM pipelines to the starting tree runtimes.)
- For every model condition, the TreeMerge and GTM pipelines produced more accurate trees than their starting trees, but the improvement was small for IQ-TREE starting trees (0.7% to 1.8%) and varied for FastTree starting trees (0.2% to 8.5%).
- For every model condition, the TreeMerge and GTM pipelines using IQ-TREE starting trees matched or improved on both FastTree and IQ-TREE with respect to accuracy, with the size of the improvement depending on the model condition (ranging from 0.7% to 1.8%).
- The differences in accuracy using GTM or TreeMerge pipelines with FastTree or IQ-TREE starting trees were small (at most 0.3% in Table 3 and Table 4), with the exception of the 1000M1-HF datasets, where the difference was large (14.1% in Table 5). Hence, the choice of starting tree can matter for some model conditions.
- No DTM pipeline was consistently able to match the accuracy of RAxML. That is, for one model condition (RNASim1000), all of the DTM pipelines matched or improved on RAxML, but no DTM pipeline matched RAxML on the other two model conditions (Table 3). The difference in accuracy between RAxML and the GTM or TreeMerge pipelines using IQ-TREE starting trees was 0.5% for the Cox1-HET dataset and 3.5% for the 1000M1-HF datasets (Table 4 and Table 5).
- RAxML had a longer running time than these DTM pipelines, especially the GTM pipeline. For example, the GTM pipeline completed in 1.2 h on the RNASim1000 datasets, 3.1 h on the Cox1-HET datasets, and 3.1 h on the 1000M1-HF datasets, whereas RAxML did not complete within 24 h on two of the model conditions and required 7.1 h on the Cox1-HET datasets.
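The tree error values quoted in these observations are False Negative (FN) rates. Under the standard definition, the FN rate is the fraction of non-trivial bipartitions (internal edges) of the reference tree that are missing from the estimated tree. The sketch below illustrates that definition on toy trees written as nested tuples; it is not the `compare_two_trees.py` script listed in Appendix C, and the function names are hypothetical.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples
    of leaf names. Each bipartition is stored as the side that does not
    contain a fixed anchor leaf, so both trees use the same orientation."""
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
        return below

    visit(tree)
    return splits


def fn_rate(reference_tree, estimated_tree):
    """Fraction of reference bipartitions absent from the estimated tree."""
    reference = bipartitions(reference_tree)
    missing = reference - bipartitions(estimated_tree)
    return len(missing) / len(reference)


# Toy example: the estimate recovers only the split {a, b} | {c, d, e, f}.
reference = (("a", "b"), ("c", "d"), ("e", "f"))
estimate = (("a", "b"), ("c", "e"), ("d", "f"))
print(f"FN rate: {fn_rate(reference, estimate):.3f}")
```

Here two of the three internal edges of the reference tree are missed, so the FN rate is 2/3.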

#### 3.3. Experiment 3: Evaluating Fast DTM Pipelines in Comparison to ML Codes on Ultra-Large Datasets

#### Summary of Observed Trends

## 4. Discussion

#### 4.1. Comparison of the Three ML Codes

#### 4.2. Comparison between DTM Pipelines

#### 4.3. Comparing DTM Pipelines to ML Heuristics

#### 4.4. Limitations of This Study

#### 4.5. Future Work

## 5. Conclusions

- When the dataset sizes and computational resources make it feasible to use RAxML-NG or IQ-TREE 2, these methods should be used rather than FastTree 2, due to their (generally) higher level of topological accuracy when analyzing datasets that evolve under non-i.i.d. models (which may be typical of biological datasets). The choice between IQ-TREE 2 and RAxML-NG is less clear, as the relative accuracy of these two methods depended on the model condition. However, when datasets have substantial numbers of fragmentary sequences, RAxML-NG may be preferable to IQ-TREE 2.
- For very large datasets where RAxML-NG and IQ-TREE 2 are not feasible to run, FastTree 2 or one of the DTM pipelines that we presented could provide reasonable accuracy.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
DTM | Disjoint Tree Merger |
GTM | Guide Tree Merger |
GTR | Generalized Time Reversible |
GTR+G | GTR model with Gamma-distributed rate variation among sites |
ML | Maximum Likelihood |

## Appendix A. Additional Tables

**Table A1.** 1000M1 Resolution. Here, $e_1$ is the number of internal edges in the “PIMT”, or the Potentially Inferrable Model Tree, where the zero-event edges are collapsed, and $e_2$ is the number of internal edges in the binary model tree. The mean ± standard deviation is shown in the Overall row.

Replicate | $e_1$ | $e_2$ | $e_1/e_2$ |
---|---|---|---|
R0 | 996 | 997 | 0.999 |
R1 | 991 | 997 | 0.994 |
R2 | 993 | 997 | 0.996 |
R3 | 993 | 997 | 0.996 |
R4 | 991 | 997 | 0.994 |
Overall | 993 ± 2 | 997 ± 0 | 0.996 ± 0.002 |
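The Overall row of Table A1 can be recomputed from the five replicate rows. The short check below is only an illustration: it assumes the sample standard deviation, although the population standard deviation rounds to the same values for these data.

```python
from statistics import mean, stdev

# Internal edge counts from Table A1 (replicates R0-R4).
e1 = [996, 991, 993, 993, 991]   # PIMT, zero-event edges collapsed
e2 = [997, 997, 997, 997, 997]   # fully binary model tree
ratios = [a / b for a, b in zip(e1, e2)]

print(f"e1:    {mean(e1):.0f} ± {stdev(e1):.0f}")
print(f"e2:    {mean(e2):.0f} ± {stdev(e2):.0f}")
print(f"e1/e2: {mean(ratios):.3f} ± {stdev(ratios):.3f}")
```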

**Table A2.** Comparing branch length estimation using RAxML-NG instead of PAUP* inside TreeMerge. Here we show the False Negative (FN) error rates of two variants of TreeMerge, averaged over 10 replicates of Cox1-HET and 5 replicates each of RNASim1000 and 1000M1-HF. Using RAxML-NG for branch length calculation reduces the running time while maintaining the same level of tree error. A FastTree starting tree was used with centroid decompositions and maximum subset size 500. TreeMerge was provided with the topological distance matrix computed from the FastTree starting tree. The runtime reported is for the whole pipeline, from constructing the starting tree to obtaining the final TreeMerge tree. RAxML-NG was allowed 2 cores, and the constraint tree calculations for both TreeMerge variants were treated as having been run in parallel on up to 16 cores. Otherwise the steps are sequential, including the starting tree calculation, constraint tree calculation, and TreeMerge.

Dataset | Method | FN | Runtime (h) |
---|---|---|---|
RNASim1000 | PAUP* TreeMerge | 14.7% | 2.3 |
RNASim1000 | RAxML-NG TreeMerge | 14.7% | 1.2 |
1000M1-HF | PAUP* TreeMerge | 42.5% | 3.4 |
1000M1-HF | RAxML-NG TreeMerge | 42.5% | 2.5 |
Cox1-HET | PAUP* TreeMerge | 18.9% | 10.0 |
Cox1-HET | RAxML-NG TreeMerge | 18.9% | 6.5 |

**Table A3.** Experiment 1: Impact on TreeMerge tree error (FN) rates from varying the distance matrix calculation technique and maximum subset size on the Cox1-HET datasets. The starting tree is computed using FastTree, the constraint trees are computed using IQ-TREE, and constraint trees are merged using TreeMerge with PAUP* branch lengths, using a distance matrix computed using one of the selected techniques. K2P, F81, F84, and TN93 refer to distance calculations based on statistical models of sequence evolution; PDist refers to the p-distance between sequences (i.e., fraction of the alignment where two sequences differ); RY coding refers to coding all purines (A or G) as Rs and all pyrimidines (C or T) as Ys; and the remaining two (FastTree-topological and FastTree-branch-length) are distances produced using the FastTree guide tree. Columns give the maximum subset size. The most accurate results for each maximum subset size are boldfaced.

Distance matrix | 20 | 50 | 120 |
---|---|---|---|
FastME-PDist | 31.7% | 24.6% | 21.3% |
FastME-RY-sym | 32.8% | 25.6% | 22.1% |
FastME-RY | 32.8% | 25.6% | 22.1% |
FastME-JC69 | 31.6% | 24.7% | 21.3% |
FastME-K2P | 32.1% | 25.2% | 21.6% |
FastME-F81 | 31.6% | 24.6% | 21.3% |
FastME-F84 | 32.2% | 25.2% | 21.7% |
FastME-TN93 | 32.4% | 25.4% | 21.7% |
FastTree-topological | **27.0%** | **22.1%** | **19.9%** |
FastTree-branch-length | **27.0%** | **22.1%** | **19.9%** |

**Table A4.** Experiment 1: Impact on Constrained-INC tree error of subset size and distance matrix calculation on the Cox1-HET datasets. Here we show the tree error (FN) rates of running the Constrained-INC pipelines (IQ-TREE used for the starting and guide tree, as well as for the constraint trees) for different maximum subset sizes (120 or 500) and distance matrix calculation (topological vs. branch-length distances) on the guide tree using IQ-TREE. The results are averaged over 10 replicates. The most accurate result for each maximum subset size is boldfaced.

Method | Cox1-HET |
---|---|
Constrained-INC-120-branch-length | 24.7% |
Constrained-INC-120-topological | **20.7%** |
Constrained-INC-500-branch-length | 19.9% |
Constrained-INC-500-topological | **19.7%** |

**Table A5.** Experiment 1: Impact on GTM tree error (FN) rates of varying the maximum subset size on the Cox1-HET datasets. Columns give the maximum subset size. The most accurate result is boldfaced.

Method | 20 | 50 | 120 | 500 | 1000 |
---|---|---|---|---|---|
GTM | 24.3% | 21.0% | 19.4% | **18.8%** | 18.9% |

## Appendix B. Additional Details on the Cox1-HET Simulation

## Appendix C. Commands and Codes Used in the Study

**time** version: GNU time 1.7

- Code available at: https://ftp.gnu.org/gnu/time/
- `/usr/bin/time -v <command to measure> 2> <stderr file> 1> <stdout file>`

**FastTree** version: FastTree Version 2.1.10 Double precision (No SSE3), OpenMP

- Code available at: http://www.microbesonline.org/FastTree
- `FastTreeMP -nt -gtr -gamma <input alignment>`

**IQ-TREE** version: IQ-TREE multicore version 1.6.12

- Code available at: http://www.iqtree.org/
- `iqtree -s <input alignment> -m GTR+F+G4 -pre <output prefix>`

**IQ-TREE** version: IQ-TREE multicore version 2.0.6

- Code available at: http://www.iqtree.org/
- `iqtree2 -s <input alignment> -nt AUTO -ntmax <num threads> -seed 10101 -m GTR+G -pre <output prefix>`

**RAxML-NG** version: RAxML-NG v. 1.0.1

- Code available at: https://github.com/amkozlov/RAxML-NG
- `raxmlng --msa <input alignment> --threads <num threads> --seed 10101 --model GTR+G --prefix <output prefix>`

**Constrained-INC** commit id: 0b387d31a3874c60c284f6b619562a271bead0b2

- Code available at: https://github.com/steven-le-thien/INC
- `constraint_inc -i <input matrix> -o <output prefix> -q subtree -g <input tree> -t <space separated list of constraint trees>`

**GTM** commit id: 98d76bd2a553af0b8fa087f2ba0fedffa60c7b73

- Code available at: https://github.com/vlasmirnov/GTM
- `python gtm.py -s <input tree> -t <space separated list of constraint trees> -o <output prefix>`

**TreeMerge(PAUP*)** commit id: dee1ab9da49d36e1a83ebc4dc29b800f3574fe68. The version of PAUP* is 4.0.

- Code available at: https://github.com/ekmolloy/treemerge
- `python treemerge.py -s <input tree> -m <input matrix> -x <input matrix taxon list> -o <output prefix> -p <paup binary> -w <working directory> -t <space separated list of constraint trees>`

**TreeMerge(RAxML-NG)** commit id: dee1ab9da49d36e1a83ebc4dc29b800f3574fe68

- Code available at: https://github.com/minhyukpark/treemerge
- `python get_merge_list.py --starting-tree <input tree> --output-prefix <output prefix> -- <space separated list of constraint trees>`
- `python setup_merger.py --starting-tree <input tree> --files-needed <output named files_needed from get_merge_list.py> --output-prefix <output prefix> --merger-choice njmerge --guide-choice induced`
- `python run_merger.py --starting-tree <input tree> --files-needed <output named files_needed from get_merge_list.py> --output-prefix <output prefix> --merger-choice njmerge --guide-choice induced`
- `python treemerge.py -s <input tree> -m <input matrix> -x <input matrix taxon list> -o <output prefix> -p <paup binary> -w <working directory> --mst <minimum spanning tree> -t <space separated list of constraint trees>`

**Compare Trees**

- `python compare_two_trees.py -t1 <reference tree> -t2 <estimated tree>`

**RNASim1000 Generation** commit id: d2a774f23d68b59436195ea05a2006bbd2ad0ff6

- Code available at: https://github.com/MinhyukPark/QuickScripts
- `python random_seed_generator.py -n 5`
- `python random_sample_sequence.py --input-sequence <rnasim 1million alignment> --output-sequence <output prefix> --num-sequence 1000 --seed <seed>`
- `python induce_tree.py --input-tree <rnasim 1million model tree> --sequence-file <rnasim1000 input alignment> --output-file <output prefix>`

**Centroid Decomposition** commit id: d2a774f23d68b59436195ea05a2006bbd2ad0ff6

- Code available at: https://github.com/MinhyukPark/QuickScripts
- `python decompose.py --input-tree <input tree> --sequence-file <input alignment> --output-prefix <output prefix> --maximum-size <subset size>`
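As an illustration of how the commands in this appendix chain together for Steps 3 and 4, the sketch below assembles the IQ-TREE 2 and GTM argument lists for a set of subset alignments. It only builds and prints the commands; nothing is executed. The flags are copied from the listings above, while the file names, the helper functions, and the assumption that each constraint tree is read from IQ-TREE's `<prefix>.treefile` output are illustrative choices rather than part of the study's scripts.

```python
import shlex


def iqtree2_command(alignment, prefix, threads):
    """IQ-TREE 2 call for one subset alignment (flags as listed in this appendix)."""
    return ["iqtree2", "-s", alignment, "-nt", "AUTO", "-ntmax", str(threads),
            "-seed", "10101", "-m", "GTR+G", "-pre", prefix]


def gtm_command(starting_tree, constraint_trees, prefix):
    """GTM call that merges the constraint trees using the starting tree."""
    return ["python", "gtm.py", "-s", starting_tree,
            "-t", *constraint_trees, "-o", prefix]


# Hypothetical file names for two subsets produced by the decomposition step.
subset_alignments = ["subset_0.fasta", "subset_1.fasta"]
constraint_cmds = [iqtree2_command(aln, f"constraint_{i}", threads=16)
                   for i, aln in enumerate(subset_alignments)]
# Assumption: the constraint tree is taken from IQ-TREE's <prefix>.treefile.
constraint_trees = [f"constraint_{i}.treefile"
                    for i in range(len(subset_alignments))]
merge_cmd = gtm_command("starting.tree", constraint_trees, "gtm_output")

for cmd in constraint_cmds + [merge_cmd]:
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments means it could later be handed to `subprocess.run` without shell quoting problems.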

## References

1. Tavaré, S. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. **1986**, 17, 57–86.
2. Roch, S. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Trans. Comput. Biol. Bioinform. **2006**, 3, 92–94.
3. Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis. Bioinformatics **2014**, 30, 1312–1313.
4. Kozlov, A.M.; Darriba, D.; Flouri, T.; Morel, B.; Stamatakis, A. RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics **2019**, 35, 4453–4455.
5. Minh, B.Q.; Schmidt, H.A.; Chernomor, O.; Schrempf, D.; Woodhams, M.D.; Von Haeseler, A.; Lanfear, R. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. **2020**, 37, 1530–1534.
6. Swofford, D.L. PAUP* (*Phylogenetic Analysis Using PAUP), Version 4a161. 2018. Available online: http://phylosolutions.com/paup-test/ (accessed on 5 May 2021).
7. Guindon, S.; Dufayard, J.F.; Lefort, V.; Anisimova, M.; Hordijk, W.; Gascuel, O. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst. Biol. **2010**, 59, 307–321.
8. Price, M.N.; Dehal, P.S.; Arkin, A.P. FastTree 2—Approximately maximum-likelihood trees for large alignments. PLoS ONE **2010**, 5, e9490.
9. Liu, K.; Linder, C.R.; Warnow, T. RAxML and FastTree: Comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS ONE **2011**, 6, e27731.
10. Zhou, X.; Shen, X.X.; Hittinger, C.T.; Rokas, A. Evaluating fast maximum likelihood-based phylogenetic programs using empirical phylogenomic data sets. Mol. Biol. Evol. **2018**, 35, 486–503.
11. Nguyen, L.T.; Schmidt, H.A.; Von Haeseler, A.; Minh, B.Q. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. **2015**, 32, 268–274.
12. Hodcroft, E.B.; De Maio, N.; Lanfear, R.; MacCannell, D.R.; Minh, B.Q.; Schmidt, H.A.; Stamatakis, A.; Goldman, N.; Dessimoz, C. Want to track pandemic variants faster? Fix the bioinformatics bottleneck. Nature **2021**, 591, 30–33.
13. Heath, T.A.; Hedtke, S.M.; Hillis, D.M. Taxon sampling and the accuracy of phylogenetic analyses. J. Syst. Evol. **2008**, 46, 239–257.
14. Zhang, C.; Scornavacca, C.; Molloy, E.K.; Mirarab, S. ASTRAL-Pro: Quartet-based species-tree inference despite paralogy. Mol. Biol. Evol. **2020**, 37, 3292–3307.
15. Lees, J.A.; Kendall, M.; Parkhill, J.; Colijn, C.; Bentley, S.D.; Harris, S.R. Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: A simulation based study. Wellcome Open Res. **2018**, 3, 33.
16. Smirnov, V.; Warnow, T. Phylogeny estimation given sequence length heterogeneity. Syst. Biol. **2021**, 70, 268–282.
17. Sayyari, E.; Whitfield, J.B.; Mirarab, S. Fragmentary gene sequences negatively impact gene tree and species tree reconstruction. Mol. Biol. Evol. **2017**, 34, 3279–3291.
18. Zhang, Q.R.; Rao, S.; Warnow, T.J. New Absolute Fast Converging Phylogeny Estimation Methods with Improved Scalability and Accuracy. In Proceedings of the 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, Helsinki, Finland, 20–22 August 2018; pp. 8:1–8:12.
19. Molloy, E.K.; Warnow, T. NJMerge: A Generic Technique for Scaling Phylogeny Estimation Methods and Its Application to Species Trees. In Comparative Genomics. RECOMB-CG 2018. Lecture Notes in Computer Science; Blanchette, M., Ouangraoua, A., Eds.; Springer: Cham, Switzerland, 2018; Volume 11183.
20. Molloy, E.K.; Warnow, T. TreeMerge: A new method for improving the scalability of species tree estimation methods. Bioinformatics **2019**, 35, i417–i426.
21. Smirnov, V.; Warnow, T. Unblended disjoint tree merging using GTM improves species tree estimation. BMC Genom. **2020**, 21, 1–17.
22. Maddison, W.P. Gene trees in species trees. Syst. Biol. **1997**, 46, 523–536.
23. Mirarab, S.; Warnow, T. FastSP: Linear time calculation of alignment accuracy. Bioinformatics **2011**, 27, 3250–3258.
24. Le, T.; Sy, A.; Molloy, E.K.; Zhang, Q.; Rao, S.; Warnow, T. Using Constrained-INC for large-scale gene tree and species tree estimation. IEEE/ACM Trans. Comput. Biol. Bioinform. **2020**, 18, 2–15.
25. Molloy, E.K.; Warnow, T. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using NJMerge. Algorithms Mol. Biol. **2019**, 14.
26. Robinson, D.; Foulds, L. Comparison of phylogenetic trees. Math. Biosci. **1981**, 53, 131–147.
27. Liu, K.; Warnow, T.J.; Holder, M.T.; Nelesen, S.M.; Yu, J.; Stamatakis, A.P.; Linder, C.R. SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst. Biol. **2012**, 61, 90.
28. Mirarab, S.; Nguyen, N.; Guo, S.; Wang, L.S.; Kim, J.; Warnow, T. PASTA: Ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. **2015**, 22, 377–386.
29. Smirnov, V.; Warnow, T. MAGUS: Multiple Sequence Alignment using Graph Clustering. Bioinformatics **2020**.
30. Liu, K.; Raghavan, S.; Nelesen, S.; Linder, C.R.; Warnow, T. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science **2009**, 324, 1561–1564.
31. Lopez, P.; Casane, D.; Philippe, H. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. **2002**, 19, 1–7.
32. Fletcher, W. INDELible v1.03 Control File Tutorial. Available online: http://abacus.gene.ucl.ac.uk/software/indelible/tutorial/nucleotide-branch.shtml (accessed on 5 May 2021).
33. Fletcher, W.; Yang, Z. INDELible: A flexible simulator of biological sequence evolution. Mol. Biol. Evol. **2009**, 26, 1879–1888.
34. Guo, S. CIPRES Simulation Data. Available online: https://kim.bio.upenn.edu/software/csd.shtml (accessed on 5 May 2021).
35. Lefort, V.; Desper, R.; Gascuel, O. FastME 2.0: A comprehensive, accurate, and fast distance-based phylogeny inference program. Mol. Biol. Evol. **2015**, 32, 2798–2800.
36. Puillandre, N.; Brouillet, S.; Achaz, G. ASAP: Assemble species by automatic partitioning. Mol. Ecol. Resour. **2021**, 21, 609–620.
37. Kimura, M. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. **1980**, 16, 111–120.

**Figure 1.** DTM Pipeline. Here, we show a step-by-step overview of the pipeline we study that uses Disjoint Tree Merger (DTM) methods to infer a tree. In Step 1, a tree is computed on the input multiple sequence alignment, which serves as a starting tree as well as providing auxiliary information used by the DTM. In Step 2, the starting tree is decomposed into multiple disjoint subsets; each subset defines a sub-alignment of the input alignment. In Step 3, a tree is inferred on each disjoint subset (i.e., alignment), which subsequently serve as constraint trees. In Step 4, these constraint trees are combined using a DTM (along with the auxiliary information provided by the starting tree) to form the final output tree.

**Figure 2.** Comparing the GTM pipeline to ML methods for error rates. Error rates for trees computed using standard ML methods and a GTM pipeline, combining IQ-TREE constraint trees. The starting tree is IQ-TREE for the smaller datasets (i.e., 1000M1-HF, RNASim1000, and Cox1-HET) and FastTree on the larger datasets (i.e., RNASim10k and RNASim50k). All of the ML methods were given true alignments and allowed to run to completion, except for RAxML, which was limited to 24 h on the 1000M1-HF, RNASim1000, and Cox1-HET datasets and limited to 168 h on the RNASim10k and RNASim50k datasets. On the RNASim50k datasets, IQ-TREE failed to return a tree and RAxML returned a tree with an error rate of 100%, and so the results for RAxML and IQ-TREE are not shown for that condition.

**Figure 3.** Comparing the GTM pipeline to ML methods for running time. Here, we show the runtime (hours) of computing the trees using three different ML methods and a GTM pipeline. For the 1000M1-HF, RNASim1000, and Cox1-HET datasets, the GTM pipeline uses an IQ-TREE starting tree, IQ-TREE constraint trees, and an IQ-TREE guide tree. For the RNASim10k and RNASim50k datasets, the GTM pipeline uses a FastTree starting tree, IQ-TREE constraint trees, and a FastTree guide tree. All of the ML methods were given true alignments and allowed to run to completion except for RAxML, which was limited to 24 h on the 1000M1-HF, RNASim1000, and Cox1-HET datasets, and limited to 168 h on the RNASim10k and RNASim50k datasets. On the RNASim50k datasets, IQ-TREE failed to return a tree and RAxML returned a tree with an error rate of 100%, and so results for RAxML and IQ-TREE are not shown for that condition.

**Table 1.** Description of DTM methods used in this study. DTMs are used inside divide-and-conquer pipelines, as described in Figure 1, where they are used to combine trees on disjoint subsets (requiring that the subset trees are induced in the final tree). Each requires some auxiliary information, described here in terms of a user-provided starting tree that is estimated from the input multiple sequence alignment.

Method | Description | Reference |
---|---|---|
Constrained-INC | The Constrained-INC method uses the starting tree to compute a matrix of leaf-to-leaf distances (either topological distances or using branch lengths). Constrained-INC builds the tree incrementally, and the distances are used to define the order in which the taxa are added to the growing tree. When adding a taxon to the growing tree, a selected set of quartet trees (extracted from the starting tree) is used to vote on where to add the new taxon. Constrained-INC runs in polynomial time and allows full blending. | [24] |
TreeMerge | TreeMerge uses the starting tree to compute a matrix of leaf-to-leaf distances (either topological distances or using branch lengths). It then combines selected pairs of constraint trees using NJMerge [25] (an early DTM), thus producing larger trees that are also treated as constraint trees. These larger trees are then merged together if they overlap, using branch lengths. TreeMerge runs in polynomial time. TreeMerge allows partial blending (since NJMerge allows full blending, but the second stage, where larger trees are superimposed, only enables partial blending). | [20] |
GTM | The Guide Tree Merger (GTM) combines the constraint trees by finding ways to add edges between the constraint trees to minimize the total Robinson-Foulds [26] distance to the constraint trees. This is in general an NP-hard optimization problem, but can be solved in polynomial time when the output tree is obtained by adding edges between the constraint trees. By design, GTM does not perform any blending. | [21] |
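Both Constrained-INC and TreeMerge take a leaf-to-leaf distance matrix derived from the starting tree. The sketch below shows one way a topological distance matrix could be computed. It assumes the topological distance between two leaves is the number of edges on the path connecting them and that the tree is supplied as an undirected adjacency dictionary; the exact convention and input format used by the tools may differ, and the function name is hypothetical.

```python
from collections import deque


def topological_distances(adj):
    """Return (leaves, dist), where dist[a][b] is the number of edges on the
    path between leaves a and b in the tree described by `adj`."""
    leaves = sorted(v for v in adj if len(adj[v]) == 1)
    dist = {}
    for source in leaves:
        # Breadth-first search gives edge counts from `source` to every node.
        depth = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in depth:
                    depth[w] = depth[v] + 1
                    queue.append(w)
        dist[source] = {leaf: depth[leaf] for leaf in leaves}
    return leaves, dist


# Toy quartet tree ab|cd with internal nodes u and v.
quartet = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
    "u": ["a", "b", "v"], "v": ["c", "d", "u"],
}
leaves, dist = topological_distances(quartet)
for leaf in leaves:
    print(leaf, [dist[leaf][other] for other in leaves])
```

Running one breadth-first search per leaf costs quadratic time in the number of taxa, which is the same order as the size of the matrix itself.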

Name | # Seqs | # Reps | Description |
---|---|---|---|
1000M1-HF | 1000 | 5 | Introduced in [16]. Created by making the 1000M1 model condition from [30] fragmentary (half the sequences with length 25% of the original median sequence length). Sequence evolution under this model uses the standard GTR+G substitution model, enhanced with indels. |
RNASim1000 | 1000 | 5 | Randomly sampled for this study from the RNASim million-sequence simulation [28]. The RNASim sequence evolution model is non-standard, including selection to maintain the RNA structure. |
Cox1-HET | 2341 | 10 | Created for this study by evolving sequences down a model tree with different evolutionary parameters for different areas of the tree. This model condition involves substantial heterogeneity across the tree, including heterotachy. |
RNASim10k | 10,000 | 10 | Randomly sampled for this study from the RNASim million-sequence simulation. See comments above for RNASim1000. |
RNASim50k | 50,000 | 10 | Randomly sampled for this study from the RNASim million-sequence simulation. See comments above for RNASim1000. |

**Table 3.** Results on the RNASim1000 datasets. Tree error (FN) rates and running time for ML methods and DTM pipelines on the RNASim1000 model condition (averaged across five replicates). The DTM pipelines use the starting tree as the guide tree (with either FastTree or IQ-TREE), and decompose to a maximum subset size 500. If a pipeline requires a distance matrix, as is the case with Constrained-INC and TreeMerge, the topological distance matrix is obtained from the starting tree. The asterisk (*) for RAxML running time indicates that it had not been completed by that time, and we used the best scoring tree it found in that time period. The most accurate results are boldfaced.

Method | Running Time (h) | Tree Error |
---|---|---|
FastTree | 0.1 | 14.9% |
IQ-TREE | 1.0 | 15.1% |
RAxML | 24 * | 15.1% |
DTMs, FastTree starting tree | | |
Constrained-INC | 0.4 | 15.1% |
GTM | 0.4 | 14.7% |
TreeMerge | 1.2 | 14.7% |
DTMs, IQ-TREE starting tree | | |
Constrained-INC | 1.2 | 14.5% |
GTM | 1.2 | **14.4%** |
TreeMerge | 1.9 | **14.4%** |

**Table 4.** Results on the Cox1-HET datasets. Tree error (FN) rates and running time for existing ML methods and DTM pipelines on the Cox1-HET model condition (averaged across 10 replicates). The DTM pipelines use the starting tree as a guide tree with two options for guide trees (FastTree or IQ-TREE), and then decompose to maximum subset size 500. If a pipeline requires a distance matrix, as is the case with Constrained-INC and TreeMerge, the topological distance matrix is obtained from the starting tree. The most accurate result is boldfaced.

Method | Running Time (h) | Tree Error |
---|---|---|
FastTree | 0.06 | 23.9% |
IQ-TREE | 2.3 | 19.6% |
RAxML | 7.1 | **18.2%** |
DTMs, FastTree starting tree | | |
Constrained-INC | 0.8 | 18.9% |
GTM | 0.8 | 18.9% |
TreeMerge | 6.5 | 18.9% |
DTMs, IQ-TREE starting tree | | |
Constrained-INC | 3.1 | 19.7% |
GTM | 3.1 | 18.7% |
TreeMerge | 8.5 | 18.7% |

**Table 5.** Results on the 1000M1-HF datasets. The tree error rates (FN) and running time for existing ML methods and DTM pipelines on the 1000M1-HF model condition (averaged across five replicates). The DTM pipelines use the starting tree as a guide tree with two options for guide trees (FastTree or IQ-TREE), and decompose to maximum subset size 500. If a pipeline requires a distance matrix, as is the case with Constrained-INC and TreeMerge, the topological distance matrix is obtained from the starting tree. The asterisk (*) for RAxML running time indicates that it had not completed by that time, and we used the best scoring tree that it found in that time period. The most accurate result is boldfaced.

Method | Running Time (h) | Tree Error |
---|---|---|
FastTree | 0.03 | 50.9% |
IQ-TREE | 2.5 | 30.2% |
RAxML | 24 * | **24.9%** |
DTMs, FastTree starting tree | | |
Constrained-INC | 1.0 | 42.4% |
GTM | 1.0 | 42.4% |
TreeMerge | 2.5 | 42.5% |
DTMs, IQ-TREE starting tree | | |
Constrained-INC | 3.1 | 28.6% |
GTM | 3.1 | 28.4% |
TreeMerge | 3.9 | 28.5% |

**Table 6.** Results on the RNASim10k datasets. The average running time (in hours, assuming use of 16 cores) and tree error (FN) rates over 10 replicates. Each method was given 64 GB. The asterisk (*) for RAxML running time indicates that it had not completed by that time, and we used the best scoring tree that it found in that time period. The most accurate result is boldfaced.

Method | Running Time (h) | Tree Error |
---|---|---|
FastTree | 2.0 | 10.8% |
IQ-TREE | 74.6 | 10.9% |
RAxML | 168 * | 12.3% |
DTMs, FastTree starting tree | | |
Constrained-INC | 4.3 | 10.5% |
GTM | 3.5 | **10.1%** |

**Table 7.** Results on the RNASim50k datasets. The average running time (in hours) and tree error (FN) rates over 10 replicates. Running time assumes that each method was run with 16 cores. Each method was given 64 GB. “N.A.” indicates a failure to produce a tree due to a memory allocation problem. The asterisk (*) for RAxML running time indicates that it had not been completed by that time, and we used the best scoring tree it found in that time period. The most accurate result is boldfaced.

Method | Running Time (h) | Tree Error |
---|---|---|
FastTree | 8.3 | 8.0% |
IQ-TREE | N.A. | N.A. |
RAxML | 168 * | 100.0% |
DTMs, FastTree starting tree | | |
GTM | 14.1 | **7.5%** |
Constrained-INC | N.A. | N.A. |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Park, M.; Zaharias, P.; Warnow, T. Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation. *Algorithms* **2021**, *14*, 148.
https://doi.org/10.3390/a14050148

**AMA Style**

Park M, Zaharias P, Warnow T. Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation. *Algorithms*. 2021; 14(5):148.
https://doi.org/10.3390/a14050148

**Chicago/Turabian Style**

Park, Minhyuk, Paul Zaharias, and Tandy Warnow. 2021. "Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation" *Algorithms* 14, no. 5: 148.
https://doi.org/10.3390/a14050148