This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Distance measures between trees are useful for comparing trees in a systematic manner, and several different distance measures have been proposed. The triplet and quartet distances, for rooted and unrooted trees, respectively, are defined as the number of subsets of three or four leaves, respectively, where the topologies of the induced subtrees differ. These distances can trivially be computed by explicitly enumerating all sets of three or four leaves and testing if the topologies are different, but this leads to time complexities at least of the order n^3 or n^4 just for enumerating the sets. The different topologies can be counted implicitly, however, and in this paper, we review a series of algorithmic improvements that have been used during the last decade to develop more efficient algorithms by exploiting two different strategies for this: one based on dynamic programming and another based on coloring leaves in one tree and updating a hierarchical decomposition of the other.

Evolutionary relationships are often represented as trees; a practice not only used in biology, where trees can represent, for example, species relationships or gene relationships in a gene family, but also used in many other fields studying objects related in some evolutionary fashion. Examples include linguistics, where trees represent the evolution of related languages, or archeology, where trees have been used to represent how copies of ancient manuscripts have changed over time. Common for most such fields is that the true tree relationship between objects is never observed, but must be inferred, and depending on both the data available and the methods used for the inference, the inferred trees will likely be slightly different.

Tree distances provide a formal way of quantifying how similar two trees are and can, for example, be used to determine whether two trees are significantly similar or no more similar than could be expected by chance. Many different distances have been defined on trees. Most only consider the tree topology.

The Robinson-Foulds distance considers all the ways the leaf labels can be split into two sets and counts how often only one of the trees has an edge matching this split. Informally, this essentially means that it counts how often the two trees have the "same edge" and how often an edge in one tree has no counterpart in the other. Edges are arguably the simplest element of the topology of a tree, and not surprisingly, the Robinson-Foulds distance is both the most frequently used distance measure and the distance measure that can be computed with the optimal algorithmic complexity: linear time in the number of leaves.
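The split-counting definition can be sketched directly in a few lines of Python. The sketch below is our own naive, quadratic illustration over trees given as nested tuples (the helper names `leafset`, `splits` and the representation are our choices), not the linear-time algorithm that the optimal complexity refers to:

```python
def leafset(t):
    """The set of leaf labels in a nested-tuple tree."""
    if not isinstance(t, tuple):
        return frozenset([t])
    return frozenset().union(*map(leafset, t))

def splits(tree):
    """Nontrivial bipartitions induced by the internal edges of the
    tree, each stored canonically as the side avoiding a fixed leaf."""
    all_leaves = leafset(tree)
    ref = min(all_leaves)
    found = set()
    def walk(t):
        if not isinstance(t, tuple):
            return frozenset([t])
        below = frozenset().union(*map(walk, t))
        # trivial splits (one leaf vs. the rest) carry no information
        if 1 < len(below) < len(all_leaves) - 1:
            found.add(below if ref not in below else all_leaves - below)
        return below
    walk(tree)
    return found

def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))
```

For example, `((("a","b"),"c"),("d","e"))` and `((("a","c"),"b"),("d","e"))` share the split {d,e} but disagree on the other internal edge, giving distance 2.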

The triplet and quartet distances enumerate all subsets of three or four leaves, respectively, and count how often the topologies induced by the three or four leaves are the same in the two trees. The triplet distance is intended for rooted trees, where the triplet topology is the smallest informative subtree (for unrooted trees, all subtrees with three leaves have the same topology), while the quartet distance is intended for unrooted trees, where the quartet topology is the smallest informative subtree. Whether it is possible to compute the triplet and quartet distance in linear time is unknown. The fastest known algorithms have time complexity O(n log n) for the triplet distance and for the quartet distance between binary trees, with an additional dependency on the maximal degree for the quartet distance between general trees.
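The trivial enumeration strategy is easy to state concretely. Below is a minimal Python sketch of the O(n^3) triplet distance on rooted trees given as nested tuples; the representation and the helper names (`lca_depths`, `triplet_topology`) are our own, not from the papers under review. It decides each triplet's topology by which pair of leaves has the deepest lowest common ancestor:

```python
from itertools import combinations

def lca_depths(tree):
    """Depth of the lowest common ancestor for every pair of leaves,
    computed in one recursive sweep over a nested-tuple rooted tree."""
    pair_depth = {}
    def walk(t, depth):
        if not isinstance(t, tuple):
            return [t]
        groups = [walk(child, depth + 1) for child in t]
        # pairs drawn from different children meet at the current node
        for g1, g2 in combinations(groups, 2):
            for a in g1:
                for b in g2:
                    pair_depth[frozenset((a, b))] = depth
        return [leaf for g in groups for leaf in g]
    walk(tree, 0)
    return pair_depth

def triplet_topology(pd, a, b, c):
    """The pair grouped together in the induced triplet topology,
    or None when the triplet is unresolved (all LCAs coincide)."""
    pairs = [frozenset(p) for p in ((a, b), (a, c), (b, c))]
    depths = [pd[p] for p in pairs]
    deepest = max(depths)
    if depths.count(deepest) == 3:
        return None
    return pairs[depths.index(deepest)]

def triplet_distance(t1, t2):
    """Count triplets whose induced topology differs between trees."""
    pd1, pd2 = lca_depths(t1), lca_depths(t2)
    leaves = sorted({x for p in pd1 for x in p})
    return sum(1 for a, b, c in combinations(leaves, 3)
               if triplet_topology(pd1, a, b, c)
               != triplet_topology(pd2, a, b, c))
```

The enumeration over `combinations(leaves, 3)` is exactly the cubic bottleneck that the algorithms reviewed below avoid.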

In this paper, we will review the algorithmic development that led to these non-trivial running times, in particular, the development of algorithms for general (non-binary) trees, to which the authors have contributed a number of papers. We will first formally define the triplet and quartet distance between two leaf-labeled trees. We then describe the state-of-the-art for binary trees with two different approaches to computing the distances: one based on dynamic programming with time complexity O(n^2) for both the quartet and triplet distance and one based on coloring leaves in a tree traversal with complexity O(n log n). Finally, we describe how the two approaches have been extended to general trees and compare the implementations experimentally.

Given two trees, T1 and T2, each with the same set of n uniquely labeled leaves, the triplet and quartet distances are defined by comparing the topologies induced by small subsets of leaves.

A triplet is a set, {a, b, c}, of three leaves. In a rooted tree, the topology induced by a triplet is either resolved, meaning that one pair of leaves is separated from the third leaf (written, for example, ab|c), or unresolved, meaning that the paths connecting the three leaves meet in a single node.

A quartet is a set, {a, b, c, d}, of four leaves. In an unrooted tree, the topology induced by a quartet is either resolved, taking one of the three forms ab|cd, ac|bd or ad|bc, where the first pair of leaves is separated from the second pair by an edge, or unresolved, meaning that the paths connecting the four leaves meet in a single node (a star).
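To make the quartet definition concrete, here is a small Python helper (our own construction, assuming string leaf labels and nested-tuple trees) that finds the induced topology of four leaves, using the fact that the pairing ab|cd holds exactly when the a-b path and the c-d path are vertex-disjoint:

```python
from collections import deque

def adjacency(tree):
    """Adjacency map of a nested-tuple tree; internal nodes get fresh
    integer ids (we assume leaf labels are strings, so no clashes)."""
    adj, next_id = {}, [0]
    def build(t):
        if not isinstance(t, tuple):
            return t
        node = next_id[0]
        next_id[0] += 1
        for child in t:
            c = build(child)
            adj.setdefault(node, []).append(c)
            adj.setdefault(c, []).append(node)
        return node
    build(tree)
    return adj

def path_vertices(adj, a, b):
    """Vertices on the unique a-b path in the tree, found by BFS."""
    prev, queue = {a: None}, deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    out, u = set(), b
    while u is not None:
        out.add(u)
        u = prev[u]
    return out

def quartet_topology(adj, a, b, c, d):
    """The resolved pairing, e.g. (('a','b'), ('c','d')), or None
    when the four leaves induce an unresolved star."""
    for (x, y), (z, w) in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
        if not path_vertices(adj, x, y) & path_vertices(adj, z, w):
            return ((x, y), (z, w))
    return None
```

Note how a single inner node of degree four, as in `("a","b","c","d")`, makes all three pairings' paths intersect, yielding the unresolved case.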

Different cases for triplet and quartet topologies.

The rightmost cases for triplets and quartets in the figure above, the unresolved topologies, can only occur in non-binary trees. The triplets and quartets in two trees, T1 and T2, of arbitrary degree can be partitioned into five cases: resolved identically in both trees; resolved in both trees, but differently; resolved in T1, but unresolved in T2; unresolved in T1, but resolved in T2; and unresolved in both trees.

Cases for computing differences.

Since both distances are defined as the number of differing topologies, they can be computed as the total number of sets, (n choose 3) for triplets and (n choose 4) for quartets, minus the number of sets whose topology agrees in the two trees: those resolved identically in both trees and those unresolved in both.

The quartet and triplet distances are known to be more robust to small changes in the trees than other distance measures, including the Robinson-Foulds distance.

The parameterized triplet and quartet distance is defined by Bansal et al.: a triplet or quartet that is resolved in one tree, but unresolved in the other, contributes a penalty, p, for a parameter between zero and one, to the distance, instead of a full unit.

In this section, we consider the binary case, where all inner nodes in both trees are fully resolved, so that all induced triplet and quartet topologies are resolved.

We will describe two algorithms for computing the quartet distance for binary trees, one based on dynamic programming and the other on coloring leaves and comparing topologies induced by the coloring. These are also the two approaches that we have used to handle general trees, and we will describe the extensions for general trees in the next section. Variations of the two approaches to compute the triplet distance have also been developed, but in our main exposition, we focus on the quartet distance.

The first algorithm, from Bryant et al., computes the quartet distance between two binary trees in time O(n^2) using dynamic programming.

We first conceptually take all edges in the two trees and replace each of them with two oriented edges. This way, we can uniquely assign three subtrees to each oriented edge: T1 behind the edge, and T2 and T3 in front of the edge. A quartet topology ab|cd can be oriented in two ways, ab -> cd and cd -> ab, and we say that an oriented edge claims the oriented quartet ab -> cd if a and b are found in T1, behind the edge, while c and d are found in front of the edge, in T2 and T3, respectively (or in T3 and T2, respectively).

Any quartet is claimed by exactly two edges.

We will consider these two oriented versions of each quartet as separate quartets in the counting below and correct for the resulting double counting at the end.

Given two trees, T1 and T2, the algorithm now simply iterates through all pairs of oriented edges, e1 from T1 and e2 from T2, and counts how many oriented quartets are claimed by both e1 and e2. Letting F1, F2 and F3 denote the subtrees assigned to e1, and G1, G2 and G3 the subtrees assigned to e2, this number, denoted A(e1, e2), can be computed as:

A(e1, e2) = (|F1 ∩ G1| choose 2) * (|F2 ∩ G2| * |F3 ∩ G3| + |F2 ∩ G3| * |F3 ∩ G2|)

Summing A(e1, e2) over all pairs of oriented edges counts each shared quartet topology twice, once for each orientation. The number of shared quartets (and, by extension, the quartet distance) can then be computed as half this sum. Since the two trees have O(n) edges each and A(e1, e2) can be evaluated in constant time from tables of subtree intersection sizes, the total running time is O(n^2).

Tables for |F ∩ G|, the number of leaves shared between a subtree, F, of T1 and a subtree, G, of T2, can be computed in time O(n^2): if F has the children subtrees F1 and F2, and G has the children subtrees G1 and G2, then |F ∩ G| = |F1 ∩ G1| + |F1 ∩ G2| + |F2 ∩ G1| + |F2 ∩ G2|. If F or G is a single leaf, the intersection size is zero or one, depending on whether the leaf is found in the other subtree.
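This recurrence lends itself to a direct memoized implementation. The sketch below is our own notation, using nested tuples and recursing child-by-child rather than in the 2x2 form, which also covers nodes of higher degree; for binary trees, there are O(n^2) pairs of subtrees, each filled in constant amortized time:

```python
def subtrees(t):
    """All subtrees of a nested-tuple tree, including leaves."""
    yield t
    if isinstance(t, tuple):
        for child in t:
            yield from subtrees(child)

def intersection_sizes(t1, t2):
    """Memoized table of |F ∩ G| for every pair of subtrees F of t1
    and G of t2, computed by splitting one side into its children."""
    table = {}
    def inter(f, g):
        key = (f, g)
        if key not in table:
            if isinstance(f, tuple):
                table[key] = sum(inter(child, g) for child in f)
            elif isinstance(g, tuple):
                table[key] = sum(inter(f, child) for child in g)
            else:
                table[key] = int(f == g)  # two leaves intersect iff equal
        return table[key]
    for f in subtrees(t1):
        for g in subtrees(t2):
            inter(f, g)
    return table
```

With unique leaf labels, every subtree is a distinct hashable key, so the dictionary doubles as both memo and output table.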

By modifying this algorithm slightly, it can also be used to compute the triplet distance between rooted binary trees in O(n^2) time.

Brodal et al. developed two algorithms based on a different strategy, with running times O(n log^2 n) and O(n log n). Here, shared(T1, T2) denotes the number of quartets (in the remainder of this section, we will not distinguish between quartets and oriented quartets) that induce the same quartet topology in both T1 and T2. The two algorithms, however, only explicitly iterate over the nodes in T1, while for each node, v1, in T1, they count the associated quartets implicitly. To realize this strategy, both algorithms use a coloring procedure in which the leaves of the two trees are colored using the three colors, A, B and C. For an inner node, v1, in T1, we say that T1 is colored according to v1 if the leaves in one of the subtrees below v1 all have the color, A, the leaves in the other subtree below v1 all have the color, B, and all remaining leaves have the color, C. If T1 is colored according to a node, v1, then the set of quartets in T1 that are compatible with this coloring is exactly the set of quartets associated with v1 and, furthermore, if we color the leaves of T2 in the same way as in T1, the set of quartets in T2 that are compatible with this coloring is exactly the set of quartets that are associated with v1 and induce the same topology in T1 and T2; summing the counts of such quartets over all nodes, v1, thus yields shared(T1, T2).

Naively coloring the leaves of the two trees according to each inner node in T1 and counting the compatible quartets in T2 explicitly would take time O(n^2) or more. Instead, by traversing T1 recursively and using the "smaller-half trick", we can ensure that each leaf changes color only O(log n) times in total. To count the compatible quartets in T2 for each coloring, we use a hierarchical decomposition of T2. The main feature of this data structure is that we can decorate it in a way such that it can return the number of quartets in T2 compatible with the current coloring in constant time, while supporting efficient updates when a leaf changes color. Combining these ingredients yields the O(n log^2 n) algorithm, and a more careful, amortized handling of the updates yields the O(n log n) algorithm described below.

The coloring procedure starts by rooting T1 in an arbitrary leaf. It then, for each inner node, v1, in T1, calculates the number of leaves, |v1|, in the subtree rooted at v1. This is done in a postorder traversal starting in the new (designated) root of T1, and the information is stored in the nodes. In this traversal, all leaves are also colored by the color, A. The algorithm then traverses T1 recursively, as described and illustrated below. Here, large(v1), respectively small(v1), denotes the child of v1 that constitutes the root of the largest, respectively smallest, of the two subtrees under v1, measured on the number of leaves in the subtrees, and count() queries the hierarchical decomposition of T2 to retrieve the number of quartets in T2 that are compatible with the current coloring. During the traversal, the following invariants are maintained for all nodes, v1, in T1: (1) when the procedure is called on v1, all leaves in the subtree rooted at v1 have the color, A, and all other leaves have the color, C; (2) when the call on v1 returns, all leaves in the subtree rooted at v1 have the color, C; and (3) when count() is called at v1, then T1 is colored according to v1. Hence, the correctness of the algorithm follows, assuming that count() correctly returns the number of quartets in T2 compatible with the current coloring of the leaves.

Traversing T1 using the "smaller-half trick" to ensure that each leaf changes color at most O(log n) times.

Note that the color of a specific leaf is only changed whenever it is in the smallest subtree of one of its ancestors. Since the size of the smallest of the two subtrees of a node, v1, is at most half the size of the entire subtree rooted at v1, this can only happen at most log2 n times for each leaf.
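The bound is easy to verify empirically. The sketch below (our own simulation, not code from the reviewed papers) charges one recoloring to every leaf that lies outside the largest child subtree of each of its ancestors, mirroring the traversal up to constant factors, and we can check that no leaf is charged more than log2 n times:

```python
from collections import Counter

def leaves(t):
    return [t] if not isinstance(t, tuple) else [x for c in t for x in leaves(c)]

def recoloring_counts(tree):
    """For each leaf, the number of ancestors at which the leaf lies
    outside the largest child subtree; this is (up to constants) how
    often the smaller-half traversal recolors it."""
    counts = Counter()
    def walk(t):
        if not isinstance(t, tuple):
            return 1
        sized = [(walk(child), child) for child in t]
        largest = max(size for size, _ in sized)
        skipped_largest = False
        for size, child in sized:
            if size == largest and not skipped_largest:
                skipped_largest = True  # the largest child is visited for free
            else:
                for leaf in leaves(child):
                    counts[leaf] += 1
        return sum(size for size, _ in sized)
    walk(tree)
    return counts

def balanced(labels):
    """A balanced binary nested-tuple tree over the given labels."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return (balanced(labels[:mid]), balanced(labels[mid:]))
```

On a balanced tree with 16 leaves, the most-recolored leaf is charged exactly log2 16 = 4 times, while on a caterpillar tree, every leaf is charged at most once.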

The hierarchical decomposition tree of T2, HDT(T2), is a representation of T2 as a binary tree data structure with logarithmic height. The nodes in HDT(T2) are called components, and each of them is of one of the six types illustrated below. The nodes of T2 (including leaves) constitute the leaves of HDT(T2): each leaf in T2 is contained in a component of type (i), and each inner node of T2 is contained in a component of type (ii). An inner node/component of HDT(T2) represents a connected subset of nodes in T2, and it is formed as the union of two adjacent components using one of the compositions, where two components are adjacent if there is an edge in T2 connecting the two subsets of nodes they represent. The root of HDT(T2) represents all of T2.

The six different types of components in a hierarchical decomposition tree.

To build HDT(T2), we first make a component of type (i) or (ii) for each node in T2 and, then, greedily combine and replace pairs of components using the compositions, until only a single root component remains. Performed carefully, this construction takes linear time and yields a decomposition tree of height O(log n).

To be able to retrieve the number of compatible quartets in T2 from HDT(T2) in constant time, we decorate it in the following way: for each component in HDT(T2), we store a tuple of counters that records, for the part of T2 represented by the component, the number of leaves of each color and the number of ways partial quartet topologies of each kind can be formed from them, including the leaf counts of the subtrees of T2 rooted at the endpoints of the component's external edges.

It is beyond the scope of this paper to describe the details of how this decoration of HDT(T2) is initialized and updated, but the crucial point is that, for a component, the decoration can be computed in constant time from the decorations of its children, so the number of compatible quartets can be read off the root of HDT(T2).

To further decrease the time complexity of our algorithm, we need a crucial lemma, hinted at as exercise 35 in a classic algorithms text: this "extended smaller-half trick" states that the sum, over all nodes, v, in a tree, of |small(v)| * (1 + log(|v| / |small(v)|)) is O(n log n). Our goal is, therefore, to decrease the time spent on each node, v1, in T1 from |small(v1)| log n, which only yields a total of O(n log^2 n), to O(|small(v1)| * (1 + log(|v1| / |small(v1)|))).

The first step in doing this is to use a contraction lemma: if we, for a node, v1, in T1, mark all leaves in the subtree of T1 rooted at v1 as non-contractible and contract T2 according to this marking, then we get a contraction, T2^(v1), with at most 4|v1| - 5 nodes, and thus, we can build a hierarchical decomposition tree, HDT(T2^(v1)), from this contraction with height O(log |v1|) in time O(|v1|). Hence, if we use HDT(T2^(v1)) instead of HDT(T2) when visiting v1 in T1, the time needed for visiting v1, disregarding the time spent on building HDT(T2^(v1)), is now O(|small(v1)| log |v1|). However, this is still too much for the extended smaller-half trick to give us our goal.

To get all the way down to O(|small(v1)| * (1 + log(|v1| / |small(v1)|))) for a node, v1, in T1, we exploit that, since HDT(T2^(v1)) has O(|v1|) nodes, the |small(v1)| leaf-to-root paths in HDT(T2^(v1)) that need to be updated when visiting v1 contain only O(|small(v1)| * (1 + log(|v1| / |small(v1)|))) distinct components. We therefore split the update of HDT(T2^(v1)) after the |small(v1)| color changes in two in the following way: (1) we first mark all internal nodes in HDT(T2^(v1)) on paths from the |small(v1)| type (i) components to the root by marking bottom-up from each of the type (i) components, until we find the first already marked component; (2) we then update all the marked nodes recursively in a postorder traversal starting at the root of HDT(T2^(v1)).

We now use these contracted hierarchical decomposition trees in the traversal, but if we built T2^(v1) from scratch for every inner node, v1, in T1, we would spend O(n^2) time just doing this. To fix this problem, we will only contract T2 whenever a constant fraction of the leaves has been colored with the color, C, and otherwise reuse the current contraction. The resulting algorithm maintains several versions of T2, as well as their associated hierarchical decomposition trees.

Tree coloring algorithm.

The routine, contract(T2), constructs a contraction of T2 in O(|T2|) time, where |T2| is the number of nodes in the current version of T2. Note that T2 is not static, but is a parameter to the recursive procedure. The routine, extract(v1, T2), uses the hierarchical decomposition of T2 to extract the version of T2 that is valid at the point in the algorithm where T1 is colored according to v1. The extracted tree is a copy of T2 in which all leaves in the subtree rooted at small(v1) still have the color, A; it is used as T2 in the recursive call on small(v1) in line 15. The second call to contract is performed when |T2| > 5|large(v1)|, before the recursive call on large(v1) in line 13.

Line 9 takes time linear in the current size of T2, and since |T2| = O(|v1|) when we perform it, the total time used on this line is covered by the extended smaller-half trick. Next, consider the contraction of T2 in line 12. We perform this line whenever |T2| > 5|large(v1)|. Since all leaves in large(v1) are marked non-contractible, the contracted T2 has at most 4|large(v1)| - 5 nodes. Hence, the size of T2 is reduced by a factor of at least 4/5. This implies that the sequence of contractions applied to a hierarchical decomposition results in a sequence of data structures of geometrically decaying sizes. Since a contraction takes time O(|T2|), the total time spent on line 12 is linear in the initial size of T2.
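The geometric-decay argument can be checked with a few lines of arithmetic: if every contraction is triggered only after the structure has shrunk by a constant factor, the contraction costs form a geometric series summing to a constant times the initial size. A small sketch of our own, using the 4/5 factor from the analysis above:

```python
def total_contraction_cost(n, shrink=4 / 5):
    """Sum the costs of repeatedly contracting a structure of initial
    size n, when each contraction costs the current size and shrinks
    the structure by the given constant factor."""
    total, size = 0, n
    while size >= 1:
        total += size            # one contraction costs O(size)
        size = int(size * shrink)
    return total
```

Since the sizes decay geometrically, the total is bounded by n / (1 - 4/5) = 5n, i.e., linear in the initial size.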

The main focus of algorithms for tree comparison has been on binary trees. This is understandable, since most algorithms for constructing trees will always create fully-resolved trees, even when some edges have very little support from the data (but see Buneman trees or refined Buneman trees for exceptions to this).

In our research, we have developed a number of approaches for adapting the algorithms from the previous section to work on general trees, and we review these approaches in this section.

The problems with generalizing the dynamic programming algorithm for binary trees to general trees are two-fold. First, using edges to claim quartets is only meaningful for resolved quartets, and second, the resolved quartets claimed by an edge can be distributed over a large number of subtrees. The first problem can be dealt with by using nodes to claim unresolved quartets, but the second problem is more serious.

We can still compute the tables of the intersections of subtrees, |F ∩ G|: there can now be more than a constant number of subtrees per node, but the total sum of degrees in each tree is bounded by O(n), so the tables can still be computed in time O(n^2). Worse, however, is computing A(e1, e2), where we need to consider all ways of picking subtrees F2 and F3 in front of e1 and pairing them with choices G2 and G3 in front of e2, with a worst-case performance of O(d^4) for each pair of edges, where d is the maximal degree.

In a series of papers, we have developed different algorithms for computing the quartet distance between general trees efficiently by avoiding explicitly having to deal with choosing pairs of trees for inner nodes. Common for these is that we also avoid explicitly handling unresolved quartets, but only consider the resolved topologies and handle the unresolved quartets implicitly.

Christiansen et al. developed the first algorithms aimed specifically at general trees. The idea behind the first algorithm, with running time O(n^3), is to consider all triplets, {i, j, k}, of leaves and, for each, count the shared quartet topologies over all quartets obtained by adding a fourth leaf. There are O(n^3) triplets, and the crux of the algorithm is counting the shared quartet topologies in constant time per triplet.

The approach is as follows: for each triplet, {i, j, k}, there is a unique center node in T1, connected to i, j and k through disjoint paths, and, similarly, a unique center node in T2. Let Ci, Cj and Ck denote the subtrees around the center in T1 containing i, j and k, respectively, and Di, Dj and Dk the corresponding subtrees in T2. A resolved quartet containing i, j and k arises by adding a fourth leaf, l, and its topology is determined by which of the subtrees around the center node l falls in.

Computing quartets between high-degree nodes.

Now, let C-ijk denote the set of leaves in T1, except those in Ci, Cj and Ck, and, similarly, let D-ijk denote the set of leaves in T2, except those in Di, Dj and Dk. The counts needed for the triplet, {i, j, k}, involve intersections such as |C-ijk ∩ D-ijk|. Computing these sets explicitly is too expensive, since, just from the number of indices, (i, j, k), there are O(n^3) of them. Instead, we can compute the sizes by inclusion-exclusion from the precomputed subtree intersection tables:

|C-ijk ∩ D-ijk| = n - |Ci| - |Cj| - |Ck| - |Di| - |Dj| - |Dk| + the sum of |Ca ∩ Db| over a, b in {i, j, k}

where each term on the right-hand side is available in constant time.

To get the running time of O(n^3), we thus only need to find the center nodes in constant time for each triplet, {i, j, k}.

The idea is as follows: for each pair, (i, j), of leaves, we precompute, in total time O(n^2), information about the path connecting i and j; using this, we can iterate through the leaves, k, and find the center node of each triplet, {i, j, k}, in constant time, for a total of O(n^3).

The second algorithm in Christiansen et al. counts the shared and differing quartet topologies between T1 and T2 using two functions, A(e1, e2) and B(e1, e2), that count the number of equal, respectively opposing, resolved quartet topologies claimed by a pair of edges, combined with the corresponding within-tree counts.

Handling all trees hanging off the path from i to j.

Should one want to compute the parameterized quartet distance instead, it can be done a little more cumbersomely, but still using only the A and B counts.

The second algorithm from Christiansen et al. is, like the algorithm from Bryant et al., based on claims: consider edges e1 and e2 in T1 and T2, respectively, and let v1 and v2 denote the destination nodes of the edges. If v1 or v2 are high-degree nodes, then the edges do not correspond to unique claims, since the front subtrees, T2 and T3, are not uniquely defined (see below). Instead, we use extended claims, where each claim consists of an original edge together with an explicit choice of two of the subtrees in front of it; conceptually, this corresponds to expanding each high-degree node into a number of binary nodes.

The algorithm then implements the functions A and B by explicitly iterating through all pairs of extended claims. Each of the original O(n) edges can give rise to several extended claims, depending on the degree, d, of its destination node, and with appropriate bookkeeping, all pairs can be processed in time O(n^2 d^2).

Expanding high-degree nodes.

The A function counts the number of equal topologies from the tables of subtree intersection sizes, |F ∩ G|, just as in the binary algorithm.

The B function, counting the number of resolved quartets that have one topology in T1 and a different topology in T2, is computed from the same tables by pairing the subtrees of the two claims in the ways that force the two trees to disagree.

Because of symmetries, and because each quartet is counted by two claims in each tree, this will over-count by a factor of four, so the number of resolved, but different, quartet topologies between the two trees is the accumulated count divided by four.

Christiansen et al. thus compute the quartet distance in time O(n^2 d^2), where d is the maximal degree in the trees.

Using the same basic idea, but with yet another counting scheme and another set of tables counting sets of shared leaves, Nielsen et al. reduced the running time to O(n^2.688), tied to the complexity of fast matrix multiplication, and gave the first guaranteed sub-cubic time algorithm for computing the quartet distance between general trees. Here, the underlying idea is to reduce an explicit iteration over O(n^2) pairs of claims to a matrix-matrix multiplication as part of the counting iteration. Again, we refer to the original paper for details.
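The algebraic pattern behind such reductions is worth spelling out: a sum over all O(n^2) pairs of products of table entries is an entry (here, a trace) of a matrix product, so a fast matrix multiplication routine can evaluate it in sub-cubic time. The sketch below is our own toy illustration of this identity, not Nielsen et al.'s actual counting scheme, and uses a plain cubic multiplication for clarity:

```python
def matmul(A, B):
    """Plain O(n^3) matrix product; a fast O(n^omega) routine would
    be dropped in here to obtain the sub-cubic bound."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def pairwise_products_naive(A, B):
    """Explicit iteration over all O(n^2) pairs (i, j)."""
    return sum(A[i][j] * B[j][i]
               for i in range(len(A)) for j in range(len(B)))

def pairwise_products_by_trace(A, B):
    """The same sum, obtained as the trace of the product A * B."""
    C = matmul(A, B)
    return sum(C[i][i] for i in range(len(C)))
```

Both functions compute the same number, but the second delegates all the pairwise work to the multiplication routine, which is where the asymptotic gain comes from.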

The coloring approach of Brodal et al. can also be extended to trees of arbitrary degree, as we describe next.

The result of this approach is an algorithm for computing the quartet distance in time O(d^9 n log n) for two trees, T1 and T2, where d = max{d1, d2} for di, the maximal degree of an inner node in Ti. The high exponent, 9, on d stems from the HDT data structure needing counters over combinations of colors and subtree choices, giving O(d^8) time for the update of the decoration of a node based on its children after a color change.

These results have since been improved, in particular with respect to the dependency on the degree.

Johansen and Holt improved the degree dependency to d = min{d1, d2}, which is of significance if only one of the trees has a high degree.

The new HDT definition is based on the four types of components shown below, whose leaves correspond to the nodes and edges of T2. The internal nodes of the HDT are formed by merging adjacent components of T2. During construction of the HDT, transformations are applied to keep its height logarithmic.

The four different types of components.

The two types of transformations applied during the construction of the hierarchical decomposition tree.

Additional changes include the use of anchors, nodes to which the resolved and unresolved triplets and quartets are attributed.

The anchors (white nodes) of resolved and unresolved triplets and quartets. Edges in the figures represent paths in the tree.

The remaining parts follow the lines of the binary algorithm. The procedure for traversing T1 recursively can be seen below. The colors are now numbers: Invariant (1) states that, when the procedure is called on a node, v1, all leaves in the subtree of v1 have the color, one, while all other leaves have the color, zero. To initialize Invariant (1), all leaves are colored one at the start. As in the binary case, the smaller-half trick bounds the total number of color changes.

The main algorithm performing a recursive traversal of T1.

Most of the algorithms we have described in this review paper have been implemented in different software tools, and in this section, we experimentally compare their runtime performance. All experiments compare randomly generated balanced trees. We note that two randomly generated trees are expected to have a large distance, which influences the running time for the coloring algorithms. Similar trees require, overall, less updating in the hierarchical decomposition, so random trees are a worst-case situation for these. Experiments, not shown here, demonstrate that comparing similar trees can be significantly faster.

All experiments in this section were conducted on an Ubuntu Linux Server 12.04, 3.4 GHz 64-bit Intel Core i7-3770 (quad-core) with 32 GB of RAM.

For the quartet distance between binary trees, we performed experiments with three algorithms: the O(n^2) time dynamic programming algorithm, the O(n^2.688) time algorithm from Nielsen et al. and the O(n log n) time coloring algorithm.

The implementation of the O(n^2.688) time algorithm is faster than the implementation of the O(n^2) time algorithm.

Quartet distance running time on binary trees.

For general trees, the O(n^2.688) time algorithm is faster on small trees (where the exact point at which the other algorithm becomes faster depends on the degree).

Quartet distance running time on non-binary balanced trees.

We note, however, that the runtime of the coloring-based algorithm depends on the degrees of the trees and not only on their size.

For the triplet distance between binary trees, we compared the general coloring-based algorithm with the O(n^2) time dynamic programming algorithm.

Triplet distance running time. For the O(n^2) time algorithm, only the smaller input sizes are included.

We have presented a series of algorithmic improvements for computing the triplet and quartet distance between two general trees that we have developed over the last decade. Our work has followed two main approaches: one based on counting shared topologies using tables of the intersections of subtrees and one based on coloring leaves and counting compatible topologies using a hierarchical decomposition data structure. The second approach has resulted in the currently best worst-case running times.

While the theoretically fastest algorithms involve rather complex bookkeeping for counting topologies, we have shown that they can be implemented to be efficient in practice as well, computing the distance between two trees with a million leaves in a few minutes. With more typical phylogenetic tree sizes, with the number of leaves in the hundreds or low thousands, the distance can be computed in less than a second.

T.M. receives funding from The Danish Council for Independent Research, grant no 12-125062.
