To put sequences of different species in perspective and to understand historical evolution, as well as to try to predict future directions of the development of life on Earth, a “phylogeny” [137] (evolutionary tree or network) needs to be constructed. Indeed, some see the reconstruction and interpretation of a species phylogeny as the pinnacle of biological research [138]. A likely evolutionary scenario can be constructed from a multiple alignment, a character-state matrix, or a collection of sub-phylogenies, and methods for this are plentiful [139,140]. In the scope of this survey, however, we focus on recent results for NP-hard problems surrounding phylogenetic trees and networks.
5.1. Preliminaries
An evolutionary (or phylogenetic) network N is a graph whose degree-one nodes (“leaves”) are labeled (by “taxa”). A rooted phylogenetic network N is a rooted acyclic directed graph whose non-root nodes either have in-degree one or out-degree one and whose out-degree-zero nodes (“leaves”) are labeled (by “taxa”). A (rooted) phylogenetic tree is a (rooted) phylogenetic network whose underlying undirected graph is acyclic. In the context of this section, we drop the prefix “phylogenetic” for brevity and sometimes refer to networks as “phylogenies”. Some works consider trees in which each leaf-label may occur more than once. These objects are called
multi-labeled trees (or MUL trees). For a set
of networks, we abbreviate
. An important parameter of networks is their
level, referring to the largest number of reticulations in any biconnected component of the underlying undirected graph (see [141]). The
restriction of
T to
L, denoted by
, is the result of removing all leaves not in
L from
T and repeatedly removing unlabeled leaves and suppressing (that is, contracting any one of its incident edges/arcs) degree-two nodes. A network
N displays a network
T if
T is a topological minor of
N, respecting leaf-labels, that is,
N contains a subdivision of
T as a subgraph. Herein, a directed edge can only be subdivided in accordance with its direction, that is, the subdivision of an arc
creates a new node
w and replaces
by the arcs
and
. This notion can be generalized to the case that
N is the disjoint union of some networks (see
Section 5.2.3). For non-binary networks, there are two different notions of display: the “hard”-version is defined analogously to the binary case, while we say that a network
N “soft”-displays a network
T if some binary resolution of
N displays some binary resolution of
T, where a
binary resolution of a network
N is any binary network that can be turned into
N by contracting edges/arcs. “Soft” and “hard” versions of display are derived from the concept of “soft” and “hard” polytomies, meaning high degree nodes that represent either a lack of knowledge of the correct evolutionary history leading to the children taxa (“soft”) or a large fan-out of species due to high evolutionary pressure (“hard”).
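The restriction operation defined above can be sketched for rooted trees as follows. This is a minimal illustration, not taken from the cited works: the nested-tuple encoding (a leaf is its label as a string, an inner node is a tuple of children) and the function name `restrict` are assumptions made here, and a root left with a single child is suppressed as well.

```python
def restrict(tree, keep):
    """Return the restriction of a rooted tree to the label set `keep`.

    Assumed encoding (hypothetical): a leaf is a string label, an inner
    node is a tuple of subtrees. Returns None if no kept leaf remains.
    """
    if isinstance(tree, str):
        # A leaf survives only if its label is in the kept set.
        return tree if tree in keep else None
    # Restrict all children and drop those that vanished entirely.
    children = [sub for sub in (restrict(child, keep) for child in tree)
                if sub is not None]
    if not children:
        return None            # node became an unlabeled leaf: remove it
    if len(children) == 1:
        return children[0]     # node has a single child left: suppress it
    return tuple(children)


if __name__ == "__main__":
    T = ((("a", "b"), "c"), ("d", "e"))
    print(restrict(T, {"a", "c", "e"}))   # (('a', 'c'), 'e')
```

Working bottom-up means that removing unlabeled leaves and suppressing nodes happen in a single pass, so no repeated clean-up loop is needed.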
Note that many kernelization results in phylogenetics bound only the number of labels in a reduced instance. If the input contains many trees or an intricate network, such kernelization results are more fittingly described as “partial kernels” (see [142]). We thus usually refer to such results as “kernel with … taxa”.
5.2. Combining and Comparing Phylogenies
Problem Description. One approach to reconstructing a phylogeny from the genomes of a set of species is to first cluster the genes into families and reconstruct the phylogeny of each family using multiple alignments, yielding so-called “gene trees”, and then to combine these trees into a tree representing the evolutionary history of the set of species, called the “species tree” (see also
Section 5.3 for more on the divergence between gene trees and species trees). In general, given trees
each on the taxon-set
, we want to “amalgamate” the trees into a single tree
T, which, since it agrees with and contains all the
, is called an
agreement supertree. This problem is known as Tree Consistency (TCY) and, sometimes, as Tree Compatibility.
For surveys about the combination of phylogenies (“consensus methods”), we refer to Bryant [143] and Degnan [144].
Results. For unrooted trees, TCY can be solved in polynomial time if all
share a common taxon [
145] but is NP-complete in general, even if all input trees contain four taxa [
145] (such a “quartet” is the smallest meaningful unrooted tree since unrooted trees with at most three taxa do not carry any phylogenetic information). This restricted problem is also known as
Quartet Inconsistency. TCY can be solved in polynomial time for two unrooted (non-binary) trees [
146]. More generally, using powerful meta-theorems [
147,
148] (problems formulatable in “Monadic Second Order Logic” (MSOL) are FPT for the treewidth of the input structure), TCY is fixed-parameter tractable for the treewidth of the display graph [
149,
150] (that is, the result of identifying all leaves of the same label in the disjoint union of the input trees), which is smaller than the number
t of trees. Baste et al. [
151] improved the impractical running time resulting from the application of the meta-theorems, showing an
-time algorithm.
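The display graph mentioned above is simple to construct. The following sketch assumes a hypothetical input format in which each tree is a list of edges, leaves are represented by their string labels, and inner nodes by integers that are only unique within their own tree.

```python
def display_graph(trees):
    """Build the display graph of several leaf-labeled trees.

    Assumed input (hypothetical format): each tree is a list of edges
    (u, v); string endpoints are leaf labels, integer endpoints are inner
    nodes local to that tree. Returns an adjacency dict.
    """
    adjacency = {}
    for index, edges in enumerate(trees):
        for u, v in edges:
            # Inner nodes are tagged with the tree index (disjoint union);
            # leaves keep their bare label, so equal labels are identified.
            a = u if isinstance(u, str) else (index, u)
            b = v if isinstance(v, str) else (index, v)
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


if __name__ == "__main__":
    # Two quartets on the taxa a, b, c, d: ab|cd and ac|bd.
    q1 = [(0, "a"), (0, "b"), (0, 1), (1, "c"), (1, "d")]
    q2 = [(0, "a"), (0, "c"), (0, 1), (1, "b"), (1, "d")]
    graph = display_graph([q1, q2])
    print(len(graph), "nodes")
```

Computing the treewidth of the resulting graph is a separate task and is not attempted here.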
For rooted trees,
Tree Consistency can be solved in polynomial time [
152,
153] (even for non-binary trees) but, due to noisy data and more complicated evolutionary processes, practically relevant instances are not expected to have an agreement supertree [
154,
155]. Thus, variants of the problem arose that ask for the smallest amount of modification to the input such that an agreement supertree exists. The most prominent modification types are removing trees (Rooted Triplet Inconsistency, RTI), removing taxa (Maximum Agreement Supertree, MASP), and removing edges (Maximum Agreement Forest, MAF), which we discuss in the following.
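To illustrate why the rooted case is polynomial-time solvable, the following is a minimal sketch of the classical BUILD procedure commonly attributed to Aho et al., restricted to inputs given as rooted triplets. Whether the cited works use exactly this formulation is not claimed here. The encoding of a triplet ab|c as the tuple `(a, b, c)` and the nested-tuple output are assumptions of the sketch.

```python
def build(leaves, triplets):
    """Return a rooted tree displaying all triplets, or None if none exists.

    Assumed encoding (hypothetical): a triplet (a, b, c) stands for ab|c,
    i.e., a and b are closer to each other than either is to c.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Only triplets whose three leaves are all present are relevant.
    relevant = [t for t in triplets if set(t) <= leaves]
    # Union-find over the leaves: a and b must stay together for ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)
    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None   # no valid split at the root: triplets are inconsistent
    subtrees = []
    for component in components.values():
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


if __name__ == "__main__":
    print(build("abc", [("a", "b", "c")]) is not None)                   # True
    print(build("abc", [("a", "b", "c"), ("a", "c", "b")]) is not None)  # False
```

Each recursion level splits the leaf set along connected components, so the recursion depth is bounded by the number of leaves.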
5.2.1. Consensus by Removing Trees
When reconstructing a species tree from gene trees, we may hope that the gene trees of most of the sampled gene families actually agree with the species phylogeny and that only a few families are outliers that developed differently. In this case, we can expect to recover the true phylogeny by removing a small number of gene trees.
Results. Rooted Triplet Inconsistency is NP-hard [
156,
157,
158], even on “dense” triplet sets [
159] (a triplet set
is called
dense or
complete if for each leaf-triple
,
contains exactly one of
,
and
). While the general problem is W[2]-hard for
k [
159], the dense version admits parameterized algorithms. Indeed, Guillemot and Mnich [
160] showed parameterized algorithms running in
and in
time, as well as an
-time computable, sunflower-based kernel containing
taxa (see [
161] for details on the sunflower kernelization technique). Their result has recently been improved to linear size by Paul et al. [
162].
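The density condition on triplet sets can be checked directly. The sketch below assumes the usual reading of the definition, namely that for every three leaves exactly one of the three possible rooted triplets is present. The tuple encoding `(a, b, c)` for the triplet ab|c is an assumption of this example.

```python
from itertools import combinations


def is_dense(leaves, triplets):
    """Check whether a rooted triplet set is dense on the given leaves.

    Assumed encoding (hypothetical): (a, b, c) stands for the triplet ab|c.
    The set is dense if every leaf triple is resolved by exactly one triplet.
    """
    resolutions = {}
    for a, b, c in triplets:
        triple = frozenset((a, b, c))
        # ab|c and ba|c are the same triplet, so store the cherry as a set.
        resolutions.setdefault(triple, set()).add((frozenset((a, b)), c))
    return all(len(resolutions.get(frozenset(triple), ())) == 1
               for triple in combinations(sorted(leaves), 3))


if __name__ == "__main__":
    dense = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("b", "c", "d")]
    print(is_dense("abcd", dense))        # True
    print(is_dense("abcd", dense[:3]))    # False
```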
Notes. Generalizing RTI to ask for a level-
ℓ network displaying the input trees yields a somewhat harder problem, which can be solved for dense inputs in
time [
163] and in
time for a particular class of networks [
164].
The unrooted-tree version of dense RTI (where the input consists of quartets) is known to be solvable in
time [
165].
5.2.2. Consensus by Removing Taxa
In many sciences, the most interesting knowledge can be gained by looking more closely at the non-conforming data points. In this spirit, biologists are particularly interested in taxa causing non-compatibility, that is, whose removal allows for an agreement supertree. In the spirit of parsimony, we are thus tempted to ask for a smallest number of taxa to remove from the input trees such that an agreement supertree exists.
Results. While MASP can be solved in
time for two rooted trees [
166,
167,
168] (
n denoting the total number of labels in the input), it is NP-hard for
, even if all
are triplets [
166,
169], and the NP-hardness persists for fixed
t [
166] (but large trees). Guillemot and Berry [
170] showed that, on dense, binary, rooted inputs, MASP can be solved in
and
time by reduction to 4-
Hitting Set. They further improved an
-time algorithm of Jansson et al. [
166] for binary
to
time [
170], which was subsequently improved to
time by Hoang and Sung [
171]. The latter also gave an
-time algorithm for general rooted inputs (
denoting the maximum out-degree among the input trees). MASP is W[2]-hard for
k, even if the input consists of rooted triplets [
169], and W[1]-complete in the rooted case for the dual parameter
, even if we add
t to it [
170]. On the positive side, the problem can be solved in
time for binary trees [
170] which has been generalized to arbitrary trees by Fernández-Baca et al. [
172]. For completeness, we want to point out that many of the results for MASP also hold for MASP’s sister problem
Maximum Compatible Supertree (MCSP), in which equality with the restricted agreement supertree
is relaxed to being a contraction of
(with the notable exception that MCSP can be solved in
time in both the rooted and unrooted case [
171]).
Notes. The special case of MASP in which
is called
Maximum Agreement Subtree (MAST) and has been studied extensively. While still NP-hard for
non-binary trees [
173], MAST can be solved in
[
157,
174], in which time we can also compute a “kernel agreement subtree”, denoting the intersection of all leaf-sets of all optimal maximum agreement subtrees [
175]. MAST is fixed-parameter tractable for
k with parameterized algorithms running in time
[
176,
177,
178] (by reduction to 3-
Hitting Set).
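The reduction to 3-Hitting Set suggests a simple bounded search tree: pick a conflicting triple that is not yet hit and branch on which of its elements to remove. The following generic sketch is only an illustration of that branching idea; it is not one of the cited algorithms and makes no attempt to match their running times. How the conflicting triples are computed from the input trees is left out, and the function name is an assumption.

```python
def hitting_set(sets, k, chosen=frozenset()):
    """Return a set of at most k elements hitting every set, or None.

    Plain bounded search tree: with sets of size at most 3, the search
    tree has depth at most k and branching factor at most 3.
    """
    # Find some set that is not yet hit by the chosen elements.
    unhit = next((s for s in sets if not (set(s) & chosen)), None)
    if unhit is None:
        return set(chosen)        # everything is hit
    if k == 0:
        return None               # budget exhausted on this branch
    for element in unhit:
        solution = hitting_set(sets, k - 1, chosen | {element})
        if solution is not None:
            return solution
    return None


if __name__ == "__main__":
    conflicts = [{"a", "b", "c"}, {"c", "d", "e"}, {"a", "d", "f"}]
    print(hitting_set(conflicts, 1))                 # None
    print(hitting_set(conflicts, 2) is not None)     # True
```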
More fine-grained versions of MAST that allow removal of different taxa from each
were introduced by Chauve et al. [
179]. In
Agreement Subtree by Leaf-Removal (AST-LR), the objective is to minimize
the total number q of removed leaves and, in AST-LR-
d, the objective is to minimize
the maximum number d of leaves that have to be removed from any of the trees. Both versions are NP-hard [
179] but can be solved in
(AST-LR) and
time for some constant
c (AST-LR-
d) [
179,
180].
Lafond et al. [
181] considered MAST for multi-labeled trees showing that it remains NP-hard and can be solved in
time.
Finally, Choy et al. [141] showed that a “maximum agreement subnetwork” for two binary networks of level
and
, respectively, can be computed in
time.
5.2.3. Consensus by Removing Edges—Agreement Forests and Tree Distances
Important biological phenomena governing the discordance of gene trees are non-tree-like processes such as hybridization and horizontal gene transfer (HGT) (see also
Section 5.3). If a branch in a gene tree corresponds to a horizontal transfer, then we expect that deleting this branch results in a forest, which is in agreement with the other gene trees. This gives rise to the idea of “agreement forests”, resulting from the deletion of branches in the input phylogenies.
Maximum agreement forests come in three major flavors:
unrooted maximum agreement forests (uMAFs),
rooted maximum agreement forests (rMAFs), and maximum
acyclic agreement forests (MAAFs). Herein, “acyclic” makes reference to the constraint that the “inheritance graph is acyclic” (see
Figure 9). Formally, the
inheritance graph of an agreement forest
F for two trees
and
has the trees of
F as nodes and an arc
if and only if the root of
u is an ancestor of the root of
v in
or in
. Demanding acyclicity of this graph forbids, for example, that a tree
u of
F is “above” another tree
v in
but “below”
v in
. This definition generalizes straightforwardly to more than two trees
. In the following, the
size of an agreement forest
F is the number of trees in
F and it is equal to the number of branches to remove in each input tree to form (a subdivision of)
F and
F is called
maximum if it minimizes this number. For surveys about tree distances and agreement forests, we refer to Shi et al. [
182] and Whidden [
183].
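The acyclicity condition on the inheritance graph can be made concrete with a short sketch. It assumes a hypothetical input format: each input tree is given as a child-to-parent dictionary, and for each tree a dictionary records at which node each component of the forest is rooted. Distinct components are assumed to be rooted at distinct nodes, which holds when they are embedded as node-disjoint subtrees.

```python
def ancestors(parent, node):
    """Yield all proper ancestors of `node` (tree given as child -> parent)."""
    while node in parent:
        node = parent[node]
        yield node


def inheritance_graph(parents, roots):
    """Return the arcs (u, v) of the inheritance graph.

    parents: one child -> parent dict per input tree (assumed format).
    roots:   one dict per input tree, mapping each forest component to the
             node at which it is rooted in that tree (assumed format).
    """
    arcs = set()
    for parent, root_of in zip(parents, roots):
        component_at = {node: comp for comp, node in root_of.items()}
        for comp, node in root_of.items():
            for anc in ancestors(parent, node):
                if anc in component_at:
                    arcs.add((component_at[anc], comp))
    return arcs


def is_acyclic(nodes, arcs):
    """Kahn-style check: repeatedly remove nodes without incoming arcs."""
    indegree = {n: 0 for n in nodes}
    for _, v in arcs:
        indegree[v] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for a, b in arcs:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return removed == len(nodes)


if __name__ == "__main__":
    # Component A is above B in the first tree but below B in the second.
    parents = [{"q": "p"}, {"s": "t"}]
    roots = [{"A": "p", "B": "q"}, {"A": "s", "B": "t"}]
    arcs = inheritance_graph(parents, roots)
    print(is_acyclic(["A", "B"], arcs))   # False
```

The example reproduces the forbidden situation described above, in which one component lies above another in the first tree and below it in the second.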
Results. Evidently, results heavily depend on the type of agreement forest we are looking for. Interestingly, each of the three versions corresponds to a known and well-studied distance measure between trees and we thus also include results stated for the corresponding distance-measure.
- Unrooted Agreement Forest.
The size of a uMAF of two binary trees
and
is exactly equal to the minimum number of “TBR moves” necessary to turn
into
[
184,
185] (and vice versa; indeed, this defines a metric and it is called the “TBR distance” between
and
). Herein, a
TBR (tree bisection and reconnection)
move consists of removing an edge
from a tree (“bisecting” the tree) and inserting a new edge between any two edges of the resulting subtrees (“reconnecting” the trees), that is, subdividing an edge in each of the subtrees and adding a new edge between the two new nodes.
For two trees, deciding uMAF is NP-hard [
184], but fixed-parameter tractable in
k. More precisely, the problem can be solved in
time [
185],
time [
186], and
time [
187]. These results make use of the known kernelizations with
[
185,
188] and
taxa [
189]. For
binary trees, Shi et al. [
190] presented an
-time algorithm. Chen et al. [
191] considered the uMAF problem on multifurcating trees, showing that it still corresponds to the TBR problem and can be solved in
time.
- Rooted Agreement Forest.
The size of an rMAF of two rooted binary trees
and
is exactly one more than the minimum number of “rSPR moves” necessary to turn
into
[
192,
193] (and vice versa; indeed, this defines a metric and it is called the “rSPR distance” between
and
). Herein, an
rSPR (rooted subtree prune and regraft)
move consists of removing (“pruning”) an arc
from a tree and “regrafting” it onto another arc
, that is, subdividing
with a new node
z and inserting the arc
.
The problem is known to be NP-hard and algorithms parameterized by
k have been extensively studied and improved. An initial
-time algorithm [
187,
194] was improved to
time [
187],
time [
195],
time [
196], and the current best
-time algorithm by Whidden [
183]. In contrast, a kernel with
taxa [
193] has stood since 2005. For
trees, rMAF can be decided in
time [
197] and
time [
190].
Collins [
198] showed that using “soft”-display, rMAFs still correspond to computing the rSPR distance between two
multifurcating trees. This problem can be solved in
time [
199] and in
time [
183,
200] and admits a kernel with
taxa [
198,
201]. For
trees, the multifurcating rMAF problem is solvable in
time [
202]. Notably, Shi et al. [
202] also considered the “hard” version of the problem and presented an
-time algorithm for it.
- Acyclic Agreement Forest.
The size of a MAAF of two rooted binary trees
and
is exactly one more than the minimum number of reticulations found in any phylogenetic network displaying both
and
[
192] and this relation holds also if
and
are
non-binary [
203].
Deciding this number is known as the
Hybridization Number (HN) problem and it has been shown to be NP-hard by Bordewich and Semple [
204]. The problem can be solved in
time by crawling a bounded search-tree [
205]. In 2009, Whidden and Zeh claimed an
-time algorithm, which they later retracted and replaced by an
-time algorithm [
183,
195]. For
binary trees, HN can be decided in
time, where
c is an “astronomical constant” [
206].
Concerning preprocessing, a kernel with at most
taxa is known [
188,
201] and this kernelization result has been generalized to the case of deciding HN for
binary trees (in which case HN and MAAF no longer coincide) by van Iersel and Linz [
207], showing a kernel with
taxa for this case, which has again been generalized to
non-binary trees by van Iersel et al. [
208], showing a kernel with at most
(and at most
) taxa [
208]. For MAAF with
non-binary trees, Linz and Semple [
203] showed a linear bikernel (that is, a kernelization into a different problem, see [
209]) with
taxa, which implies a quadratic-size classical kernel. For this setting, algorithms running in
[
210] and
[
211] time are also known.
Notes. Any algorithm for binary uMAF, rMAF, and MAAF with running time
can be turned into an algorithm running in
, where
ℓ is the level of any binary network displaying the two input trees [
212]. Furthermore, all three agreement forest variants are fixed-parameter tractable for the treewidth of the display graph of the input trees [
213] (see corresponding results for unrooted
Tree Consistency [
149,
150]). The rSPR distance has been generalized to a distance measure for phylogenetic networks called SNPR and its computation is fixed-parameter tractable [
214] parameterized by the distance. Variations of the discussed distance measures include: (1) the uSPR distance, which does not have an agreement-forest formulation, is NP-hard to decide [
215], admits a kernel with
taxa [
216] (in a preprint, Whidden and Matsen [
217] claimed an improvement to
taxa), and can be calculated in
time [
218] (using the mentioned preprint-kernel); (2) its close sibling, the replug distance, which admits a formulation as “maximum endpoint agreement forest”, is conjectured to be NP-hard to decide but admits an
-time algorithm [
218]; (3) the “temporal hybridization number”, denoting the smallest amount of reticulation required to explain trees with a temporal network, which was shown to be NP-hard for two trees [
219] but admits an
-time algorithm [
220]; and (4) the parsimony distance, which is NP-hard [
221,
222] but can be solved in
time [
223], where
is the TBR distance between the input trees.
Open Questions. Consensus methods in phylogenetics can profit from a wide range of parameters describing the particularities of likely sets of inputs. While we would indeed expect that the consensus has a small distance to the input trees, a lot depends on how we choose to measure this distance. More general distance measures make for stronger parameters and, while the Hybridization Number problem can be solved in single-exponential time for the “standard parameter” k, it would be interesting to parameterize by a stronger parameter, such as the rSPR radius in which the input trees lie.
Inspired by the groundbreaking results of Bryant and Lagergren [
149], research into the display graph and its treewidth has been conducted with some success [
150,
224]. However, we have yet to design concrete, practical algorithms for consensus problems parameterized by the treewidth of the display graph. As this would potentially yield very fast, practical algorithms, we suspect that this would be a fruitful topic in the coming years. Another interesting parameter is the “book thickness” of the display graph, that is, the minimum number of edge-colors needed to color the display graph such that each color class permits an outerplanar drawing. For obvious reasons, this parameter is smaller than the number
t of input trees. Can the results for
t be strengthened to work with the book thickness instead?
5.3. Reconciliation
Problem Description. In practice, trees depicting the evolutionary history of families of genes sampled from a set of species do not agree with the evolutionary history of the species themselves; hybridization, horizontal gene transfer, and incomplete lineage sorting are only a few of the known causes for such discrepancies. In theory, even gene duplication and gene loss are enough to explain gene trees differing arbitrarily from the corresponding species tree. To better understand how a family of genes developed in the genome of a concurrently developing set of species, we can compute an “embedding” of the gene tree nodes to the edges of the species phylogeny called a
reconciliation (see
Figure 10). Reconciliations also allow drawing conclusions when comparing phylogenies of co-evolving species such as hosts/parasites or flowers/pollinators. More formally, a
DL-reconciliation of a (gene-)phylogeny G with a (species-)phylogeny S is a pair consisting of a subdivision of G and a mapping of gene-tree nodes to nodes of S such that:
- (a) for all arcs uv of G, either u and v have the same image under the mapping (in which case u is called “duplication”) or the image of v is a child of the image of u (in which case u is called “speciation”);
- (b) for arcs uv and uw in G, the image of v equals the image of u if and only if the image of w does (that is, no node of G can be a speciation and a duplication at the same time); and
- (c) if u is a leaf in G, then the image of u is a leaf labeled with the contemporary species that u was sampled in.
We can then define the number of losses in as the sum over all speciations u of the outdegree of in S minus the outdegree of u in . If horizontal transfers are allowed, Condition (a) is replaced by
- (a’) for all arcs uv of G, either u and v have the same image under the mapping (in which case u is called “duplication”), or the image of v is a child of the image of u (in which case u is called “speciation”), or the image of v is incomparable to the image of u (in which case u is called a “transfer” and uv is called a “transfer arc”),
and Condition (b) is restricted to non-transfer arcs, and each transfer with out-degree one causes an additional loss. In this case, we call r a DTL-reconciliation. Since, in reality, transfers occur only between species existing at the same time, Condition (a’) introduces further restrictions. In particular, a reconciliation r is called time-consistent if G can be “dated”, that is, there is a mapping such that, for all arcs of G, we have 1. ; and 2. if and only if is a transfer arc of G under r.
A DTL-reconciliation may be time-inconsistent if, for example, there are transfer arcs uv and xy such that u and x are ancestors of y and v, respectively, in G.
Now, the parsimony principle is used to define an optimization criterion. To this end, each evolutionary event is assigned a cost reflecting how unlikely the event is. Due to the biological setup, it is usually assumed that speciations have cost zero.
We specify the allowed type of embedding by prepending “DL”, “DTL”, etc. to the problem name. The study of the formal reconciliation problem was initiated by Ma et al. [
225] and Bonizzoni et al. [
226] and is surveyed in [
227,
228,
229,
230].
Results. Optimal binary DL-reconciliations can be computed—independently of the costs—by the
LCA-mapping, which maps each node of
G to the LCA of the nodes that its children are mapped to [
231]. Thus, this problem can be solved in linear time [
232,
233] using
-time LCA queries [
234,
235]. The non-binary variant, while not quite as straightforward, can still be solved in polynomial time [
236,
237].
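The LCA-mapping for binary gene trees can be sketched in a few lines. This illustration uses a naive LCA computation by walking up the species tree rather than the constant-time LCA queries mentioned above, so it does not achieve linear time. The encodings are assumptions of the example: the species tree is a child-to-parent dictionary, the gene tree is a nested pair structure with gene names at the leaves, and `species_of` maps each gene to the species it was sampled in. A gene-tree node is counted as a duplication when it is mapped to the same species node as one of its children.

```python
def lca(parent, x, y):
    """Lowest common ancestor in a tree given as child -> parent (naive)."""
    seen = {x}
    while x in parent:
        x = parent[x]
        seen.add(x)
    while y not in seen:
        y = parent[y]
    return y


def reconcile(gene, species_of, parent):
    """Return (image of the root of `gene`, number of duplications).

    Assumed encoding (hypothetical): `gene` is a gene name (leaf) or a
    pair of gene subtrees; `parent` describes the species tree.
    """
    if isinstance(gene, str):
        return species_of[gene], 0
    left, right = gene
    image_left, dups_left = reconcile(left, species_of, parent)
    image_right, dups_right = reconcile(right, species_of, parent)
    image = lca(parent, image_left, image_right)
    # Mapped to the same node as a child: interpreted as a duplication.
    is_duplication = image in (image_left, image_right)
    return image, dups_left + dups_right + int(is_duplication)


if __name__ == "__main__":
    # Species tree ((A, B), C) with inner nodes "AB" and "root".
    species_parent = {"A": "AB", "B": "AB", "AB": "root", "C": "root"}
    species_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
    gene_tree = (("a1", "b1"), ("a2", "c1"))
    print(reconcile(gene_tree, species_of, species_parent))   # ('root', 1)
```

Losses are not counted in this sketch; adding them would require comparing the images of parent and child along each gene-tree arc.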
The complexity of computing DTL-reconciliations depends heavily on their time-consistency. If we allow producing time-inconsistent reconciliations or the given species tree already comes with a dating function, then optimal DTL-reconciliations can be computed in polynomial time [
238,
239,
240]. In general, however, the problem is NP-hard [
238,
241], but can be solved in
time [
238]. Hallett and Lagergren [
242] showed that computing DTL-reconciliations with at most
speciations mapped to any one node in the species tree is FPT in
.
Duplication, transfer, and loss are not the only evolutionary events shaping a gene tree. Hasić and Tannier [
243,
244] recently introduced “replacing transfers” (T
R) and “gene conversion” (C) which model important evolutionary events and showed that DLC-reconciliations can be decided in polynomial time [
243] while deciding T
R-reconciliations is NP-hard and FPT for
k [
244]. Finally, the concept of “incomplete lineage sorting” (ILS) is an important factor influencing discrepancies between gene and species phylogenies, especially when speciation occurs in rapid succession [
245]. Roughly, ILS refers to the possibility that an earlier duplication or transfer does not pervade a population at the time a speciation occurs. Thus, one branch of a speciation may carry a gene lineage while the other does not (see
Figure 11 for an illustration). In DL-reconciliation, this scenario requires a loss, but this would not reflect the true evolutionary history. Unfortunately, no particular mathematical model of ILS is widely accepted, so the following results might be incomparable. In 2017, Bork et al. [
246] showed that incorporating ILS into DL-reconciliation makes the problem NP-hard, even for dated species trees. Furthermore, DTL-reconciliation for dated, non-binary species trees allows ILS to be computed in
time [
247,
248].
To and Scornavacca [
249] started looking into the problem of reconciling rooted gene trees with a rooted species network, showing that, for the DL model, this problem is NP-hard, but solvable in
time.
Open Questions. There are three major challenges for bioinformatics concerning reconciliation. The first and more obvious task is to include all known genomic players in the reconciliation game, meaning to establish a standard model incorporating duplication, transfer, loss, replacing transfers, conversion, and incomplete lineage sorting. While Hasić and Tannier [
243,
244] made good progress towards this goal, their model seems too clunky and lacks ILS support. The second challenge is to remove the need to provide
,
, etc. in the input for the
Reconciliation problem. In practice, some biologists using implementations of algorithms for reconciliations just “play around” with these numbers until the results roughly fit their expectations, which is understandable since nobody knows the correct values. Indeed, in all likelihood, there are no “correct values” because the underlying assumption that the rates of genetic modification are constant throughout a phylogeny is invalid [
250]. A more realistic approach might define expected frequencies of events for each branch and combine them with the length of this branch in order to dynamically price duplication, transfer, loss, etc. in this branch.
5.4. Miscellaneous
Given a phylogeny
T, the
parsimony score of
T with respect to a labeling
c of its nodes is the number of arcs in
T whose endpoints have different labels under
c. In the
Small Parsimony problem, we are given a phylogeny
T and a leaf-labeling
and have to extend
to a labeling of all nodes of
T such as to minimize the parsimony score. If
T is a network, aside from the above definition (“hardwired”), the “softwired” version exists, asking for the minimum parsimony score of any tree
(on the same leaf-set as
T) displayed by
T. While
Small Parsimony is polynomial for trees [
251], the problem is NP-hard in the softwired case, even for binary
T and
, as well as in the hardwired case, unless
is binary [
252,
253]. Fischer et al. [
253] also showed that hardwired
Small Parsimony is FPT for the solution parsimony score and softwired
Small Parsimony is FPT for the level of the input network.
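For rooted binary trees and a single character, a classical polynomial-time approach to Small Parsimony is the bottom-up procedure commonly known as Fitch's algorithm. The sketch below is an illustration under assumptions made here: the tree is a nested pair structure with leaf names as strings, and `label` maps each leaf to its state. It returns only the optimal score, not an optimal labeling of the inner nodes.

```python
def fitch(tree, label):
    """Return (candidate states at the root, minimum parsimony score).

    Assumed encoding (hypothetical): `tree` is a leaf name or a pair of
    subtrees; `label` maps leaf names to character states.
    """
    if isinstance(tree, str):
        return {label[tree]}, 0
    left, right = tree
    states_left, score_left = fitch(left, label)
    states_right, score_right = fitch(right, label)
    common = states_left & states_right
    if common:
        # The children can agree on a state: no extra change is needed.
        return common, score_left + score_right
    # Otherwise one change is unavoidable on an arc below this node.
    return states_left | states_right, score_left + score_right + 1


if __name__ == "__main__":
    tree = (("a", "b"), ("c", "d"))
    print(fitch(tree, {"a": 0, "b": 1, "c": 0, "d": 0})[1])   # 1
    print(fitch(tree, {"a": 0, "b": 1, "c": 1, "d": 0})[1])   # 2
```

An optimal labeling can be recovered by a second, top-down pass that picks a state from each candidate set, which is omitted here for brevity.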
The problem of deciding whether a phylogeny
T is displayed by another phylogeny
N is called
Tree Containment and it is NP-hard, even if
T is a tree [
254].
Tree Containment has polynomial-time algorithms for many special cases of
N [
254,
255,
256,
257,
258,
259,
260,
261] and can be solved in
time [
262] (where
ℓ denotes the level of
N; the authors also showed an algorithm for the related
Cluster Containment problem) and in
time [
261], where
t is the number of “invisible tree components” (that is, the number of tree-nodes
u whose parent
v is a reticulation that is not “visible” in
N (that is, for each leaf
a, there is a root-
a-path avoiding
v)).
Tree Containment stays NP-hard even if the arcs of both
T and
N are annotated with “branch lengths”, but admits an
-time algorithm in this case [
263].
Recent research into the problem of rooting an unrooted network was conducted by Huber et al. [
264], showing that orienting an undirected binary network as a directed network of a certain class is FPT for the level
ℓ for some classes of
N.