Arc-Completion of 2-Colored Best Match Graphs to Binary-Explainable Best Match Graphs

Best match graphs (BMGs) are vertex-colored digraphs that naturally arise in mathematical phylogenetics to formalize the notion of evolutionary closest genes w.r.t. an a priori unknown phylogenetic tree. BMGs are explained by unique least resolved trees. We prove that the property of a rooted, leaf-colored tree to be least resolved for some BMG is preserved by the contraction of inner edges. For the special case of two-colored BMGs, this leads to a characterization of the least resolved trees (LRTs) of binary-explainable trees and a simple, polynomial-time algorithm for the minimum cardinality completion of the arc set of a BMG to reach a BMG that can be explained by a binary tree.


Introduction
Best match graphs (BMGs) are vertex-colored digraphs that appear in mathematical phylogenetics as a representation of a gene's evolutionary closest relatives in another species [6]. That is, given a rooted tree T, a vertex (gene) x in the BMG G(T, σ) is colored by the species σ(x) in which it resides, and there is an arc (x, y) if there is no other gene y′ in species σ(y′) = σ(y) ≠ σ(x) with a later last common ancestor than the last common ancestor lca_T(x, y) of x and y in T. Although rooted trees are crucial for the definition of BMGs, they are unknown in practice, and we are often left only with estimates of their BMGs. In general, there are multiple trees that "explain" the same BMG. There is, however, a unique least resolved tree (LRT) for each BMG, which can be obtained from T by contracting certain edges [6]. The LRTs will play a central role in this contribution. The subgraph of a BMG induced by the vertices of some subset of colors is again a BMG. Every BMG can therefore be viewed as the disjoint union of (the arc sets of) 2-colored BMGs. These 2-BMGs [6,10,11] are bipartite and form a common subclass of the sink-free digraphs [1,2] and the bi-transitive digraphs [3].
Estimates of graphs from real-life data tend to be affected by noise and thus typically will violate the defining properties of the desired graph class. The solution of a corresponding graph modification problem [14] can therefore be employed as a means of noise reduction, see e.g. [8].
The arc modification problems (deletion, completion, and editing) for BMGs are NP-complete in general [18], and remain hard even for the special case of 2 colors.
Phylogenetic trees are often considered to be binary in theory. Most polytomies are therefore considered a limitation of the available data or method of tree reconstruction [4,13] rather than a biological reality [9,19]. In the setting of BMGs, this distinction is important because not all BMGs can be derived from binary gene trees. Instead, binary-explainable BMGs (beBMGs) form a proper subclass [15] that is distinguished by a single forbidden induced subgraph, the hourglass, from other BMGs [16]. The arc modification problems for beBMGs are NP-complete [15,18] as well.
In the context of correcting empirical best match data, it is natural to ask whether the problem of modifying a BMG to a beBMG is as difficult as the general case. It is, in fact, not unusual that graph modification problems that are hard in general become easy when the input is confined to a (usually restrictive) class of graphs, see e.g. [5,12]. Here we show that the problem of completing a 2-colored BMG to a beBMG can indeed be solved in polynomial time.
To prove this result we make use of the fact that every BMG is associated with a unique least resolved tree (LRT). Thm. 1 shows that the property of being the LRT for some BMG is preserved under contraction of inner edges. This observation leads to the explicit construction of a "collapsed tree" from the LRT of the input BMG (G, σ) which not only is the LRT of a 2-colored beBMG but also minimizes the number of arcs that need to be inserted to obtain a beBMG from (G, σ). The construction does not generalize to more than 2 colors.

Notation
We consider simple directed graphs (digraphs) and rooted (undirected) trees T with root ρ. Correspondingly, we write (x, y) for directed arcs from x to y, and xy for undirected tree edges. Given a tree T, we write V(T) and E(T) for its set of vertices and edges, resp., L(T) for the set of leaves, and V^0(T) = V(T) \ L(T) for the set of inner vertices.
A vertex coloring of a graph is a map σ : V → S, where S is a non-empty set of colors. A vertex coloring of G is proper if σ(x) ≠ σ(y) for all (x, y) ∈ E(G). We will also consider leaf-colorings σ : L(T) → S for trees T, and denote by (G, σ) and (T, σ) vertex-colored graphs and leaf-colored trees, respectively.
Given a rooted tree T, we write x ⪯_T y if y is an ancestor of x, i.e., if y lies along the unique path from ρ to x in T. We write x ≺_T y if x ⪯_T y and x ≠ y. The relation ⪯_T is a partial order on V(T). If xy ∈ E(T) and x ≺_T y, then y is the unique parent of x, denoted by par_T(x), and x is a child of y. The set of children of a vertex u ∈ V(T) is denoted by child_T(u). A rooted tree T is phylogenetic if every inner vertex x ∈ V^0(T) has at least two children. All trees in this contribution are assumed to be phylogenetic. Furthermore, we write T(u) for the subtree rooted in u, i.e., V(T(u)) = {y ∈ V(T) | y ⪯_T u}. The last common ancestor of a non-empty subset A ⊆ V(T) is the unique ⪯_T-minimal vertex of T that is an ancestor of every u ∈ A. For convenience, we write lca(x, y, ...) instead of lca({x, y, ...}).
A triple xy|z is a rooted tree with the three leaves x, y, and z such that lca(x, y) ≺ lca(x, y, z). If e ∈ E(T), we denote by T_e the tree obtained by contracting the edge e. We will only be interested in contractions of inner edges, i.e., those that preserve the leaf set. We say that T displays a tree T′, in symbols T′ ≤ T, if T′ can be obtained from T as the minimal subtree of T that connects all elements in L(T′) with root lca_T(L(T′)) and by suppressing all inner vertices that only have one child left.

Best Match Graphs, Least Resolved Trees, and Binary-Explainable BMGs
In this section, we first summarize some properties of best match graphs and their least resolved trees. We then show that the contraction of inner edges in least resolved trees always leads to least resolved trees. Furthermore, we recall some properties of binary-explainable best match graphs that will be needed later.
Definition 1. Let (T, σ) be a leaf-colored tree. A leaf y ∈ L(T) is a best match of the leaf x ∈ L(T) if σ(x) ≠ σ(y) and lca(x, y) ⪯_T lca(x, y′) holds for all leaves y′ of color σ(y′) = σ(y).
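The definition can be evaluated directly on a small tree. The following Python sketch is illustrative only: the dictionary-based tree encoding, the function names `leaves_below` and `bmg`, and the leaf names are assumptions made for this example and are not taken from the source. For every leaf x it climbs towards the root and records, for each other color, the leaves whose last common ancestor with x is lowest.

```python
def leaves_below(children, v):
    """Leaf set L(T(v)); a vertex without an entry in `children` is a leaf."""
    if v not in children:
        return {v}
    result = set()
    for c in children[v]:
        result |= leaves_below(children, c)
    return result


def bmg(children, root, color):
    """Arc set of the best match graph of the tree {inner vertex: [children]}."""
    parent = {c: p for p, cs in children.items() for c in cs}
    arcs = set()
    for x in leaves_below(children, root):
        done = {color[x]}          # colors whose best matches are already fixed
        v = x
        while v in parent:
            v = parent[v]          # next ancestor of x on the path to the root
            below = leaves_below(children, v)
            new = {color[y] for y in below} - done
            # Leaves of a newly seen color have their lca with x exactly at v.
            arcs |= {(x, y) for y in below if color[y] in new}
            done |= new
    return arcs


# Hypothetical example: the non-binary tree (x, y, (x2, y2)) with two colors.
tree = {"r": ["x", "y", "u"], "u": ["x2", "y2"]}
color = {"x": "A", "x2": "A", "y": "B", "y2": "B"}
print(sorted(bmg(tree, "r", color)))
```

The sketch recomputes leaf sets at every ancestor and is therefore far from efficient; it is meant to mirror the definition, not to serve as a practical implementation.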

Proposition 1. [16, Lemma 8] If T_A is obtained from a tree T by contracting all edges in a subset A of inner edges in T, then G(T, σ) ⊆ G(T_A, σ).
An edge e of a leaf-colored tree is redundant (w.r.t. (G, σ)) if it can be contracted without affecting the BMG, i.e., if G(T, σ) = G(T_e, σ).
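Redundancy of an inner edge can be checked by brute force: contract the edge and compare the two best match graphs. The Python sketch below does exactly that. The tree encoding, the helper names (`leaves_below`, `bmg`, `contract`, `is_redundant`), and both example trees are hypothetical and chosen for illustration only.

```python
def leaves_below(children, v):
    """Leaf set L(T(v)); a vertex without an entry in `children` is a leaf."""
    if v not in children:
        return {v}
    return set().union(*(leaves_below(children, c) for c in children[v]))


def bmg(children, root, color):
    """Arc set of G(T, sigma) for a tree given as {inner vertex: [children]}."""
    parent = {c: p for p, cs in children.items() for c in cs}
    arcs = set()
    for x in leaves_below(children, root):
        done, v = {color[x]}, x
        while v in parent:
            v = parent[v]
            below = leaves_below(children, v)
            new = {color[y] for y in below} - done
            arcs |= {(x, y) for y in below if color[y] in new}
            done |= new
    return arcs


def contract(children, u, v):
    """Contract the inner edge uv, where v is an inner child of u."""
    new = {p: list(cs) for p, cs in children.items() if p != v}
    i = new[u].index(v)
    new[u][i:i + 1] = children[v]   # the children of v become children of u
    return new


def is_redundant(children, root, color, u, v):
    """True if contracting uv leaves the best match graph unchanged."""
    return bmg(children, root, color) == bmg(contract(children, u, v), root, color)


# Hypothetical example 1: ((a1, a2), b); the inner edge r-v is redundant.
t1 = {"r": ["v", "b"], "v": ["a1", "a2"]}
c1 = {"a1": "A", "a2": "A", "b": "B"}
# Hypothetical example 2: (x, y, (x2, y2)); the inner edge r-u is not redundant.
t2 = {"r": ["x", "y", "u"], "u": ["x2", "y2"]}
c2 = {"x": "A", "x2": "A", "y": "B", "y2": "B"}
print(is_redundant(t1, "r", c1, "r", "v"), is_redundant(t2, "r", c2, "r", "u"))
```

In the second example the contraction only adds arcs, in line with Prop. 1.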
We define the notion of being least resolved here as a property of the tree (T, σ) alone. Of course, every least resolved tree is also least resolved w.r.t. some BMG, namely the (uniquely defined) graph G(T, σ).
It is shown in [6] that (T, σ) is least resolved if and only if it does not contain a redundant edge. In particular, we have
Proposition 2. [6, Thm. 8] Every BMG (G, σ) is explained by a unique least resolved tree (LRT), which is obtained from an arbitrary tree (T, σ) that explains (G, σ) by contraction of all redundant edges of (T, σ).
In particular, therefore, there is a bijection between BMGs and LRTs. Surprisingly, the property of being least resolved for some BMG is preserved under contraction of inner edges of T.
Theorem 1. Suppose (T, σ) is least resolved, let A be a set of inner edges of T, and denote by T_A the tree obtained from T by contracting all edges in A. Then (T_A, σ) is again least resolved.
Proof. Assume that (T, σ) is least resolved, i.e., it does not contain any redundant edges, and set (G, σ) := G(T, σ). Lemma 7 in [16] states that an inner edge e = uv with v ≺_T u is redundant if and only if there is no arc (a, b) ∈ E(G) such that lca_T(a, b) = v and σ(b) ∈ σ(L(T(u)) \ L(T(v))). The statement trivially holds if (T, σ) has at most one inner edge. Hence, we assume that (T, σ) has at least two distinct inner edges e = uv and e′. We show that every non-redundant edge e in T remains non-redundant in T_{e′}. Thus, let e be a non-redundant edge in T. Hence, there is an arc (a, b) ∈ E(G) such that lca_T(a, b) = v and σ(b) ∈ σ(L(T(u)) \ L(T(v))). Now consider the tree T_{e′} obtained from T by contraction of the inner edge e′ ≠ e. Clearly, we also have lca_{T_{e′}}(a, b) = v and σ(b) ∈ σ(L(T_{e′}(u)) \ L(T_{e′}(v))). Prop. 1 implies G(T, σ) ⊆ G(T_{e′}, σ), and thus, (a, b) ∈ E(G(T_{e′}, σ)). Making again use of the characterization of redundant edges in [16, Lemma 7], we conclude that e is non-redundant in (T_{e′}, σ).
Since both e and e′ were chosen arbitrarily, we observe that the contraction of a single inner edge does not produce new redundant edges. We can therefore apply this argument for each step in the consecutive contraction of all edges in A (in an arbitrary order) to conclude that (T_A, σ) does not contain redundant edges. Therefore, Prop. 2 implies that (T_A, σ) is least resolved.
As another immediate consequence of Thm. 1 and uniqueness of the LRT of a BMG (Prop. 2), we obtain
Corollary 2. If e and e′ are two distinct inner edges of a least resolved tree (T, σ), then G(T_e, σ) ≠ G(T_{e′}, σ).
Let us now turn to the subclass of BMGs that can be explained by a binary tree.
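As shown in [16], beBMGs can be characterized among BMGs by means of a simple forbidden colored induced subgraph:
Definition 5. An hourglass in a properly vertex-colored graph (G, σ), denoted by [xy ⧖ x′y′], is a subgraph induced by four pairwise distinct vertices x, x′, y, y′ with σ(x) = σ(x′) ≠ σ(y) = σ(y′) that contains all possible arcs between vertices of different colors except (x′, y) and (y′, x).
An hourglass together with a (non-binary) tree explaining it is illustrated in Fig. 1(A). A properly vertex-colored digraph that does not contain an hourglass as an induced subgraph is called hourglass-free.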
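A brute-force test for hourglass-freeness can be sketched as follows. The sketch assumes that an hourglass [xy ⧖ x′y′] consists of four distinct vertices with σ(x) = σ(x′) ≠ σ(y) = σ(y′) and contains all arcs between differently colored vertices except (x′, y) and (y′, x). The function names and the example data are hypothetical, and the enumeration over all 4-tuples is chosen for clarity rather than efficiency.

```python
from itertools import permutations


def hourglasses(vertices, arcs, color):
    """Yield all tuples (x, y, x2, y2) that induce an hourglass in the digraph."""
    for x, y, x2, y2 in permutations(vertices, 4):
        # Color pattern: x and x2 share one color, y and y2 share another.
        if not (color[x] == color[x2] != color[y] == color[y2]):
            continue
        present = {(x, y), (y, x), (x2, y2), (y2, x2), (x, y2), (y, x2)}
        absent = {(x2, y), (y2, x)}
        if present <= arcs and not (absent & arcs):
            yield (x, y, x2, y2)


def is_hourglass_free(vertices, arcs, color):
    """True if no 4-tuple of vertices induces an hourglass."""
    return next(hourglasses(vertices, arcs, color), None) is None


# Hypothetical example: the hourglass itself and its complete bipartite completion.
color = {"x": "A", "x2": "A", "y": "B", "y2": "B"}
hourglass = {("x", "y"), ("y", "x"), ("x2", "y2"), ("y2", "x2"), ("x", "y2"), ("y", "x2")}
completed = hourglass | {("x2", "y"), ("y2", "x")}
print(is_hourglass_free(color, hourglass, color), is_hourglass_free(color, completed, color))
```

Since the graph is assumed to be properly colored, arcs between vertices of the same color do not need to be checked.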
It is worth noting that the LRTs of beBMGs are usually not binary. In fact, it is shown in [15] that, for a beBMG (G, σ), there exists a unique binary-refinable tree (BRT) B(G, σ) with the property that every binary tree (T, σ) that displays B(G, σ) explains (G, σ). The BRT is in general much better resolved than the LRT of (G, σ).

Two-Colored BMGs
Let us now briefly focus on 2-colored BMGs (2-BMGs). Since arcs in a BMG can only connect vertices with different colors, every 2-BMG is bipartite. Furthermore, every leaf x in a tree with two leaf colors has at least one best match y. Every 2-BMG is therefore sink-free, i.e., every vertex has at least one out-neighbor. Furthermore, Schaller et al. [18] showed that the following graphs (see also Fig. 1(B)) are forbidden induced subgraphs for 2-BMGs.
Definition 7 (Support Leaves). For a given tree T, the set S_u := child_T(u) ∩ L(T) is the set of all support leaves of vertex u ∈ V(T).
We note in passing that every inner vertex u of the LRT of a 2-BMG (G, σ), with the possible exception of the root ρ, has a non-empty set of support leaves S_u, and S_ρ ≠ ∅ if and only if (G, σ) is connected [17]. In the following, we will make use of a connection between a 2-BMG and its LRT:
Lemma 1. Let (G, σ) be a 2-BMG, (T, σ) its LRT and x, y ∈ L(T) = V(G). Then (x, y) ∈ E(G) if and only if σ(x) ≠ σ(y) and y ∈ L(T(par_T(x))).
Proof. First note that, since (G, σ) is 2-colored, (T, σ) has at least two leaves and u := par_T(x) is always defined. Assume first that σ(x) ≠ σ(y), and thus x ≠ y, and let y ∈ L(T(u)). Since x is a child of u, we have lca_T(x, y) = u. Moreover, since u is the parent of x, there is no vertex y′ of color σ(y) such that lca_T(x, y′) ≺_T lca_T(x, y) = u. Hence, y is a best match of x, i.e., (x, y) ∈ E(G). Now suppose, for contraposition, that σ(x) = σ(y) or y ∉ L(T(u)). If σ(x) = σ(y), then, by definition, (x, y) ∉ E(G). If y ∉ L(T(u)), then u ≺_T ρ_T. Hence, we can apply Cor. 1 in [17] to the inner vertex u to conclude that |σ(L(T(u)))| > 1, i.e., the subtree T(u) contains both colors. Thus, we can find a vertex y′ of color σ(y) such that lca_T(x, y′) ⪯_T u ≺_T lca_T(x, y), which implies that (x, y) ∉ E(G).
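Lemma 1 gives a direct way to read the arcs of a 2-BMG off its LRT: every leaf x points to all leaves of the other color below its parent. The following Python sketch illustrates this. The tree encoding and the names `leaves_below` and `arcs_from_lrt` are assumptions made for this example, and the result is only meaningful if the input tree really is the LRT of a 2-BMG.

```python
def leaves_below(children, v):
    """Leaf set L(T(v)); a vertex without an entry in `children` is a leaf."""
    if v not in children:
        return {v}
    return set().union(*(leaves_below(children, c) for c in children[v]))


def arcs_from_lrt(children, root, color):
    """Arcs (x, y) with different colors and y in L(T(par(x))), as in Lemma 1."""
    parent = {c: p for p, cs in children.items() for c in cs}
    arcs = set()
    for x in leaves_below(children, root):
        u = parent[x]                      # par_T(x)
        arcs |= {(x, y) for y in leaves_below(children, u) if color[y] != color[x]}
    return arcs


# Hypothetical example: the tree (x, y, (x2, y2)), taken here as the LRT of the hourglass.
lrt = {"r": ["x", "y", "u"], "u": ["x2", "y2"]}
color = {"x": "A", "x2": "A", "y": "B", "y2": "B"}
print(sorted(arcs_from_lrt(lrt, "r", color)))
```

In this example the support leaves x and y of the root point to every leaf of the other color, whereas x2 and y2 only see each other.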
As an immediate consequence, we find

Completion of a 2-BMG to a 2-beBMG
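Writing G + F := (V, E ∪ F) for a graph G = (V, E) and an arc set F, consider the following graph completion problem:
Problem 1 (2-BMG Completion restricted to Binary-Explainable Graphs (2-BMG CBEG)).
Input: A properly 2-colored digraph (G = (V, E), σ) and an integer k.
Question: Is there a set F of at most k arcs, disjoint from E, such that (G + F, σ) is a binary-explainable 2-BMG?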
In the general case, 2-BMG CBEG is NP-complete [15,Cor. 5.11]. Here we are interested in the restriction of the 2-BMG CBEG problem with BMGs as input.
The following result holds for BMGs and their completions to beBMGs with an arbitrary number of colors.
Proof. It is shown in [6, Obs. 1] that the subgraph of a BMG induced by all vertices with any two given colors is a 2-BMG. Since (G′, σ) is a (binary-explainable) BMG, all of its 2-colored induced subgraphs are therefore 2-BMGs. By assumption, (G, σ) is not binary-explainable since it contains the hourglass [xy ⧖ x′y′] as an induced subgraph (cf. Prop. 3). The hourglass contains all possible arcs between vertices of different colors except (x′, y) and (y′, x). Since (G′, σ) contains no hourglass, and G′ is a completion of G, i.e., E(G) ⊆ E(G′), we conclude that (G′, σ) contains at least one of the arcs (x′, y) and (y′, x).
In other words, (T*, σ) is obtained from (T, σ) by collapsing every subtree T(u) to a star if u has support leaves of both colors.
Proof. The collapsed tree (T*, σ) is well-defined because whenever v ≺_T u, then collapsing the subtree T(v) to a star does not change the set of support leaves S_u. Similarly, collapsing T(v) if v is not ⪯_T-comparable with u does not change S_u. Thus (T*, σ) is uniquely defined. To see that (T*, σ) can be computed in O(|V(T)|) operations, we observe that it suffices to collapse all subtrees T(u) such that u ∈ V^0(T) has support leaves of both colors and there is no vertex u′ with u ≺_T u′ that has this property, i.e., u is ⪯_T-maximal in that sense. These vertices u for which T(u) is replaced by a star are found by a top-down traversal of T and evaluating |σ(S_u)|, all of which can be computed in linear total time.
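The top-down traversal described in the proof can be sketched in Python as follows. This is a minimal illustration under assumptions: the tree is encoded as a dictionary from inner vertices to child lists, the names `leaves_below` and `collapse` are hypothetical, and leaf sets are gathered by a simple recursion instead of the linear-time bookkeeping mentioned in the text.

```python
def leaves_below(children, v):
    """Leaf set L(T(v)); a vertex without an entry in `children` is a leaf."""
    if v not in children:
        return {v}
    return set().union(*(leaves_below(children, c) for c in children[v]))


def collapse(children, root, color):
    """Return the collapsed tree T* of a 2-colored tree as a new dictionary."""
    result = {}
    stack = [root]
    while stack:
        u = stack.pop()
        support = [c for c in children[u] if c not in children]   # S_u
        if len({color[s] for s in support}) == 2:
            # u has support leaves of both colors: replace T(u) by a star.
            result[u] = sorted(leaves_below(children, u))
        else:
            # Keep u unchanged and continue with its inner children.
            result[u] = list(children[u])
            stack.extend(c for c in children[u] if c in children)
    return result


# Hypothetical example: only the subtree rooted at w is collapsed.
tree = {"r": ["a1", "w"], "w": ["b1", "a2", "z"], "z": ["a3", "b3"]}
color = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b3": "B"}
print(collapse(tree, "r", color))
```

Because the traversal stops descending as soon as a subtree is collapsed, only the maximal vertices with support leaves of both colors are processed, as in the proof. Collapsing a subtree that already is a star leaves it unchanged up to the order of its children.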
We continue by showing the existence of certain arcs in every (not necessarily optimal) completion (G′, σ) of (G, σ) to a beBMG. To this end, consider a ⪯_T-maximal vertex u such that the subtree T(u) is not a star tree and u has support leaves S_u of both colors in T. We will make frequent use of the fact that E(G) ⊆ E(G′). We consider the following cases in order to show that all arcs between vertices x, y ∈ L(T(u)) with σ(x) ≠ σ(y) exist in (G′, σ): (i) x, y ∈ S_u, (ii) x ∈ L(T(u)) \ S_u and y ∈ S_u, and (iii) x, y ∈ L(T(u)) \ S_u.
In Case (i), the leaves x and y are both children of u. Together with Cor. 3, this implies (x, y), (y, x) ∈ E(G) ⊆ E(G′).
We will now show that E(G*) ⊆ E(G′) for every (not necessarily optimal) completion (G′, σ) of the 2-BMG (G, σ) to a beBMG. To this end, consider an arbitrary arc (x, y) ∈ E(G*). If (x, y) ∈ E(G), then (x, y) ∈ E(G′) follows immediately. Now assume that (x, y) ∈ F = E(G*) \ E(G). Since (G, σ) is a 2-BMG and thus properly colored and sink-free (cf. Prop. 4), there must be a vertex y′ of color σ(y) such that (x, y′) ∈ E(G). Since (x, y) ∉ E(G), we have lca_T(x, y′) ≺_T lca_T(x, y) and thus the LRT (T, σ) displays the triple xy′|y. However, (x, y), (x, y′) ∈ E(G*) implies that (T*, σ) does not display the triple xy′|y, i.e., all edges on the path from lca_T(x, y′) to lca_T(x, y) have been contracted. Therefore, there is a ⪯_T-maximal inner vertex u ∈ V^0(T) such that x, y ∈ L(T(u)), T(u) is not a star tree and u has support leaves of both colors in T. By the arguments above, we can conclude that (x, y) ∈ E(G′).
In summary, F is a solution for 2-BMG CBEG with the 2-BMG (G, σ) (and some integer k ≥ |F|) as input, and F ⊆ F′ for every other solution F′ = E(G′) \ E(G). Therefore, we conclude that F is the unique optimal solution.
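The optimal arc set F = E(G*) \ E(G) can be computed from the LRT alone. The Python sketch below is one possible reading of the construction and is not taken from the source: it assumes that the LRT of the input 2-BMG is already available (its construction from the graph, cf. [17], is not shown), uses the parent-based description of arcs from Lemma 1 for both (T, σ) and (T*, σ), and relies on hypothetical helper names and example data.

```python
def leaves_below(children, v):
    """Leaf set L(T(v)); a vertex without an entry in `children` is a leaf."""
    if v not in children:
        return {v}
    return set().union(*(leaves_below(children, c) for c in children[v]))


def arcs_from_lrt(children, root, color):
    """Arcs of the 2-BMG explained by an LRT: x -> other-colored leaves below par(x)."""
    parent = {c: p for p, cs in children.items() for c in cs}
    return {(x, y)
            for x in leaves_below(children, root)
            for y in leaves_below(children, parent[x])
            if color[x] != color[y]}


def collapse(children, root, color):
    """Collapse T(u) to a star at every topmost u with support leaves of both colors."""
    result, stack = {}, [root]
    while stack:
        u = stack.pop()
        support = [c for c in children[u] if c not in children]
        if len({color[s] for s in support}) == 2:
            result[u] = sorted(leaves_below(children, u))
        else:
            result[u] = list(children[u])
            stack.extend(c for c in children[u] if c in children)
    return result


def completion_arcs(children, root, color):
    """Arc set F = E(G*) - E(G), where G* is explained by the collapsed tree."""
    g = arcs_from_lrt(children, root, color)
    g_star = arcs_from_lrt(collapse(children, root, color), root, color)
    return g_star - g


# Hypothetical example: the LRT (x, y, (x2, y2)) of the hourglass.
lrt = {"r": ["x", "y", "u"], "u": ["x2", "y2"]}
color = {"x": "A", "x2": "A", "y": "B", "y2": "B"}
print(sorted(completion_arcs(lrt, "r", color)))
```

Applying the Lemma 1 description to the collapsed tree is justified by Thm. 1, since (T*, σ) arises from an LRT by contracting inner edges and is therefore again least resolved.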
As a direct consequence of Thm. 2, Lemma 3, and the fact that LRTs can be constructed in O(|V| + |E| log² |V|) time (cf. [17]), we have
We also immediately obtain a characterization of the LRTs of 2-beBMGs.

Concluding Remarks
Starting from the observation that the property of being least resolved is preserved under contraction of inner edges, we have obtained a characterization of the LRTs that explain 2-colored beBMGs. The construction of these "collapsed trees" corresponds to the completion of BMGs to beBMGs, resulting in a simple, polynomial-time algorithm for this problem.
In contrast to the 2-colored case, ℓ-BMG CBEG with a BMG as input and ℓ ≥ 3 in general does not have a unique optimal solution. In the example in Fig. 2, the missing arcs (a_2, b_1) and (b_2, a_1) in the induced hourglass [a_1 b_1 ⧖ a_2 b_2] must be inserted. The resulting graph is not a BMG. To obtain a BMG, it suffices to insert in addition either the arc (c, a_1) or the arc (c, b_1); either choice yields a beBMG (cf. Prop. 3).
Figure 2: Example for 3-BMG CBEG with the 3-BMG (G, σ) (explained by the LRT (T, σ)) as input that has no unique optimal solution. Insertion of the missing arcs (a_2, b_1) and (b_2, a_1) produces a graph that is not a BMG. At least one of the arcs (c, a_1) or (c, b_1) has to be inserted additionally to obtain the beBMGs (G_1, σ) and (G_2, σ) (shown with their LRTs (T_1, σ) and (T_2, σ)), respectively.
The simple solution of 2-BMG CBEG raises the question whether other arc modification problems for beBMGs, in particular the corresponding deletion and editing problems, have a similar structure. This does not seem to be the case, however. Neither 2-BMG EBEG nor 2-BMG DBEG with a 2-BMG as input has a unique optimal solution. To see this, consider the 2-BMG consisting of the hourglass [xy ⧖ x′y′], which is explained by the unique non-binary tree (x, y, (x′, y′)) (in Newick format, see also Fig. 1(A)). Deletion of the arcs (x, y) or (y, x) results in a graph that is explained by the binary trees (y, (x, (x′, y′))) or (x, (y, (x′, y′))), respectively. We suspect that a BMG as input does not make these problems easier than the general case, the complexity of which remains an open question, however.