# Improved Duplication-Transfer-Loss Reconciliation with Extinct and Unsampled Lineages

## Abstract

## 1. Introduction

## 2. Definitions and Preliminaries

#### 2.1. DTL Reconciliation

**Definition 1.** A transfer-loss ($\mathbb{TL}$) event occurs when the descendants of the donor species lose all copies derived from the transferred gene.

**Definition 2.** A transfer from unsampled lineage ($\mathbb{TX}$) event occurs when the donor species is not represented in the species tree.

#### 2.2. Transfer-Loss Events

**Definition 3.** Given a gene tree $\mathit{G}$, an augmented gene tree $\mathit{G}'$ is defined to be the tree obtained from $\mathit{G}$ by (i) selecting a subset of edges $A \subseteq E(\mathit{G})$, and (ii) subdividing each edge in $A$ by a hidden node such that each edge $(g, g') \in A$ is replaced by the two edges $(g, h)$ and $(h, g')$, where $h$ is a new hidden node.
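The construction in Definition 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dictionary-based tree representation, the function names `augment_gene_tree` and `suppress_unary`, and the hidden-node labels `h1`, `h2`, ... are all assumptions introduced here, and the labels are assumed not to clash with existing node names.

```python
from typing import Dict, List, Set, Tuple

Tree = Dict[str, List[str]]  # internal node -> ordered list of children
Edge = Tuple[str, str]       # (parent, child)


def augment_gene_tree(tree: Tree, subset: Set[Edge]) -> Tree:
    """Subdivide every edge in `subset` with a new hidden node (h1, h2, ...)."""
    augmented = {node: list(kids) for node, kids in tree.items()}
    for i, (parent, child) in enumerate(sorted(subset), start=1):
        if child not in tree.get(parent, []):
            raise ValueError(f"({parent}, {child}) is not an edge of the tree")
        hidden = f"h{i}"  # hypothetical label for the new hidden node
        # Replace (parent, child) by the two edges (parent, hidden), (hidden, child).
        position = augmented[parent].index(child)
        augmented[parent][position] = hidden
        augmented[hidden] = [child]
    return augmented


def suppress_unary(tree: Tree) -> Tree:
    """Splice out every node with exactly one child."""
    def skip(node: str) -> str:
        while len(tree.get(node, [])) == 1:
            node = tree[node][0]
        return node

    return {node: [skip(kid) for kid in kids]
            for node, kids in tree.items() if len(kids) != 1}


gene_tree = {"g0": ["a", "g1"], "g1": ["b", "c"]}
augmented = augment_gene_tree(gene_tree, {("g0", "g1")})
```

Suppressing the single-child nodes of `augmented` gives back `gene_tree`, which is the relationship between $\mathit{G}'$ and $\mathit{G}$ that a DTLx reconciliation requires.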

#### 2.3. Transfers from Unsampled Lineages

**Definition 4.** Given the species tree $\mathit{S}$, the augmented species tree $\mathit{S}'$ is defined to be the tree obtained from $\mathit{S}$ by (i) subdividing each edge in $E(\mathit{S})$ by a new extra node such that each edge $(s, s') \in E(\mathit{S})$ is replaced by the two edges $(s, e)$ and $(e, s')$, where $e$ is the new extra node, and connecting $e$ by an edge to a new extra leaf, and (ii) creating a new root node $r$ and connecting $r$ by an edge to $\mathit{rt}(\mathit{S})$ and by another edge to a new extra leaf. Each edge of $\mathit{S}'$ connecting an extra node with an extra leaf is called an extra edge.
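A minimal Python sketch of the species tree augmentation follows. The tree representation, the function name `augment_species_tree`, and the labels `e1, e2, ...` (extra nodes), `x0, x1, ...` (extra leaves), and `r` (new root) are assumptions made for illustration and are assumed not to clash with existing node names. The sketch also records the edge from the new root to its extra leaf as an extra edge, which assumes that the new root counts as an extra node.

```python
from typing import Dict, List, Tuple

Tree = Dict[str, List[str]]  # internal node -> ordered list of children


def augment_species_tree(tree: Tree, root: str) -> Tuple[Tree, str, List[Tuple[str, str]]]:
    """Return (augmented tree, new root, extra edges) for a rooted species tree."""
    augmented: Tree = {}
    extra_edges: List[Tuple[str, str]] = []
    counter = 0
    for parent, kids in tree.items():
        new_kids = []
        for child in kids:
            counter += 1
            extra_node, extra_leaf = f"e{counter}", f"x{counter}"
            # Edge (parent, child) becomes (parent, e) and (e, child);
            # e additionally receives a new extra leaf.
            augmented[extra_node] = [child, extra_leaf]
            extra_edges.append((extra_node, extra_leaf))
            new_kids.append(extra_node)
        augmented[parent] = new_kids
    # New root above the old root, with its own extra leaf.
    new_root, root_leaf = "r", "x0"
    augmented[new_root] = [root, root_leaf]
    extra_edges.append((new_root, root_leaf))  # assumes r counts as an extra node
    return augmented, new_root, extra_edges


species_tree = {"root": ["ab", "cd"], "ab": ["A", "B"], "cd": ["C", "D"]}
aug_tree, aug_root, extra = augment_species_tree(species_tree, "root")
all_nodes = set(aug_tree) | {kid for kids in aug_tree.values() for kid in kids}
```

For this four-leaf example the input has 7 nodes and 6 edges, so the augmentation adds 6 extra nodes, 6 extra leaves, the new root, and its extra leaf, and every internal node of the result is still binary.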

#### 2.4. DTLx Reconciliation

**Definition 5.** A DTLx reconciliation for $\mathit{G}$ and $\mathit{S}$ is an eleven-tuple $\langle \mathcal{L}, \mathit{G}', \mathit{S}', \mathcal{M}, \Sigma, \Delta, \Theta, \Theta_{L}, \Theta_{X}, \Xi, \tau \rangle$, where $\mathcal{L}: \mathit{Le}(\mathit{G}) \to \mathit{Le}(\mathit{S})$ represents the leaf mapping from $\mathit{G}$ to $\mathit{S}$, $\mathit{G}'$ denotes the augmented gene tree, $\mathit{S}'$ denotes the augmented species tree, $\mathcal{M}: V(\mathit{G}') \to V(\mathit{S}')$ maps each node of $\mathit{G}'$ to a node of $\mathit{S}'$, the sets $\Sigma$, $\Delta$, $\Theta$, $\Theta_{L}$, and $\Theta_{X}$ partition $I(\mathit{G}')$ into speciation, duplication, $\mathbb{T}$, $\mathbb{TL}$, and $\mathbb{TX}$ events, respectively; $\Xi$ is the subset of $E(\mathit{G}')$ that represents transfer edges, and $\tau: \Theta \cup \Theta_{L} \cup \Theta_{X} \to V(\mathit{S}')$ specifies the recipient species for each transfer event, subject to the following constraints:

**Augmented gene tree constraint**

1. $\mathit{G}$ can be obtained from $\mathit{G}'$ by suppressing each node of $\mathit{G}'$ with exactly one child.

**Mapping constraints**

2. If $g \in \mathit{Le}(\mathit{G}')$, then $\mathcal{M}(g) = \mathcal{L}(g)$.
3. If $g \in I(\mathit{G}) \cap I(\mathit{G}')$ (i.e., $g$ is not a hidden node of $\mathit{G}'$) and $g'$ and $g''$ denote the children of $g$ in $\mathit{G}'$, then
   - (a) $\mathcal{M}(g) \not<_{\mathit{S}'} \mathcal{M}(g')$ and $\mathcal{M}(g) \not<_{\mathit{S}'} \mathcal{M}(g'')$,
   - (b) at least one of $\mathcal{M}(g')$ and $\mathcal{M}(g'')$ is a descendant of $\mathcal{M}(g)$.
4. If $g \in I(\mathit{G}') \setminus I(\mathit{G})$ (i.e., $g$ is a hidden node of $\mathit{G}'$) and $g'$ denotes its unique child, then $\mathcal{M}(g)$ and $\mathcal{M}(g')$ are incomparable.

**Event constraints**

5. Given any edge $(g, g') \in E(\mathit{G}')$, $(g, g') \in \Xi$ if and only if $\mathcal{M}(g)$ and $\mathcal{M}(g')$ are incomparable.
6. If $g \in I(\mathit{G}) \cap I(\mathit{G}')$ and $g'$ and $g''$ denote the children of $g$ in $\mathit{G}'$, then
   - (a) $g \in \Sigma$ only if $\mathcal{M}(g) = \mathit{lca}(\mathcal{M}(g'), \mathcal{M}(g''))$ and $\mathcal{M}(g')$ and $\mathcal{M}(g'')$ are incomparable,
   - (b) $g \in \Delta$ only if $\mathcal{M}(g) \ge_{\mathit{S}'} \mathit{lca}(\mathcal{M}(g'), \mathcal{M}(g''))$,
   - (c) $g \in \Theta$ if and only if either $(g, g') \in \Xi$ or $(g, g'') \in \Xi$,
   - (d) if $g \in \Theta$ and $(g, g') \in \Xi$, then $\mathcal{M}(g)$ and $\tau(g)$ must be incomparable and $\mathcal{M}(g')$ must be a descendant of $\tau(g)$, i.e., $\mathcal{M}(g') \le_{\mathit{S}'} \tau(g)$.
7. If $g \in I(\mathit{G}') \setminus I(\mathit{G})$ and $g'$ denotes its unique child, then
   - (a) $g \in \Theta_{L} \cup \Theta_{X}$, and $(g, g') \in \Xi$,
   - (b) $\mathcal{M}(g)$ and $\tau(g)$ are incomparable, and $\mathcal{M}(g') \le_{\mathit{S}'} \tau(g)$,
   - (c) $g \in \Theta_{X}$ if and only if $\mathcal{M}(g) \in \mathcal{X}(\mathit{S}')$.
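The constraints above rest on three tree relations: descendant-or-equal ($\le_{\mathit{S}'}$), incomparability, and $\mathit{lca}$. The following Python sketch shows one simple way to evaluate them from parent pointers. The helper names and the `parent`/`mapping` dictionaries are hypothetical, and the linear-time walks are chosen for clarity rather than efficiency.

```python
from typing import Dict, List


def ancestors(parent: Dict[str, str], node: str) -> List[str]:
    """Return node, its parent, its grandparent, ... up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def is_descendant_or_equal(parent: Dict[str, str], a: str, b: str) -> bool:
    """a <= b in the tree order: a equals b or a is a descendant of b."""
    return b in ancestors(parent, a)


def incomparable(parent: Dict[str, str], a: str, b: str) -> bool:
    """Neither node is an ancestor of (or equal to) the other."""
    return (not is_descendant_or_equal(parent, a, b)
            and not is_descendant_or_equal(parent, b, a))


def lca(parent: Dict[str, str], a: str, b: str) -> str:
    """Least common ancestor of a and b."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    raise ValueError("nodes do not belong to the same tree")


def is_transfer_edge(parent: Dict[str, str], mapping: Dict[str, str],
                     g: str, g_child: str) -> bool:
    """Constraint 5: an edge is a transfer edge iff its endpoints map to incomparable nodes."""
    return incomparable(parent, mapping[g], mapping[g_child])


# Toy species tree ((A, B), C) given by parent pointers.
parent = {"ab": "root", "C": "root", "A": "ab", "B": "ab"}
mapping = {"g": "C", "g_child": "A"}  # hypothetical images under M
```

With this toy tree, `A` and `C` are incomparable, so an edge whose endpoints map to `C` and `A` is classified as a transfer edge.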

**Definition 6.** Given a DTLx reconciliation $\alpha = \langle \mathcal{L}, \mathit{G}', \mathit{S}', \mathcal{M}, \Sigma, \Delta, \Theta, \Theta_{L}, \Theta_{X}, \Xi, \tau \rangle$ for $\mathit{G}$ and $\mathit{S}$, let $g \in V(\mathit{G}')$ and $\{g', g''\} = \mathit{Ch}(g)$ if $g \in I(\mathit{G}) \cap I(\mathit{G}')$, and $\{g'\} = \mathit{Ch}(g)$ otherwise. The minimum number of losses $\mathrm{Loss}_{\alpha}(g)$ at node $g$ (or, more accurately, the minimum number of losses incurred along the child edge(s) of $g$) is defined to be:

- $d_{\mathit{S}'}(\mathcal{M}(g), \mathcal{M}(g')) + d_{\mathit{S}'}(\mathcal{M}(g), \mathcal{M}(g'')) - 2$, if $g \in \Sigma$;
- $d_{\mathit{S}'}(\mathcal{M}(g), \mathcal{M}(g')) + d_{\mathit{S}'}(\mathcal{M}(g), \mathcal{M}(g''))$, if $g \in \Delta$;
- $d_{\mathit{S}'}(\mathcal{M}(g), \mathcal{M}(g'')) + d_{\mathit{S}'}(\tau(g), \mathcal{M}(g'))$, if $(g, g') \in \Xi$ and $g \in \Theta$; and
- $d_{\mathit{S}'}(\tau(g), \mathcal{M}(g'))$, if $(g, g') \in \Xi$ and $g \in \Theta_{L}$ or $g \in \Theta_{X}$.
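The four cases of Definition 6 translate directly into a small case analysis. The sketch below is illustrative only: the function name `losses_at` and the event labels are hypothetical, and the distance function is passed in as a parameter because how $d_{\mathit{S}'}$ is measured is not restated in this section. The toy `edge_count` function (number of edges between a node and one of its ancestors) is an assumption used purely to exercise the code.

```python
from typing import Callable, Optional

Distance = Callable[[str, str], int]  # d(ancestor, descendant) on the species tree


def losses_at(event: str, d: Distance, m_g: str, m_g1: str,
              m_g2: Optional[str] = None, tau_g: Optional[str] = None) -> int:
    """Minimum number of losses at a gene node, following the four cases.

    m_g, m_g1, m_g2 are the species-tree images of g, g' and g''; tau_g is the
    recipient. For transfer events, g' is taken to be the transferred child.
    """
    if event == "speciation":
        return d(m_g, m_g1) + d(m_g, m_g2) - 2
    if event == "duplication":
        return d(m_g, m_g1) + d(m_g, m_g2)
    if event == "transfer":
        return d(m_g, m_g2) + d(tau_g, m_g1)
    if event in ("transfer_loss", "transfer_unsampled"):
        return d(tau_g, m_g1)
    raise ValueError(f"unknown event type: {event}")


# Toy species tree ((A, B), C) and a toy distance (both assumptions).
PARENT = {"ab": "root", "C": "root", "A": "ab", "B": "ab"}


def edge_count(ancestor: str, descendant: str) -> int:
    """Number of edges walked from `descendant` up to `ancestor`."""
    steps, node = 0, descendant
    while node != ancestor:
        node = PARENT[node]  # raises KeyError if `ancestor` is not above `descendant`
        steps += 1
    return steps
```

For instance, a speciation at `root` whose children map to `A` and `C` incurs one loss under this toy distance (the copy that would have gone to `B`), whereas a duplication with the same mappings incurs three.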

**Definition 7.** Given a DTLx reconciliation $\alpha = \langle \mathcal{L}, \mathit{G}', \mathit{S}', \mathcal{M}, \Sigma, \Delta, \Theta, \Theta_{L}, \Theta_{X}, \Xi, \tau \rangle$ for $\mathit{G}$ and $\mathit{S}$, the reconciliation cost associated with $\alpha$ is the total cost of all events invoked by $\alpha$. Specifically, the reconciliation cost is given by $P_{\Delta} \times |\Delta| + P_{\Theta} \times |\Theta| + P_{\mathbb{TL}} \times |\Theta_{L}| + P_{\mathbb{TX}} \times |\Theta_{X}| + P_{loss} \times \mathrm{Loss}_{\alpha}$, where $\mathrm{Loss}_{\alpha} = \sum_{g \in I(\mathit{G}')} \mathrm{Loss}_{\alpha}(g)$ is the total number of losses in $\alpha$.
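As a worked check of the cost formula, the sketch below recomputes the three totals quoted in the Figure 4 caption, which uses costs of 1, 2, 3, 4, and 3 for losses, duplications, transfers, $\mathbb{TL}$, and $\mathbb{TX}$ events. The class and function names are hypothetical, and the zero loss counts for the second and third reconciliations are inferred from the totals stated in that caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventCosts:
    duplication: float
    transfer: float
    transfer_loss: float       # P_TL
    transfer_unsampled: float  # P_TX
    loss: float


def reconciliation_cost(costs: EventCosts, n_dup: int, n_transfer: int,
                        n_tl: int, n_tx: int, n_loss: int) -> float:
    """Weighted sum of event counts; speciations carry no cost in the formula."""
    return (costs.duplication * n_dup
            + costs.transfer * n_transfer
            + costs.transfer_loss * n_tl
            + costs.transfer_unsampled * n_tx
            + costs.loss * n_loss)


figure4_costs = EventCosts(duplication=2, transfer=3, transfer_loss=4,
                           transfer_unsampled=3, loss=1)

# (d) two transfers and two losses; (e) one transfer and one TL; (f) one transfer and one TX.
cost_d = reconciliation_cost(figure4_costs, 0, 2, 0, 0, 2)
cost_e = reconciliation_cost(figure4_costs, 0, 1, 1, 0, 0)
cost_f = reconciliation_cost(figure4_costs, 0, 1, 0, 1, 0)
```

The ordering `cost_f < cost_e < cost_d` is what makes the $\mathbb{TX}$-based reconciliation the optimal DTLx reconciliation in that example.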

**Problem 1.** Given $\mathit{G}$ and $\mathit{S}$, along with $P_{\Delta}$, $P_{\Theta}$, $P_{\mathbb{TL}}$, $P_{\mathbb{TX}}$, and $P_{loss}$, the O-DTLx problem is to find a DTLx reconciliation for $\mathit{G}$ and $\mathit{S}$ with minimum reconciliation cost.

**Problem 2.** Given $\mathit{G}$ and $\mathit{S}$, along with $P_{\Delta}$, $P_{\Theta}$, $P_{\mathbb{TL}}$, $P_{\mathbb{TX}}$, and $P_{loss}$, let $\mathcal{O}$ denote the set of all optimal DTLx reconciliations for $\mathit{G}$ and $\mathit{S}$ (i.e., with minimum reconciliation cost). The O-DTLx-Sampling problem is to compute an optimal DTLx reconciliation from $\mathcal{O}$ in such a way that each DTLx reconciliation in $\mathcal{O}$ has an equal probability of being computed.

## 3. Materials and Methods

**Definition 8.** Given a gene tree $\mathit{G}$, the fully augmented gene tree, denoted $\mathit{G}''$, is defined to be the tree obtained from $\mathit{G}$ by subdividing each edge in $E(\mathit{G})$ by a hidden node such that each edge $(g, g') \in E(\mathit{G})$ is replaced by the two edges $(g, h)$ and $(h, g')$, where $h$ is a new hidden node.
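As a quick size check, assume that $\mathit{G}$ is a rooted binary tree with $k$ leaves (the symbol $k$ is introduced here only for illustration). Such a tree has $2k-1$ nodes and $2k-2$ edges, and the fully augmented tree adds exactly one hidden node per edge, so

$$|V(\mathit{G}'')| = (2k-1) + (2k-2) = 4k-3, \qquad |E(\mathit{G}'')| = 2(2k-2) = 4k-4.$$

Under this assumption the fully augmented gene tree is less than twice the size of $\mathit{G}$, so working on $\mathit{G}''$ instead of $\mathit{G}$ changes the number of gene tree nodes only by a constant factor.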

#### 3.1. An $O\left(mn\right)$-Time Algorithm for O-DTLx

**Theorem 1.**

**Proof.**

**Algorithm 1** Compute-O-DTLx($\mathit{G}, \mathit{S}, \mathcal{L}, P_{\Sigma}, P_{\Delta}, P_{\Theta}, P_{\mathbb{TL}}, P_{\mathbb{TX}}, P_{loss}$)

- Initialize $\mathit{G}''$ and $\mathit{S}'$ as outlined earlier.
- for each $g \in V(\mathit{G}'')$ and $s \in V(\mathit{S}')$ do
  - Initialize $c(g, s)$, $\mathit{in}(g, s)$, $\mathit{out}(g, s)$, and $\mathit{inAlt}(g, s)$ to $\infty$.
  - if $g \in I(\mathit{G}'') \cap I(\mathit{G})$ then
    - Initialize $c_{\Sigma}(g, s)$, $c_{\Delta}(g, s)$, and $c_{\Theta}(g, s)$ to $\infty$.
  - if $g \in I(\mathit{G}'') \setminus I(\mathit{G})$ then
    - Initialize $c_{\Theta_{L}}(g, s)$ and $c_{\Theta_{X}}(g, s)$ to $\infty$.
- for each $g \in \mathit{Le}(\mathit{G}'')$ do
  - Initialize $c(g, \mathcal{L}(g))$ to 0.
  - For each $s \ge_{\mathit{S}'} \mathcal{L}(g)$, initialize $\mathit{in}(g, s)$ to $P_{loss} \cdot d_{\mathit{S}'}(s, \mathcal{L}(g))$ and $\mathit{inAlt}(g, s)$ to 0.
  - For each $s \in V(\mathit{S}')$ incomparable to $\mathcal{L}(g)$, assign $\mathit{out}(g, s) = 0$.
- for each $g \in I(\mathit{G}'')$ in post-order do
  - for each $s \in V(\mathit{S}')$ in post-order do
    - if $g \in I(\mathit{G}'') \setminus I(\mathit{G})$ then
      - Let $g'$ denote the unique child of $g$ in $\mathit{G}''$.
      - Compute $c_{\Theta_{L}}(g, s)$ and $c_{\Theta_{X}}(g, s)$ according to Equations (6) and (7), respectively.
      - Compute $c(g, s)$ according to Equation (2).
      - if $s \in \mathit{Le}(\mathit{S}')$ then
        - $\mathit{in}(g, s) = \mathit{inAlt}(g, s) = c(g, s)$.
      - else
        - $\mathit{inAlt}(g, s) = \min\{c(g, s), \mathit{inAlt}(g, s'), \mathit{inAlt}(g, s'')\}$.
        - If $s$ is an extra node then $\mathit{in}(g, s) = \min\{c(g, s), \mathit{in}(g, s') + P_{loss}, \mathit{in}(g, s'') + P_{loss}\}$. If $s$ is not an extra node then $\mathit{in}(g, s) = \min\{c(g, s), \mathit{in}(g, s'), \mathit{in}(g, s'')\}$.
    - if $g \in I(\mathit{G}'') \cap I(\mathit{G})$ then
      - Let $\{g', g''\} = \mathit{Ch}_{\mathit{G}'}(g)$.
      - If $s \notin \mathit{Le}(\mathit{S}')$ then let $\{s', s''\} = \mathit{Ch}_{\mathit{S}'}(s)$.
      - Compute $c_{\Sigma}(g, s)$, $c_{\Delta}(g, s)$, and $c_{\Theta}(g, s)$ according to Equations (3), (4), and (5), respectively.
      - Compute $c(g, s)$ according to Equation (2).
      - if $s \in \mathit{Le}(\mathit{S}')$ then
        - $\mathit{in}(g, s) = \mathit{inAlt}(g, s) = c(g, s)$.
      - else
        - $\mathit{inAlt}(g, s) = \min\{c(g, s), \mathit{inAlt}(g, s'), \mathit{inAlt}(g, s'')\}$.
        - If $s$ is an extra node then $\mathit{in}(g, s) = \min\{c(g, s), \mathit{in}(g, s') + P_{loss}, \mathit{in}(g, s'') + P_{loss}\}$. If $s$ is not an extra node then $\mathit{in}(g, s) = \min\{c(g, s), \mathit{in}(g, s'), \mathit{in}(g, s'')\}$.
  - for each $s \in I(\mathit{S}')$ in pre-order do
    - Let $\{s', s''\} = \mathit{Ch}_{\mathit{S}'}(s)$.
    - $\mathit{out}(g, s') = \min\{\mathit{out}(g, s), \mathit{inAlt}(g, s'')\}$, and $\mathit{out}(g, s'') = \min\{\mathit{out}(g, s), \mathit{inAlt}(g, s')\}$.
- Return $\min_{s \in V(\mathit{S}')} c(\mathit{rt}(\mathit{G}''), s)$.
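The bookkeeping for $\mathit{in}$, $\mathit{inAlt}$, and $\mathit{out}$ in Algorithm 1 can be isolated from the event-cost equations. The Python sketch below takes the values $c(g, \cdot)$ for one fixed gene node as given and applies only the update rules stated in the algorithm: a post-order pass for $\mathit{in}$ and $\mathit{inAlt}$, then a pre-order pass for $\mathit{out}$. The function name `propagate_tables`, the dictionary-based tree, and the toy input values are assumptions for illustration; Equations (2) to (7) are not reproduced here.

```python
import math
from typing import Callable, Dict, List, Tuple

Table = Dict[str, float]


def propagate_tables(children: Dict[str, List[str]], root: str, c: Table,
                     is_extra: Callable[[str], bool],
                     p_loss: float) -> Tuple[Table, Table, Table]:
    """Compute in, inAlt and out over the species tree for one gene node."""
    # Pre-order listing (parents before children); its reverse is a valid post-order.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    in_, in_alt = {}, {}
    for s in reversed(order):
        kids = children.get(s, [])
        if not kids:
            in_[s] = in_alt[s] = c[s]
        else:
            s1, s2 = kids
            in_alt[s] = min(c[s], in_alt[s1], in_alt[s2])
            # A loss is charged only when the path passes through an extra node.
            step = p_loss if is_extra(s) else 0.0
            in_[s] = min(c[s], in_[s1] + step, in_[s2] + step)

    out = {root: math.inf}  # out is initialised to infinity at the root
    for s in order:
        kids = children.get(s, [])
        if kids:
            s1, s2 = kids
            out[s1] = min(out[s], in_alt[s2])
            out[s2] = min(out[s], in_alt[s1])
    return in_, in_alt, out


# Toy augmented tree: regular node "s" above extra nodes "e1" and "e2".
toy_children = {"s": ["e1", "e2"], "e1": ["A", "xa"], "e2": ["B", "xb"]}
toy_c = {"s": math.inf, "e1": math.inf, "e2": math.inf,
         "A": 5.0, "xa": math.inf, "B": 1.0, "xb": math.inf}
in_, in_alt, out = propagate_tables(toy_children, "s", toy_c,
                                    lambda node: node.startswith("e"), 1.0)
```

Unrolling the pre-order rule shows that $\mathit{out}(g, s)$ ends up as the minimum of $\mathit{inAlt}(g, \cdot)$ over the siblings of $s$ and of its ancestors, that is, over the subtrees whose nodes are incomparable to $s$.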

#### 3.2. An $O\left(m{n}^{2}\right)$-Time Algorithm for O-DTLx-Sampling

**Theorem 2.**

#### 3.3. Assigning Event Costs for $\mathbb{TL}$ and $\mathbb{TX}$

## 4. Results

#### 4.1. Results on Simulated Datasets

#### 4.2. Biological Data

## 5. Discussion and Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Tofigh, A. Using Trees to Capture Reticulate Evolution: Lateral Gene Transfers and Cancer Progression. Ph.D. Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2009.
2. Gorbunov, K.Y.; Liubetskii, V.A. Reconstructing genes evolution along a species tree. Molekuliarnaia Biologiia **2009**, 43, 946–958.
3. Doyon, J.P.; Scornavacca, C.; Gorbunov, K.Y.; Szöllosi, G.J.; Ranwez, V.; Berry, V. An Efficient Algorithm for Gene/Species Trees Parsimonious Reconciliation with Losses, Duplications and Transfers. In Research in Computational Molecular Biology—Comparative Genomics; Tannier, E., Ed.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6398, pp. 93–108.
4. Tofigh, A.; Hallett, M.T.; Lagergren, J. Simultaneous Identification of Duplications and Lateral Gene Transfers. IEEE/ACM Trans. Comput. Biol. Bioinform. **2011**, 8, 517–535.
5. David, L.A.; Alm, E.J. Rapid evolutionary innovation during an Archaean genetic expansion. Nature **2011**, 469, 93–96.
6. Chen, Z.Z.; Deng, F.; Wang, L. Simultaneous Identification of Duplications, Losses, and Lateral Gene Transfers. IEEE/ACM Trans. Comput. Biol. Bioinform. **2012**, 9, 1515–1528.
7. Bansal, M.S.; Alm, E.J.; Kellis, M. Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics **2012**, 28, 283–291.
8. Stolzer, M.; Lai, H.; Xu, M.; Sathaye, D.; Vernot, B.; Durand, D. Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics **2012**, 28, 409–415.
9. Szollosi, G.J.; Boussau, B.; Abby, S.S.; Tannier, E.; Daubin, V. Phylogenetic modeling of lateral gene transfer reconstructs the pattern and relative timing of speciations. Proc. Natl. Acad. Sci. USA **2012**, 109, 17513–17518.
10. Szollosi, G.J.; Tannier, E.; Lartillot, N.; Daubin, V. Lateral Gene Transfer from the Dead. Syst. Biol. **2013**, 62, 386–397.
11. Bansal, M.S.; Alm, E.J.; Kellis, M. Reconciliation Revisited: Handling Multiple Optima when Reconciling with Duplication, Transfer, and Loss. J. Comput. Biol. **2013**, 20, 738–754.
12. Scornavacca, C.; Paprotny, W.; Berry, V.; Ranwez, V. Representing a Set of Reconciliations in a Compact Way. J. Bioinform. Comput. Biol. **2013**, 11, 1250025.
13. Libeskind-Hadas, R.; Wu, Y.C.; Bansal, M.S.; Kellis, M. Pareto-optimal phylogenetic tree reconciliation. Bioinformatics **2014**, 30, i87–i95.
14. Sjostrand, J.; Tofigh, A.; Daubin, V.; Arvestad, L.; Sennblad, B.; Lagergren, J. A Bayesian Method for Analyzing Lateral Gene Transfer. Syst. Biol. **2014**, 63, 409–420.
15. Scornavacca, C.; Jacox, E.; Szöllosi, G.J. Joint amalgamation of most parsimonious reconciled gene trees. Bioinformatics **2015**, 31, 841–848.
16. Jacox, E.; Chauve, C.; Szollosi, G.J.; Ponty, Y.; Scornavacca, C. ecceTERA: Comprehensive gene tree-species tree reconciliation using parsimony. Bioinformatics **2016**, 32, 2056.
17. Bansal, M.S.; Kellis, M.; Kordi, M.; Kundu, S. RANGER-DTL 2.0: Rigorous reconstruction of gene-family evolution by duplication, transfer and loss. Bioinformatics **2018**, 34, 3214–3216.
18. Kordi, M.; Bansal, M.S. Exact Algorithms for Duplication-Transfer-Loss Reconciliation with Non-Binary Gene Trees. IEEE/ACM Trans. Comput. Biol. Bioinform. **2019**, 16, 1077–1090.
19. Merkle, D.; Middendorf, M.; Wieseke, N. A parameter-adaptive dynamic programming approach for inferring cophylogenies. BMC Bioinform. **2010**, 11, S60.
20. Conow, C.; Fielder, D.; Ovadia, Y.; Libeskind-Hadas, R. Jane: A new tool for the cophylogeny reconstruction problem. Algorithms Mol. Biol. **2010**, 5, 16.
21. Donati, B.; Baudet, C.; Sinaimeri, B.; Crescenzi, P.; Sagot, M.F. EUCALYPT: Efficient tree reconciliation enumerator. Algorithms Mol. Biol. **2015**, 10, 3.
22. Santichaivekin, S.; Yang, Q.; Liu, J.; Mawhorter, R.; Jiang, J.; Wesley, T.; Wu, Y.C.; Libeskind-Hadas, R. eMPRess: A systematic cophylogeny reconciliation tool. Bioinformatics **2020**, btaa978.
23. Williams, D.; Gogarten, J.P.; Papke, R.T. Quantifying Homologous Replacement of Loci between Haloarchaeal Species. Genome Biol. Evol. **2012**, 4, 1223–1244.
24. Ovadia, Y.; Fielder, D.; Conow, C.; Libeskind-Hadas, R. The Cophylogeny Reconstruction Problem Is NP-Complete. J. Comput. Biol. **2011**, 18, 59–65.
25. Libeskind-Hadas, R.; Charleston, M. On the Computational Complexity of the Reticulate Cophylogeny Reconstruction Problem. J. Comput. Biol. **2009**, 16, 105–117.
26. Hasić, D.; Tannier, E. Gene tree reconciliation including transfers with replacement is NP-hard and FPT. J. Comb. Optim. **2019**, 38, 502–544.
27. Kordi, M.; Kundu, S.; Bansal, M.S. On Inferring Additive and Replacing Horizontal Gene Transfers Through Phylogenetic Reconciliation. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, Niagara Falls, NY, USA, 7–10 September 2019; pp. 514–523.
28. Zhaxybayeva, O.; Gogarten, J.P. Horizontal gene transfer, gene histories, and the root of the tree of life. In Planetary Systems and the Origins of Life; Pudritz, R., Higgs, P., Stone, J., Eds.; Cambridge Astrobiology; Cambridge University Press: Cambridge, UK, 2007; pp. 178–192.
29. Davín, A.A.; Tricou, T.; Tannier, E.; de Vienne, D.N.; Szollosi, G.J. Zombi: A phylogenetic simulator of trees, genomes and sequences that accounts for dead linages. Bioinformatics **2019**, 36, 1286–1288.
30. Bansal, M.S.; Wu, Y.C.; Alm, E.J.; Kellis, M. Improved gene tree error correction in the presence of horizontal gene transfer. Bioinformatics **2015**, 31, 1211–1218.

**Figure 1.** Illustration of the difference between (**a**) a standard transfer event and (**b**) a $\mathbb{TL}$ event.

**Figure 3.** Representation of unsampled lineages on the species tree. (**a**) Illustration of the “coral” of life based on drawings by Darwin, where black lines represent lineages leading to extant species, and grey lines represent potential unsampled species; adapted from [28]. (**b**) Input species tree before augmentation. (**c**) Augmented species tree. Extra nodes are in green and extra leaves are in white. Leaves labeled A-D represent extant species. Unsampled lineages are represented by the edges connecting extra nodes and extra leaves.

**Figure 4.** Impact of $\mathbb{TL}$ and $\mathbb{TX}$ events on optimal DTLx reconciliations. (**a**) and (**b**) are the input gene tree $\mathit{G}$ and input species tree $\mathit{S}$, respectively. (**c**) An augmented version of $\mathit{G}$ containing a single hidden node as required by the reconciliations shown in Parts (**e**,**f**). (**d**) An optimal DTL reconciliation of $\mathit{G}$ and $\mathit{S}$ illustrating how the gene tree (blue lines) may have evolved within the species tree (tubes) without the use of $\mathbb{TL}$ or $\mathbb{TX}$ events. This DTL reconciliation invokes two transfer events and two losses. All reconciliations shown in this figure use event costs of 1, 2, 3, 4, and 3 for losses, duplications, transfers, $\mathbb{TL}$, and $\mathbb{TX}$ events, respectively. Thus, we get a total reconciliation cost of 8 for this optimal DTL reconciliation. (**e**) An optimal reconciliation of $\mathit{G}$ and $\mathit{S}$ that uses $\mathbb{TL}$ events but not $\mathbb{TX}$ events. This reconciliation utilizes the hidden node h1 in $\mathit{G}'$ to facilitate a $\mathbb{TL}$ event from species E to species B. It invokes one transfer event and one $\mathbb{TL}$ event, for a total reconciliation cost of 7. (**f**) An optimal DTLx reconciliation of $\mathit{G}$ and $\mathit{S}$. This DTLx reconciliation utilizes the hidden node h1 in $\mathit{G}'$ to facilitate a $\mathbb{TX}$ event. It invokes one transfer event and one $\mathbb{TX}$ event, for a total reconciliation cost of 6. Accordingly, the reconciliation shown in (**f**) makes use of an extra leaf and extra node on $\mathit{S}'$ (dotted green tube).

**Figure 5.** Impact of increasing support values on $\mathbb{TX}$ event accuracy. The support for an event is defined as the percentage of the optimal solution space in which the event appears.
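The support measure defined in the Figure 5 caption can be estimated from a collection of optimal reconciliations. The sketch below is a minimal illustration under stated assumptions: each reconciliation is represented as a set of hashable event descriptors, and the tuple layout `(event type, gene node, recipient)` and the function name `event_support` are hypothetical choices, not the representation used by any particular tool.

```python
from collections import Counter
from typing import Dict, Hashable, List, Set


def event_support(reconciliations: List[Set[Hashable]]) -> Dict[Hashable, float]:
    """Percentage of the given reconciliations in which each event appears."""
    if not reconciliations:
        raise ValueError("at least one reconciliation is required")
    # Each reconciliation is a set, so an event is counted at most once per sample.
    counts = Counter(event for events in reconciliations for event in events)
    total = len(reconciliations)
    return {event: 100.0 * n / total for event, n in counts.items()}


# Four hypothetical optimal reconciliations for one gene tree.
samples = [
    {("TX", "g3", "B"), ("T", "g5", "C")},
    {("TX", "g3", "B")},
    {("TX", "g3", "B"), ("T", "g5", "C")},
    {("TX", "g3", "B"), ("TL", "g7", "A")},
]
support = event_support(samples)
# Keep only events that meet a minimum support cutoff of 50%.
well_supported = {event for event, pct in support.items() if pct >= 50.0}
```

When the samples are drawn uniformly at random from the optimal solution space, these percentages estimate the support; when the whole space is enumerated, they are exact.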

Dataset | Gene Tree: Duplication Rate | Gene Tree: Transfer Rate | Gene Tree: Loss Rate | Species Tree: Birth Rate | Species Tree: Extinction Rate | Species Tree: # Taxa (Leaves) |
---|---|---|---|---|---|---|
1 | 0.022 | 0.04 | 0.01 | 0.1 | 0.025 | 50 |
2 | 0.020 | 0.06 | 0.008 | 0.1 | 0.032 | 50 |

**Table 2.** Overall reconciliation accuracy. Average reconciliation accuracy, in terms of mapping accuracy and event assignment accuracy, is shown for each method. Event/mapping accuracy for a given input gene tree is calculated as the total number of correct events/mappings divided by the total number of internal nodes in that gene tree. The table also shows the accuracy of inferred recipients for transfer events. This recipient accuracy for each gene tree is calculated as the total number of correctly identified recipients divided by the total number of correctly identified transfers. Results are averaged across the 500 gene tree/species tree pairs in each dataset.

Dataset | Method | Event Accuracy | Mapping Accuracy | Recipient Accuracy |
---|---|---|---|---|
Dataset-1 | RANGER-DTL | 0.961 | 0.937 | 0.682 |
Dataset-1 | RANGER-DTLx, $P_{\mathbb{TX}}=4$ | 0.961 | 0.939 | 0.697 |
Dataset-1 | RANGER-DTLx, $P_{\mathbb{TX}}=3$ | 0.917 | 0.895 | 0.716 |
Dataset-1 | ecceTERA | 0.961 | 0.937 | 0.704 |
Dataset-2 | RANGER-DTL | 0.944 | 0.912 | 0.623 |
Dataset-2 | RANGER-DTLx, $P_{\mathbb{TX}}=4$ | 0.944 | 0.912 | 0.637 |
Dataset-2 | RANGER-DTLx, $P_{\mathbb{TX}}=3$ | 0.894 | 0.861 | 0.670 |
Dataset-2 | ecceTERA | 0.945 | 0.910 | 0.650 |

**Table 3.** Accuracy of $\mathbb{TX}$ event detection. The precision and recall for $\mathbb{TX}$ events inferred by each method are shown. A single optimal reconciliation is used per gene tree, and results are aggregated across all 500 gene tree/species tree pairs in each dataset.

Method | Dataset-1: $\mathbb{TX}$s Returned | Dataset-1: Recall | Dataset-1: Precision | Dataset-2: $\mathbb{TX}$s Returned | Dataset-2: Recall | Dataset-2: Precision |
---|---|---|---|---|---|---|
RANGER-DTLx, $P_{\mathbb{TX}}=4$ | 29 | 0.33% | 17.24% | 51 | 0.52% | 29.41% |
RANGER-DTLx, $P_{\mathbb{TX}}=3$ | 4515 | 30.12% | 9.97% | 6740 | 29.94% | 12.73% |
ecceTERA | 413 | 2.61% | 9.44% | 668 | 3.16% | 13.62% |

**Table 4.** The number of $\mathbb{TX}$ events inferred by each method, along with their precision, is shown for different minimum support value cutoffs. Results are aggregated across all 500 gene tree/species tree pairs in each dataset and presented in the form $a/b$, where $b$ is the total number of distinct $\mathbb{TX}$ events inferred across 100 randomly sampled optimal reconciliations for RANGER-DTLx and across the entire optimal solution space for ecceTERA for each gene tree, and $a$ is the number of these $\mathbb{TX}$ events that are correct.

Support | Dataset-1: RANGER-DTLx, $P_{\mathbb{TX}}=4$ | Dataset-1: RANGER-DTLx, $P_{\mathbb{TX}}=3$ | Dataset-1: ecceTERA | Dataset-2: RANGER-DTLx, $P_{\mathbb{TX}}=4$ | Dataset-2: RANGER-DTLx, $P_{\mathbb{TX}}=3$ | Dataset-2: ecceTERA |
---|---|---|---|---|---|---|
>0 | 14/77 | 857/9688 | 926/9285 | 34/128 | 1475/14,134 | 1554/12,884 |
≥25% | 12/61 | 818/7869 | 885/8825 | 27/84 | 1416/11,141 | 1511/12,252 |
≥50% | 4/10 | 445/4036 | 725/6977 | 13/34 | 874/5807 | 1257/9492 |
≥75% | 1/5 | 191/1018 | 242/1427 | 8/24 | 473/1932 | 582/2651 |
=100% | 1/5 | 91/503 | 199/1140 | 7/23 | 248/940 | 468/2088 |
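Each $a/b$ entry in Table 4 converts to a precision value by simple division. The short sketch below does this for the Dataset-1 column of RANGER-DTLx with $P_{\mathbb{TX}}=4$, using the counts exactly as listed in the table; the variable and function names are hypothetical.

```python
from typing import Dict, Tuple

# (correct, inferred) pairs copied from the Dataset-1, P_TX = 4 column of Table 4.
TABLE4_DATASET1_PTX4: Dict[str, Tuple[int, int]] = {
    ">0": (14, 77),
    ">=25%": (12, 61),
    ">=50%": (4, 10),
    ">=75%": (1, 5),
    "=100%": (1, 5),
}


def precision(correct: int, inferred: int) -> float:
    """Fraction of inferred events that are correct."""
    if inferred == 0:
        raise ValueError("precision is undefined when nothing was inferred")
    return correct / inferred


precisions = {cutoff: precision(a, b)
              for cutoff, (a, b) in TABLE4_DATASET1_PTX4.items()}

for cutoff, value in precisions.items():
    print(f"{cutoff:>6}: {value:.1%}")
```

The same function applies unchanged to the other columns, which makes it easy to compare how precision and the number of retained events trade off as the support cutoff increases.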

**Table 5.** Accuracy of placing unsampled species donors on the species tree. All $\mathbb{TX}$ events inferred by each method, regardless of support, for which the recipient species was identified correctly were used for this analysis. The inferred unsampled donor is considered an exact match if the location of the donor lineage on the species tree is the same as the true location of the donor. Numbers for $\mathbb{TX}$ events considered and exact matches are aggregated across all 500 gene tree/species tree pairs in each dataset. The distance between the inferred and true locations of the donor is defined to be the number of edges between the inferred donor lineage and the true lineage on the full species tree (including extinct lineages) as simulated by Zombi.

Method | Dataset-1: # $\mathbb{TX}$ Considered | Dataset-1: Exact Matches | Dataset-1: Average Distance | Dataset-2: # $\mathbb{TX}$ Considered | Dataset-2: Exact Matches | Dataset-2: Average Distance |
---|---|---|---|---|---|---|
RANGER-DTLx Cost 4 | 14 | 3 (21.4%) | 1.71 | 34 | 13 (38.2%) | 2.17 |
RANGER-DTLx Cost 3 | 857 | 473 (55.2%) | 1.52 | 1475 | 689 (46.7%) | 2.03 |
ecceTERA | 926 | 23 (2.5%) | 4.90 | 1554 | 36 (2.8%) | 6.21 |

**Table 6.** Number of biological dataset gene trees that result in additional co-optimal reconciliations invoking $\mathbb{TL}$ and/or $\mathbb{TX}$ events. These results are based on event costs of 1, 2, 3, 4, and 4, for losses, duplications, transfers, $\mathbb{TL}$, and $\mathbb{TX}$, respectively.

Dataset Size | Total # Gene Trees | Gene Trees with Additional Co-Optimal Solutions |
---|---|---|
<50 taxa | 3474 | 1350 (38.9%) |
50–100 taxa | 765 | 667 (87.2%) |
>100 taxa | 308 | 292 (94.8%) |

Dataset Size | RANGER-DTL | RANGER-DTLx | ecceTERA |
---|---|---|---|
<50 leaves | <1 s | 2.5 s | <1 s |
50–100 leaves | <1 s | 12.4 s | <1 s |
>100 leaves | <1 s | 24.5 s | <1 s |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Weiner, S.; Bansal, M.S.
Improved Duplication-Transfer-Loss Reconciliation with Extinct and Unsampled Lineages. *Algorithms* **2021**, *14*, 231.
https://doi.org/10.3390/a14080231
