Predicting the Evolution of Syntenies—An Algorithmic Review

Syntenies are genomic segments of consecutive genes identified by a certain conservation in gene content and order. The notion of conservation may vary from one definition to another, the more constrained requiring identical gene contents and gene orders, while more relaxed definitions just require a certain similarity in gene content, and not necessarily in the same order. Regardless of the way they are identified, the goal is to characterize homologous genomic regions, i.e., regions deriving from a common ancestral region, reflecting a certain gene co-evolution that can shed light on important functional properties. In addition to identifying them, it is also necessary to infer the evolutionary history that has led from the ancestral segment to the extant ones. In this field, most algorithmic studies address the problem of inferring rearrangement scenarios explaining the disruption in gene order between segments with the same gene content, some of them extending the evolutionary model to gene insertion and deletion. However, syntenies also evolve through other events modifying their gene content, such as duplications, losses or horizontal gene transfers, i.e., the movement of genes from one species to another. Although the reconciliation approach between a gene tree and a species tree addresses the problem of inferring such events for single-gene families, little effort has been dedicated to the generalization to segmental events and to syntenies. This paper reviews some of the main algorithmic methods for inferring ancestral syntenies and focuses on those integrating both gene orders and gene trees.


Introduction
Genes are the basic units of heredity containing the genetic information responsible for the functioning of a cell. During evolution, they are mutated, duplicated, lost and passed to organisms through speciation, the evolutionary process by which a population evolves to become a distinct species, or Horizontal Gene Transfers (HGT), largely shaping the evolution of bacteria, where genes are passed from one species to another. In addition, their order on the genome is modified through various rearrangement events, such as inversions, transpositions or translocations. See Figure 1(1) for an evolutionary history of gene sequences involving a variety of rearrangement, duplication and loss events, and Figure 1(2) for an evolutionary history of a single gene family also involving a HGT event.
Although mutations modifying genomic contents (gene gain and loss) and rearrangements modifying gene orders play a concerted role in shaping gene families, they are usually considered separately: gene gain and loss in the context of inferring the evolution of a given gene family, and rearrangements in the context of understanding genome evolution. In other words, in contrast to rearrangements, gain and loss events are usually considered to be single-gene events.
Figure 1. (1) Figure 4 in [1], representing the evolution of tRNA repertoires in the Bacillus genus. The tree represents the speciation history of a set of Bacillus species. Each colored arrow represents a block of tRNAs, following the operon subdivision available for B. cereus. Two arrows of the same color represent a duplicated block. Gray rectangles indicate the segment affected by an inversion. Notice that block orientation (indicated by the orientation of the arrow) does not reflect reality; it is only given to illustrate the effect of an inversion, which inverts not only the order but also the orientation of the blocks. (2) An evolutionary history of a single-gene family (for example, a set of arrows of one given color in the set of bacterial genomes) belonging to the set of genomes Σ = {A, B, C}. The gene family Γ = {a_1, a_2, b_1, b_2, c_1} is such that a gene x_i belongs to the genome X. The evolution of the gene family inside the species tree S is represented at the top, and the induced gene tree T is represented below. This evolutionary history involves a duplication (represented by a rectangle), losses (dotted lines) and an HGT event (represented by a horizontal line in S and a cross in T).
For a given gene family Γ with gene copies located in a set Σ of genomes, a gene tree T for Γ (representing the evolution of the gene sequences through nucleotide or amino acid mutations) and a species tree S for Σ, the reconciliation approach [2] consists of inferring the evolution of Γ by embedding T into S and explaining the incongruence between the two trees from duplications, losses or HGT events that would have obscured the speciation scenario. Reconciliation is based on the assumption that each gene family evolves independently. Although this hypothesis holds for genes that are far apart in the genome, it is clearly too restrictive for those grouped into syntenies, i.e., forming a set of homologous chromosomal regions, meaning that they are deriving from a common ancestral interval, with approximately the same gene content and order. Although convergent evolution should not be excluded, such co-linear sequences of genes are more plausibly the result of a concerted evolution from a common ancestral region, rather than of an independent set of gene duplications that would have generated the same gene organization in different genomic regions.
The neuropeptide Y-family receptors [3], the Homeobox gene clusters [4][5][6], the FGFR fibroblast growth factor receptors [7,8], the genes of the opioid system [9][10][11] or the major histocompatibility complex encoding numerous immunologically vital genes playing an imperative role in controlling the vertebrate adaptive immunity [12], are a few examples of genes organized in syntenies in human, as well as in numerous vertebrate genomes. Many of these gene families, appearing in potentially quadruplicated regions in human and other mammalian genomes, have been considered to be evidence of the "2R hypothesis" [13], which posits two rounds of whole genome duplication events in the evolution leading to the contemporary vertebrate genomes. Transposed duplications copying genes or chromosomal segments from an original locus to a new one also play an important role in the evolution of syntenies. Being able to distinguish between the two modes of evolution is also important [14].
Operons in bacteria, containing adjacent genes that are transcribed together into a single mRNA sequence, are another example of genes organized in syntenies [15]. This organization provides a valuable source of information. For example, genes belonging to the same metabolic pathway were found to be organized in similar operons in microorganisms of different phylogenetic lineages, such as Escherichia coli and the Gram-positive Bacillus subtilis [16]. Notice that as horizontal transfers between bacteria of the same or different proteobacterial branches play a major role in shaping bacterial operons, an evolutionary model for studying the origin and evolution of operons cannot avoid considering transfer events.
From an algorithmic point of view, research has focused mainly on the evolution of single-gene families based on sequence divergence and single-gene gain/loss on one side [17], and on the inference of ancestral genomes based on gene content and order of extant genomes on the other side [18]. For the latter branch of research, the considered methods can be grouped into distance-based methods, which label ancestral nodes in a way minimizing total branch length over the phylogeny, and synteny-based (or mapping) methods, which first infer a collection of relations between ancestral genes in terms of adjacencies, and then assemble this collection into Contiguous Ancestral Regions (CARs) [19]. This latter method can be seen as generating ancestral syntenies (conserved regions) from a set of extant genomes.
What about inferring the evolution of a set of syntenies? In other words, what about the intermediate stage between gene family evolution and genome evolution? In this paper, we review some of the strategies that can be used for this purpose, that combine both information on gene order and gene trees. This review can be seen as a follow-up on a previous review of the evolution of gene families [20], and another presenting the state-of-the-art on algorithmic methods accounting for all different types of evolutionary events (sequence, order and content) [21]. Another relevant review is that of Anselmetti et al. [18] on the reconstruction of ancestral genomes. However, the present review has a specific focus on the evolution of syntenies, rather than on single genes on one extremity, and on whole genomes on the other. I begin by introducing the concept of syntenies, and the general notations on trees in the next section. In Section 3, I briefly review the sorting by rearrangement problem on two permutations and on a phylogeny, and extend the review to the methods accounting for gene gain and loss in Section 4. The main part of this paper is Section 5 where I review, in more detail, algorithms for predicting synteny evolution, accounting for both gene trees and gene orders in a unifying framework. I finally conclude with a discussion on open problems.

Syntenies Defined as Gene Orders
The term "synteny", first introduced in 1971 [22], arose from the need to refer to Human genes located on the same chromosome, but with a genetic distance that could not be determined by the frequency of recombination inferred from the new gene mapping methods. As recalled in [23], synteny means "same thread" (or ribbon), a state of being together in location, as synchrony means being together in time. Thus, according to the original definition, saying that two genes are syntenic only means that they are located on the same chromosome. Today however, the term is largely used by biologists in an evolutionary sense to designate genes or chromosomal segments with a common evolutionary ancestry, i.e., homologous genes, or regions of contiguous genes.
For example, CoGe (https://genomevolution.org/wiki/index.php/Synteny (accessed on 8 April 2021)), a platform for performing comparative genomics research, defines a synteny as a valid deduction that two or more genomic regions derived from a single ancestral region. Inferring "syntenic blocks" usually relies on inferring pairs of chromosomal regions with a similar gene content and order. The SynMap tool of CoGe identifies such blocks by finding sets of homologous gene pairs and merging them into regions.
Such synteny blocks or regions that are more conserved than average in the genomes can reveal regulatory or functional interactions between the involved genes, or combinations of alleles that are advantageous when inherited together. Conversely, breakage of conservation in gene order or gene content is an important footprint of the evolution of genomes through global rearrangements [24][25][26] that can be used to infer phylogenetic trees [27].
Two chromosomal regions with identical gene content and order can clearly be labeled as syntenic. However, because syntenic regions are largely remodeled during evolution, it is usually necessary to relax this strict conservation requirement, allowing for a certain gene content or gene order disruption. Notice that genes are usually represented as signed ("+" for the 5′ → 3′ strand and "-" for the 3′ → 5′ strand) units, where the sign or orientation of a gene indicates on which of the two complementary DNA strands the gene is located.
Thus, ranging from a strict definition in terms of conserved segments with identical gene content, order and orientation [25] to the most relaxed one in terms of being located on the same chromosome, the notion of two regions being syntenic has been defined in several ways, also depending on the evolutionary events being considered. In fact, during evolution, syntenic regions evolve independently through local gene rearrangements or local events modifying their gene content, such as tandem duplications adding genes or, conversely, losses removing genes. They also evolve collectively through transpositions and translocations splitting a single synteny into two syntenies, or conversely joining two syntenies into one; new syntenies are created through transposed duplications [28] or whole genome duplication, or conversely lost [29]. They are also passed to organisms through speciation or HGTs (see Figure 1).
From a combinatorial point of view, various formal definitions of synteny blocks, also called gene clusters, have been introduced to allow identifying them in a set of genomes [20,30] (see Figure 2). Notice first that although we define syntenies as sequences of genes, from a combinatorial or an algorithmic point of view, any other marker or unit can be considered instead of genes. The notion of common intervals [31][32][33] refers to conserved segments in which we relax the conditions that genes appear in the same order or the same orientation. Formally, given K genomes represented as permutations on an alphabet Σ, a common interval is a subset S of Σ such that in each genome, all the genes in S are contiguous, i.e., grouped together with no other gene in between them, but not necessarily in the same order. In particular, strong common intervals, defined as common intervals that do not overlap with any other common interval [34], have rich combinatorial properties [30]. A more relaxed definition of synteny blocks accounts for possible gaps between genes. A first formal model of max-gap clusters was introduced in [35] under the name of gene teams: given K genomes, a gene team is a maximum subset A of a set of genes Γ such that in each genome, any gene in A is separated by at most δ genes from another gene of A. Common intervals and max-gap clusters completely ignore gene orders. A compromise between gene content and gene order conservation is given in [36,37], where two genes adjacent in one genome are required to be separated by at most δ genes in another genome.
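The definition of common intervals can be made concrete with a naive enumeration. The following Python sketch is only illustrative: the function name common_intervals is hypothetical, genes are assumed to be unsigned and to occur exactly once in every genome, and the cubic-time scan is chosen for readability rather than efficiency (the cited literature describes faster algorithms).

```python
from typing import FrozenSet, List, Sequence


def common_intervals(genomes: Sequence[Sequence[int]]) -> List[FrozenSet[int]]:
    """Enumerate common intervals (size >= 2) of permutations on one alphabet.

    Assumption: every genome is a permutation of the same unsigned genes.
    """
    reference, others = genomes[0], genomes[1:]
    # Position of each gene in every non-reference genome.
    positions = [{gene: i for i, gene in enumerate(g)} for g in others]
    found = []
    n = len(reference)
    for i in range(n):
        for j in range(i + 1, n):
            block = reference[i:j + 1]
            # Distinct genes are contiguous iff their positions span
            # exactly len(block) - 1 = j - i.
            contiguous = all(
                max(pos[g] for g in block) - min(pos[g] for g in block) == j - i
                for pos in positions
            )
            if contiguous:
                found.append(frozenset(block))
    return found


# Toy example with two genomes (hypothetical data).
toy_genomes = [[1, 2, 3, 4, 5], [3, 1, 2, 5, 4]]
toy_result = common_intervals(toy_genomes)
```

On this toy input, the set {2, 3} is contiguous in the first genome but not in the second, so it is not reported, whereas {1, 2, 3} is reported although its genes appear in a different order.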
We now introduce some terminology and notations on gene families and trees that we will use in this paper.

Gene Families
Two homologous genes or regions X_1 and X_2 are said to be orthologous if the last event that has led to the creation of X_1 and X_2 from a common ancestor is a speciation, paralogous if it is a duplication and xenologous if it is an HGT event. For example, in Figure 1(2), {a_1, b_1} are orthologous, {a_1, b_2} are paralogous and {a_2, b_1} are xenologous. A gene family refers to a set of homologous gene copies (orthologous, paralogous or xenologous) in one or many genomes. Gene families are usually inferred from gene sequence identity. In this paper, the alphabet Γ_X of a chromosomal region X is the set of gene families with loci in X; a sequence X of genes is called a permutation of Γ if it contains exactly one copy from each gene family of Γ. Two sequences X and Y are said to have the same gene content if they are defined on the same alphabet and have the same number of gene copies for each gene family. If gene orientations are known, then the elements of Γ appear in X accompanied with a sign + or −; we talk about signed syntenies (e.g., signed permutations).

Trees
If not specified differently, all trees are considered rooted and binary, where a binary tree is a tree with all internal (i.e., non-leaf) nodes being binary. We denote by r(T) the root, by V(T) the node set, by L(T) ⊂ V(T) the leafset and by E(T) the edge set of T. An edge of E(T) is written as a pair (x, y) of two adjacent nodes, where x, the closest to the root, is called the parent of y and y is called the child of x. In a binary tree, each internal node has two children. For an internal node x of a tree T, we denote by T_x the subtree of T rooted at x.
The lowest common ancestor (LCA) in T of a subset L′ of L(T), denoted by lca_T(L′), is the ancestor common to all nodes in L′ that is the most distant from the root.
Given a binary tree T, an extension of T is a tree T′ obtained from T by grafting edges to T, where grafting consists of subdividing an edge xy of T, therefore creating a new node z between x and y, then adding a leaf w with parent z.
A tree S is a species tree for a set Σ of species if its leafset is in bijection with Σ. A species tree represents an ordered set of speciation events that have led to Σ.
A gene family is a set Γ of genes where each gene g belongs to a given species S = s(g) of Σ. A tree T is a gene tree for a gene family Γ if its leafset is in bijection with Γ.

The Sorting by Rearrangement Problem
In 2003, Pevzner and Tesler [38] developed the notion of synteny blocks as chromosomal segments represented as permutations, that can be converted to identical permutations through micro-rearrangements. The GRIMM-Synteny algorithm [39] constructs synteny blocks from a dot-plot of anchors representing similarities between genes or non-coding regions, and chaining them ignoring micro-rearrangements.
The SORTING BY REARRANGEMENT PROBLEM consists of inferring a rearrangement history of minimum cost, for a given model of evolution, allowing the transformation of a permutation X into another permutation Y. For a unitary cost of operations, we call Rearrangement Distance between X and Y the minimum number of allowed operations transforming one synteny into the other.
Given two permutations, the SORTING BY REARRANGEMENT PROBLEM has been shown to be solvable in linear time for the inversion, translocation (including chromosomal fusion and fission), inversions+translocation distances [40][41][42], as well as for the SCJ (Single-Cut-or-Join) [43] and the DCJ (Double-Cut-and-Join) distance [44], where an SCJ event breaks or creates an adjacency, and a DCJ event breaks two adjacencies and reconnects their extremities in any possible manner. SCJs and DCJs are artificial events unifying most rearrangement events (inversions, transpositions and translocations) in a single model. On the other hand, computing the transposition distance between two permutations has been shown NP-hard [45], although efficient bounded heuristics exist, the best algorithm so far having an approximation factor of 1.375 [46].
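As an illustration of the simplest of these models, the following Python sketch computes an SCJ distance between two signed gene orders with the same gene content. It assumes the usual characterization of the SCJ distance as the size of the symmetric difference between the two adjacency sets; the function names and the encoding of genes as signed integers are hypothetical choices made for this example.

```python
from typing import FrozenSet, List, Set, Tuple

Extremity = Tuple[int, str]  # (gene name, "t" for tail or "h" for head)


def extremities(gene: int) -> Tuple[Extremity, Extremity]:
    """Return the (left, right) extremities of a signed gene as it is read."""
    name = abs(gene)
    tail, head = (name, "t"), (name, "h")
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies(chromosome: List[int], circular: bool = False) -> Set[FrozenSet[Extremity]]:
    """Set of adjacencies (unordered pairs of extremities) of a gene order."""
    pairs = list(zip(chromosome, chromosome[1:]))
    if circular and chromosome:
        pairs.append((chromosome[-1], chromosome[0]))
    return {frozenset((extremities(a)[1], extremities(b)[0])) for a, b in pairs}


def scj_distance(x: List[int], y: List[int], circular: bool = False) -> int:
    """Adjacencies to cut in x plus adjacencies to join to obtain y.

    Assumption: x and y are signed permutations on the same set of genes.
    """
    ax, ay = adjacencies(x, circular), adjacencies(y, circular)
    return len(ax - ay) + len(ay - ax)


# Toy example: y is obtained from x by inverting the segment (2, 3).
x = [1, 2, 3, 4]
y = [1, -3, -2, 4]
d = scj_distance(x, y)
```

In this toy example the inversion breaks two adjacencies and creates two new ones, so a single inversion is counted as four SCJ operations, which shows how the cost of an event depends on the chosen model.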

The Small Phylogeny Problem
Inferring the evolutionary history of a set of syntenies represented as gene orders has mainly been handled as a SMALL PHYLOGENY PROBLEM [47]. Given a single synteny per genome (i.e., no paralogous syntenies in the same genome are allowed), and given a known phylogenetic tree for the set of considered species, the problem is to infer the ancestral syntenies at the internal nodes of the tree in a way optimizing certain mathematical criteria according to the chosen evolutionary model. Those criteria are usually related to minimizing the number or cost of evolutionary events leading to the extant syntenies, although maximum likelihood criteria have also been considered.
Given a node-labeled tree S, where labels are syntenies, and given two adjacent nodes u and v in S where u is the parent of v, u is labeled by X_u and v is labeled by X_v, the length of the branch (u, v) of S is the minimum cost of an evolutionary scenario transforming X_u to X_v. Then the general problem can be formulated as in Algorithm 1.

Algorithm 1 Small Phylogeny Problem
SMALL PHYLOGENY PROBLEM:
Input: A phylogenetic tree S for a set Σ of species, a set Γ of gene families, a set X of syntenies on Γ labeling the leaves of S and a model of evolution;
Output: A synteny labeling of the internal nodes of S minimizing the total branch length over the phylogeny.
This problem has been most often considered in the context of inferring ancestral genomes, i.e., where syntenies are actually entire chromosomes or genomes. For most formulations in terms of different kinds of genomes (circular, multichromosomal, single or multiple gene copies, signed or unsigned genes) and different cost or distance metrics, even the simplest restriction in terms of the median of three genomes (an unrooted three leaf phylogeny) has been shown NP-hard [48].
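The objective of the small phylogeny problem can be illustrated by evaluating a candidate labeling. The Python sketch below scores a fully labeled tree by summing a distance over its branches. As a simplifying assumption, a breakpoint count on unsigned linear permutations stands in for the minimum-cost scenario of the chosen evolutionary model; the function names, the tree encoding as a parent-to-children dictionary and the toy gene orders are all hypothetical.

```python
from typing import Callable, Dict, List


def breakpoint_distance(x: List[int], y: List[int]) -> int:
    """Number of adjacencies of x that are absent from y (unsigned, linear)."""
    adj_y = {frozenset(p) for p in zip(y, y[1:])}
    return sum(1 for p in zip(x, x[1:]) if frozenset(p) not in adj_y)


def total_branch_length(
    children: Dict[str, List[str]],
    labels: Dict[str, List[int]],
    dist: Callable[[List[int], List[int]], int] = breakpoint_distance,
) -> int:
    """Sum of dist(label(u), label(v)) over all branches (u, v) of the tree."""
    return sum(
        dist(labels[u], labels[v]) for u, kids in children.items() for v in kids
    )


# Toy species tree ((A, B), C) with internal nodes "u" and "r".
tree = {"r": ["u", "C"], "u": ["A", "B"]}
leaves = {"A": [1, 2, 3, 4], "B": [1, 3, 2, 4], "C": [1, 2, 4, 3]}

# Two candidate labelings of the internal nodes.
labeling_1 = {**leaves, "u": [1, 2, 3, 4], "r": [1, 2, 3, 4]}
labeling_2 = {**leaves, "u": [1, 3, 2, 4], "r": [1, 2, 3, 4]}

score_1 = total_branch_length(tree, labeling_1)
score_2 = total_branch_length(tree, labeling_2)
```

Under this stand-in distance, the first labeling yields a smaller total branch length than the second, so it would be preferred. Solving the problem means searching over all such labelings, which is where the hardness arises.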
Based on the breakpoint graph of two permutations [49], and considering three genomes at a time, the MGR [50] algorithm infers the median by iteratively performing "good" reversals, i.e., reversals diminishing the distance between the three considered genomes. The MGRA [51] algorithm uses a generalization of the breakpoint graph, called multi-colors graph, to more than two permutations, and performs 2-breaks (corresponding to the standard reversals, translocations, fissions, and fusions) "consistent" with the given species tree.
On the other hand, the steinerization method is probably the most popular heuristic for the small phylogeny problem. First assigning an initial synteny labeling to each internal node of the phylogeny, the solution is then refined iteratively by decomposing the phylogeny into a set of overlapping median configurations, updating the median at each step only if it diminishes the sum of the lengths of the branches incident to the median, and iterating until eventually converging to a minimum. The quality of the solutions largely depends on the initialization of the ancestral gene orders. Various initialization strategies have been considered, with the purpose of avoiding local minima. In particular, based on a divide-and-conquer heuristic for finding a median of three permutations minimizing the inversion distance, GASTS [52] uses an accurate initialization step, allowing for an efficient algorithm running several orders of magnitude faster than existing approaches. Another approach, the Pathgroup approach [53], is based on storing partially completed breakpoint graphs on each node of the phylogeny and greedily completing them, following a priority list, in a bottom-up traversal of the species tree. The partial graphs eventually accumulate enough edges in their pathgroups so that cycles can be formed and so that fragments of the ancestral genome can be reconstructed. Other strategies for an initial assignment may consist of a lifted labeling, i.e., taking, for an internal node x, one gene order (synteny) among those labeling the leaves of S_x, or considering all gene orders that are in a certain neighborhood of the extant ones [54].

Accounting for Gene Gain and Loss
In the above section, we restricted the review to the papers considering syntenies (or genomes) as permutations on the same alphabet (same set of genes). However, gene loss and gene duplication can also modify the content of synteny blocks.
As for gene losses, they are relatively easy to integrate in the sorting by rearrangement algorithms. More precisely, for the case of syntenies represented as two permutations on two different alphabets (some genes occurring exclusively in one of the two sequences), the inversion+indel problem, which consists of computing the minimum number of inversions, insertions and deletions (indels) transforming one synteny into the other, has been shown equivalent to the DCJ+indel distance computation when the breakpoint graph representing the two syntenies has no "bad components" [55,56]. Moreover, linear time extensions of the DCJ distance computation to the DCJ+indel distance computation have been developed [57,58]. In addition, the MGRA algorithm, which reconstructs the ancestral genome of multiple genomes using a multi-color breakpoint graph, has been extended to MGRA2 [59], allowing for indels.
However, when duplicates are allowed in syntenies, an extra degree of difficulty is introduced as the one-to-one correspondence between gene copies is not established in advance. In this case, all pairwise rearrangement problems become hard [60]. A review of the methods used for comparing two ordered gene sequences with duplicates can be found in [61]. These methods are grouped into two main classes: those following the Match-and-Prune model, aiming at transforming strings into permutations to minimize a rearrangement distance between the resulting permutations [62][63][64][65], and those following the Block Edit model, introduced in its most general form by Lopresti and Tomkins [66], which consists of covering the two compared syntenies with pairs of blocks so as to minimize the cost of certain block operations. Such operations can be substitutions, inversions, transpositions, but also duplications. To maintain the symmetry of the resulting distance, a "block uncopy" (symmetrical to a duplication) is also considered.
As reviewed in [61], almost all versions of the Block Edit model are NP-hard. Moreover, even ignoring rearrangements and asking for an optimal sequence of duplications and losses transforming a synteny into another is shown APX-hard even if the number of occurrences of a gene inside a genome is bounded by 2 [67]. Exact exponential-time algorithms based on Integer Linear Programming (ILP) [68,69] and a polynomial-time heuristic based on dynamic programming [70] have been developed for this model, the latter being extended to rearrangements (inversions and transpositions), in addition to duplications and losses. The implemented OrthoALign software [1] has been applied, in a phylogenetic framework, to infer the evolution of transfer RNA repertoires in the Bacillus genus. Recently, an ILP formulation for the DCJ-Indel distance of "natural genomes", i.e., where any marker may occur an arbitrary number of times in any of the two genomes, has been developed [71]. Notice that the problem is slightly easier to handle for balanced syntenies, (i.e., two syntenies containing the same number of occurrences of each gene) though still NP-hard. For computing the DCJ distance of balanced genomes, an integer linear programming (ILP) formulation has been developed [72], as well as a linear time approximation algorithm using the adjacency graph (an alternative representation of the breakpoint graph) [73], with approximation factor O(k) where k is the maximum number of occurrences of any gene in the input genomes.
Finally, more complex evolutionary models have been considered [74,75] unifying the study of various problems on sequence alignment (nucleotide substitutions), rearrangements, duplications and homologous recombinations. These models are tractable only under some strict conditions, such as the hypothesis of no breakpoint re-use in [74], or under strict combinatorial constraints on the "history graph" introduced in [75].

Accounting for Gene Trees
The aforementioned methods for inferring the evolution of a set of syntenies only consider syntenies' contents and gene arrangements, while ignoring the evolution of each gene family through nucleotide and amino acid substitutions and indels affecting their sequences. A plethora of methods exist for reconstructing gene trees from sequence divergence. Classical methods use a distance, maximum likelihood or Bayesian approach for inferring the gene tree best representing a sequence alignment (e.g., PhyML [76], RAxML [77], MrBayes [78]), while others use a species tree, in addition to a multiple sequence alignment, to model gene gains and losses inferred from the reconciliation between gene and species trees (e.g., TreeBeST [79], PhylDog [80], ALE [81]). Several gene tree databases from whole genomes are available, including Ensembl Compara [82], PhylomeDB [83] or Panther [84].
In the following, we review the computational approaches that have been considered to integrate gene trees, in addition to gene order and/or gene gain and loss, in a unifying framework for synteny evolution.

The Reconciliation Approach
Given a gene tree T and a species tree S representing the true bifurcation histories of the considered gene family and set of species, respectively, inferring the scenario of gene gain and loss explaining the difference between the two trees is the purpose of the gene tree-species tree reconciliation approach [2,17]. A Reconciliation of T with respect to S is usually defined as an event-labeled extension of T, where an internal node label represents the event at the origin of the bifurcation, and grafted branches represent lost (or missing) genes. The considered events are most often speciations, duplications and possibly HGTs. In particular, a most parsimonious reconciliation minimizing the number D(T, S) of Duplications (the D-distance) or the number DL(T, S) of Duplications and Losses (the DL-distance) can be found in linear time using the LCA (Lowest Common Ancestor) mapping [85][86][87].
Given a tree T, an extension R of T (R can be equal to T), and a mapping s from L(T) to V(S) (indicating the genome to which the gene associated with each leaf of T belongs), an extension of s is a function s̃ from V(R) to V(S) such that for each leaf x of T, s̃(x) = s(x). Considering an evolutionary model for a gene family accounting for Duplications (D) and Losses (L) in addition to speciation, the DL-reconciliation of T with respect to S is defined as follows.
Definition 1 (DL-reconciliation). Let Γ be a gene family where each x ∈ Γ belongs to the species s(x) of Σ. Let T be a binary gene tree for Γ and S be a binary species tree for Σ. A DL-reconciliation is a triplet ⟨R, s̃, ẽ⟩ where R is an extension of T and s̃ is an extension of s such that for each binary node x of R with two children x_l and x_r, one of the following cases holds:
1. s̃(x_l) and s̃(x_r) are the two children of s̃(x) in S, in which case ẽ(x) = Spe, representing a speciation;
2. s̃(x_l) = s̃(x_r) = s̃(x) = σ, in which case ẽ(x) = Dup, representing a duplication in σ;
A grafted leaf on a newly created node x corresponds to a loss in s̃(x).
As R is an extension of T, each node in T has a corresponding node in R. In other words, we can consider that V(T) ⊆ V(R). In particular, the species labeling on R induces a species labeling on T.
Given a cost function c on duplications and losses, the cost of a reconciliation ⟨R, s̃, ẽ⟩ is the sum of the costs of the induced events.
Given a cost of 0 for speciation and a unitary cost for duplications and losses, the DL-reconciliation ⟨R, s̃, ẽ⟩ of minimum cost DL(T, S) is unique and obtained from s̃ being the LCA-mapping, i.e., verifying, for any internal node x of V(R) ∩ V(T), s̃(x) = lca_S(s(L(T_x))). We also refer to this reconciliation as the lca-reconciliation. See Figure 3 for an illustration.
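The LCA-mapping can be sketched in a few lines of Python. The code below is a minimal illustration under stated assumptions rather than a reference implementation: the gene tree is encoded as nested pairs with gene names at the leaves, the species tree as a child-to-parent dictionary, and all names (reconcile, lca, depth, the toy genes) are hypothetical. The naive LCA computation makes it slower than the linear-time algorithms cited above. A node is counted as a duplication when it is mapped to the same species as one of its children, and losses are counted from the number of species tree edges skipped between a node and its children.

```python
from typing import Dict, Optional, Tuple, Union

GeneTree = Union[str, Tuple["GeneTree", "GeneTree"]]
Parent = Dict[str, Optional[str]]  # child -> parent, root -> None


def depth(parent: Parent, node: str) -> int:
    """Number of edges between node and the root of the species tree."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def lca(parent: Parent, a: str, b: str) -> str:
    """Lowest common ancestor of a and b in the species tree (naive walk)."""
    ancestors = set()
    node: Optional[str] = a
    while node is not None:
        ancestors.add(node)
        node = parent[node]
    while b not in ancestors:
        b = parent[b]
    return b


def reconcile(tree: GeneTree, species_of: Dict[str, str], parent: Parent):
    """Return (species mapped to the root, #duplications, #losses)."""
    if isinstance(tree, str):  # leaf: mapped to the species of its gene
        return species_of[tree], 0, 0
    s_l, dup_l, loss_l = reconcile(tree[0], species_of, parent)
    s_r, dup_r, loss_r = reconcile(tree[1], species_of, parent)
    s_x = lca(parent, s_l, s_r)
    is_dup = s_x in (s_l, s_r)
    # Species tree edges separating the node from its two children.
    gaps = depth(parent, s_l) + depth(parent, s_r) - 2 * depth(parent, s_x)
    # For a speciation, one edge per child is expected and costs no loss.
    losses = gaps if is_dup else gaps - 2
    return s_x, dup_l + dup_r + int(is_dup), loss_l + loss_r + losses


# Toy example: species tree ((A, B), C) and gene tree ((a1, b1), (a2, c1)).
species_parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}
gene_species = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
root_species, n_dup, n_loss = reconcile(
    (("a1", "b1"), ("a2", "c1")), gene_species, species_parent
)
```

In this toy example the root of the gene tree is mapped to the root of the species tree and labeled as a duplication; one copy is then lost in C and the other in B, for a DL-distance of 3 under unit costs.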

Adjacency Evolution
Rather than considering each gene family independently, a first approach towards integrating dependency information between genes is to account for gene adjacencies. This is the goal of the DECO algorithm [88] integrating gene tree and gene order information for the purpose of inferring the evolution of syntenies, where syntenies are restricted to adjacencies, i.e., segments of two genes.
More precisely, given a set of gene families, each represented by a reconciled gene tree, i.e., a gene tree in which each internal node is labeled by the event (D or S) at the origin of the bifurcation it represents, and given a set of adjacencies between genes, the algorithm seeks an evolutionary scenario of the adjacencies in agreement with the reconciled gene trees, and minimizing adjacency gain and breakage.
Considering an appropriate clustering of adjacencies, the problem is shown to reduce to a set of problems with exactly two gene trees where all adjacencies are between those two trees. Therefore, the input of the DECO algorithm is a pair of event-labeled gene trees (reconciled gene trees) and a set of adjacencies between the two trees, and the result is a forest of adjacency trees (called adjacency forest) of minimum cost (determined by adjacency gain and breakage), where an adjacency tree describes the descent pattern of adjacencies: for an adjacency AB to descend from an adjacency CD, gene A should descend from gene C and gene B from gene D. An adjacency tree is event-labeled, where the event labeling an internal node may be a speciation, an adjacency duplication or a gene duplication. Adjacency breakage and gene loss are represented by grafted edges, and adjacency gains are the roots of new trees (see Figure 4, inspired from Figure 2 in [88], for an illustration). A polynomial-time dynamic programming algorithm is developed, based on a set of recurrences detailing all the cases of adjacency breakage or gain, depending on whether only one or the two genes of an adjacency are duplicated or lost, together or separately.
Applied to an arbitrary set of adjacencies between an arbitrary set of gene trees, the DECO approach infers a set of ancestral adjacencies for the ancestral species of a species tree. However, as each adjacency is considered independently of the others, the inferred ancestral adjacencies are not guaranteed to be compatible with a linear structure. Although this may be seen as a drawback of the method, it can also be used as a benchmark for correcting gene trees [89].
As reviewed in [18], following the initial model, several extensions of DECO have been considered, such as accounting for horizontal gene transfers (see Section 5.6) [90], or handling fragmented extant genome assemblies (ART-DECO) [91]. These extensions are implemented in a single software package called DECOSTAR [92]. A global score accounting for the gene tree likelihood, the reconciliation cost and the adjacency gain and breakage cost was also developed [93].

Evolution through Segmental Duplications and Losses
Another strategy for considering the co-evolution of adjacent genes in syntenies is to generalize the reconciliation model to account for segmental duplications and losses rather than single-gene events (see Figure 5).
In [94,95], the DL-reconciliation of a gene tree has been generalized to the DL-reconciliation of a "synteny tree" (defined below) accounting for segmental duplications and losses. In this study, a synteny X is an ordered sequence of genes belonging to a genome s(X). The genes of a synteny all belong to different gene families and thus, in particular, tandem duplications are not allowed.
We say that a set G = {Γ_1, Γ_2, ..., Γ_t} of gene families is organized into a set X of syntenies iff there is a bijection between the genes of G and the genes in X (each gene of G belongs to exactly one synteny of X). A synteny tree is a tree with leaves mapped to syntenies. In particular, a synteny tree T for X is a tree with a one-to-one mapping between L(T) and X.

Figure 5. (1) Three gene families (red, blue and green) organized into four syntenies X = {A_1, A_2, B_1, B_2} located in two genomes A and B, and the reconciled gene trees for each gene family (node labels are represented as in Figure 4). Gene orders, as well as gene trees, are consistent; (2) The gene trees embedded in the species tree S = (A, B), illustrating an independent evolutionary history for each gene family; (3) A synteny tree for X (left) and a Super-Reconciliation, i.e., an evolutionary scenario of segmental duplications and losses (right), embedded in the species tree S. Each node of the Super-Reconciliation is labeled by the type of event (duplication represented by a rectangle, speciation by an oval and loss by a cross) and the segment affected by a duplication (inside the rectangle) or a loss (marked by the cross). The bold edge incident to a duplication node is the one leading to the duplicated synteny; (4) An unordered Super-Reconciliation for a new set of syntenies {A_3, A_4, B_3, B_4} that are not order-consistent. The minimum number of duplications and losses explaining this synteny tree is 5, but rearrangements are still required on some branches as orders are not conserved.
The syntenies of a synteny family X are considered to have evolved from a single ancestral synteny through speciation (defined as for single genes), segmental duplications and segmental losses, where:
• a speciation Spe(X, [1, l]) acting on a synteny X = g_1 ⋯ g_l belonging to a genome s(X) has the effect of reproducing X in the two genomes s_l and s_r, children of s(X) in S;
• a (segmental) duplication Dup(X, [i, j]) acting on a synteny X belonging to a genome s(X) is an operation that copies a substring g_i ⋯ g_j of X = g_1 g_2 ⋯ g_i ⋯ g_j ⋯ g_l somewhere else into the genome s(X), creating a new copied synteny X′ = g′_i ⋯ g′_j, where each g′_k, for i ≤ k ≤ j, belongs to the same gene family as g_k;
• a (segmental) loss Loss(X, [i, j]) acting on a synteny X = g_1 ⋯ g_l is an operation that removes a substring g_i ⋯ g_j of X, leading to the truncated synteny X′ = g_1 ⋯ g_{i−1} g_{j+1} ⋯ g_l.
A loss is called full if X′ is the empty string (i.e., all genes of X are removed) and partial otherwise. A partial loss event is denoted pLoss.
Thus, in contrast to a single-gene family, a tree representing the evolution of a set of syntenies is not only labeled by the type of event ẽ(x) corresponding to each internal node x, but also by the segment of the synteny affected by the event.
Now, given a set T = {T_1, T_2, ⋯, T_t} of gene trees for the gene families G = {Γ_1, Γ_2, ⋯, Γ_t} organized into a set X of syntenies belonging to a set Σ of taxa, and given a species tree S for Σ, the goal is to infer a history of segmental duplications and losses that gave rise to the extant syntenies from a unique ancestral synteny. Clearly, the problem can be subdivided into two parts:
1. Given the set T of gene trees for G, find a synteny tree T for X;
2. Given a species tree S for Σ, find a Super-Reconciliation (R, s, ẽ) of T with S, i.e., an event-labeled synteny tree which is a "partial extension" of T, representing a valid history for X, of minimum DL-distance. Here, the DL-distance of (R, s, ẽ) is the number of induced segmental duplications and losses.
Notice that due to partial losses, a valid history for X is not necessarily a binary tree, but rather a partially binary tree, i.e., a tree with each internal node having one or two children. In fact, nodes representing partial losses have a unique child (see for example Figure 5(3), left, node u). This is the reason for using a "partial tree extension" rather than a tree extension, where a partial extension of T is a tree T′ obtained from T by grafting edges or nodes to T, and where grafting a node simply consists of subdividing an edge xy of T, therefore creating a new node z between x and y.
It is important to notice that, ignoring rearrangements, an evolutionary history of duplications (only creating new gene orders, i.e., not modifying existing ones) and losses does not always exist for an arbitrary set of gene orders, and thus for an arbitrary set of syntenies X, regardless of the trees linking them. When such a history exists, the syntenies are said to be order-consistent. As explained in [95], this can be verified in linear time by representing the gene orders as a directed graph and verifying whether it is acyclic. For example, the syntenies {A_1, A_2, B_1, B_2} in Figure 5(1)-(3) are order-consistent, while the syntenies {A_3, A_4, B_3, B_4} in Figure 5(4) are not order-consistent.
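The acyclicity test mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the construction of [95]: each synteny is assumed to be a list of gene-family labels with no repeated family, one arc is drawn from each family to the family that immediately follows it in some synteny, and the function name `is_order_consistent` is hypothetical.

```python
from collections import defaultdict, deque


def is_order_consistent(syntenies):
    """Return True if all gene orders embed into one common linear order.

    Assumption: each synteny is a list of gene-family labels and no label
    is repeated inside a synteny (no tandem duplications).
    """
    successors = defaultdict(set)
    indegree = {}
    for order in syntenies:
        for family in order:
            indegree.setdefault(family, 0)
        # One arc per pair of consecutive families in this synteny.
        for a, b in zip(order, order[1:]):
            if b not in successors[a]:
                successors[a].add(b)
                indegree[b] += 1

    # Kahn's algorithm: the graph is acyclic iff every vertex gets removed.
    queue = deque(f for f, deg in indegree.items() if deg == 0)
    removed = 0
    while queue:
        a = queue.popleft()
        removed += 1
        for b in successors[a]:
            indegree[b] -= 1
            if indegree[b] == 0:
                queue.append(b)
    return removed == len(indegree)
```

Any topological order of the resulting graph is a candidate ancestral gene order of which every input synteny is a subsequence; a cycle means that at least one rearrangement is needed.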

Synteny Tree Reconstruction
The first problem can be handled as a classical phylogenetic reconstruction problem using an alignment of the sequences composing X, each synteny being considered as a single sequence obtained from the concatenation of its gene sequences. However, this method ignores the specificity of each gene family, and it is unsuitable when gene rearrangements obscure the initial alignment. Rather, if a gene tree T_i is available for each gene family Γ_i, then a synteny tree may be obtained from those individual trees. In fact, if the set of trees is "consistent", i.e., the trees do not present contradictory phylogenetic information, then a synteny tree may be represented by a "supertree". The consistency problem of rooted trees has been widely studied. The BUILD algorithm [96] can be used to test, in polynomial time, whether a collection of rooted trees is consistent, and if so, construct a compatible, not necessarily fully resolved, supertree, i.e., a tree displaying them all. This algorithm has been generalized to output all compatible minimally resolved supertrees [97][98][99], whose number may be exponential in the number of genes.
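The consistency test can be illustrated with the usual rooted-triple formulation of BUILD. The sketch below is a minimal version under stated assumptions: the input trees are assumed to have already been decomposed into rooted triples ab|c (that step is not shown), leaves are strings, the output is a nested tuple, and the function name `build` and its signature are illustrative rather than taken from [96].

```python
def build(leaves, triples):
    """Return a nested-tuple tree displaying all triples, or None.

    leaves  : iterable of leaf labels (strings)
    triples : iterable of (a, b, c), read as "a and b are closer to each
              other than either is to c" (ab|c)
    """
    leaves = frozenset(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    if len(leaves) == 2:
        return tuple(sorted(leaves))

    # Union-find over the current leaves: joining a and b for every triple
    # ab|c restricted to these leaves yields the components of the
    # auxiliary (Aho) graph.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    relevant = [t for t in triples if leaves.issuperset(t)]
    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in leaves:
        components.setdefault(find(x), []).append(x)
    if len(components) == 1:
        return None  # the triples cannot all be displayed by one tree

    subtrees = []
    for component in sorted(components.values(), key=min):
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)
```

The returned tree may contain nodes with more than two children, which matches the "not necessarily fully resolved" supertree mentioned above.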
If the gene trees are not consistent, a synteny tree may be obtained from a greedy consensus tree method (strict, majority rule or singular majority rule consensus) [100] reconstructing a "consensus tree", i.e., a tree minimizing a given distance with the set of input trees. Alternatively, we may want to minimally correct the input gene trees so that they become consistent. In the case of gene families likely containing multiple copies in the same synteny, one option is to keep a single copy in each synteny. This is actually required for the super-reconciliation model considered in [94,95], as tandem duplications are not allowed in this model. Formally, given a set of gene trees represented as MUL-trees, i.e., trees with potentially repeated leaf-labels, the problem is to prune them in an appropriate way, keeping a single copy of each label.
MUL-trees remain relatively little studied compared with single-labeled phylogenetic trees, mainly due to the fact that many problems that are tractable for phylogenetic trees become NP-hard when extended to MUL-trees. For example, most generalizations of the tree distances [101], as well as generalizations of consensus tree methods, are exponential in the case of MUL-trees [102][103][104]. In a recent publication [105], we consider two problems related to pruning MUL-trees. First, given a set of MUL-trees, the SET PRUNING FOR CONSISTENCY (MULSETPC) problem asks for a leaf-pruning of each tree leading to a set of consistent trees. Second, processing one gene tree at a time, the MUL-TREE PRUNING FOR RECONCILIATION (MULPR) problem asks for a pruning minimizing a reconciliation cost with a given species tree. Both problems are shown to be NP-hard. Nevertheless, an accurate greedy heuristic for MULPR has been developed.

Super-Reconciliation
Given a set of gene families G = {Γ 1 , Γ 2 , · · · , Γ t } organized into a set of syntenies X , a synteny tree T for X and a species tree S for the set Σ of species containing the syntenies, the problem is to infer a scenario of segmental duplications and losses explaining T with respect to S. Therefore, at this stage, the super-reconciliation problem can be seen as a generalization of the classical reconciliation problem allowing for segmental events.
In [94,95], this problem is handled by a two-step algorithm: first label the internal nodes of T as duplications or speciations following the LCA-mapping (Figure 5(3), left), and then infer an optimal scenario of losses in agreement with this event-labeled tree T̃ (Figure 5(3), right). This two-step method has been shown to be exact, i.e., it leads to a super-reconciliation of minimum cost (DL-distance).
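The first step, labeling by LCA-mapping, can be sketched as follows. This is a generic illustration of LCA-based event labeling, not the implementation of [94,95]. Trees are assumed to be nested tuples with string leaves, `species_of` is a hypothetical dictionary giving the genome of each leaf, and a node is labeled as a duplication when it maps to the same species-tree node as one of its children.

```python
def species_clades(tree, out):
    """Append the leaf set of every node of a nested-tuple tree to `out`."""
    if isinstance(tree, str):
        clade = frozenset([tree])
    else:
        clade = frozenset().union(*(species_clades(c, out) for c in tree))
    out.append(clade)
    return clade


def lca_labeling(tree, species_tree, species_of):
    """Return a list of (internal node, "Dup" or "Spe") in post-order."""
    all_clades = []
    species_clades(species_tree, all_clades)
    events = []

    def lca(species):
        # The LCA of a set of species is the smallest clade containing it.
        return min((c for c in all_clades if species <= c), key=len)

    def visit(node):
        if isinstance(node, str):
            return lca(frozenset([species_of[node]]))
        child_maps = [visit(child) for child in node]
        here = lca(frozenset().union(*child_maps))
        # Duplication if the node maps to the same place as one child.
        events.append((node, "Dup" if here in child_maps else "Spe"))
        return here

    visit(tree)
    return events
```

For instance, with the species tree `("A", "B")` and the tree `(("a1", "b1"), ("a2", "b2"))`, the two inner nodes are labeled as speciations and the root as a duplication. The scan over all clades is quadratic; it is kept only for readability.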
The main problem is the second step, which consists of extending the tree T̃ with losses and inferring the actual event at each node (i.e., the corresponding synteny and the segment being duplicated or lost). As losses and segments affected by the events are fully determined by the gene orders assigned to internal nodes, the problem actually reduces to the small phylogeny problem, i.e., the problem of assigning syntenies to the internal nodes of T̃.
For x ∈ V(T̃), define d(x, X) as the minimum number of segmental duplications and losses induced by a synteny assignment of T̃_x with X being the assignment at x. The SMALL PHYLOGENY FOR SYNTENIES problem is to find an optimal assignment, i.e., an assignment leading to d(T̃) = min_X d(r(T̃), X), for X ranging over the syntenies that are order-consistent with the input set of syntenies. Let x be an internal node of T̃ and x_l, x_r be its two children. Let X, X_l, X_r be valid assignments for x, x_l and x_r, respectively. Then X_l and X_r are subsequences of X. If x is a speciation, then all missing genes in X_l and X_r result from losses. Otherwise, if x is a duplication, then for at most one of X_l and X_r, the missing prefix or suffix can be due to the partial duplication of a segment of X, and all other missing genes should be losses. Therefore, we define two distances: D_T(X, Y) for the minimum number of segmental losses required to transform X to Y, and D_P(X, Y) for the minimum number of segmental losses required to transform a substring of X to Y.
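Under the assumptions that Y is a subsequence of X and that no gene family is repeated, the two distances can be computed by counting maximal runs of missing genes: each run requires exactly one segmental loss, and for D_P the runs touching either end of X are free because the retained substring can exclude them. The sketch below follows this reading; the function names are hypothetical, and the case of an empty Y is degenerate for D_P.

```python
def missing_runs(x, y):
    """Maximal runs of genes of x absent from y, as (start, end) indices.

    Assumption: y is a subsequence of x and no gene is repeated in x.
    """
    kept = set(y)
    runs, start = [], None
    for i, gene in enumerate(x):
        if gene not in kept:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(x) - 1))
    return runs


def d_total(x, y):
    """D_T(X, Y): one segmental loss per maximal run of missing genes."""
    return len(missing_runs(x, y))


def d_partial(x, y):
    """D_P(X, Y): as D_T, but a missing prefix or suffix costs nothing."""
    last = len(x) - 1
    return sum(1 for start, end in missing_runs(x, y)
               if start > 0 and end < last)
```

For X = abcdef and Y = bdf, three runs (a, c and e) are missing, so D_T = 3, while D_P = 2 because the missing prefix a can be left out of the duplicated segment.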
Based on the above observations, recurrence relations are defined for a dynamic programming algorithm computing d(x, X), for each x ∈ V(T̃) and each possible synteny X. The exponential-time complexity of the algorithm is due to the exponential size of the set of syntenies that should be considered at each internal node of T̃.

Unordered Super-Reconciliation
As ignoring rearrangements is usually much too restrictive and asking for consistent gene orders is not very realistic, a variant of the above model is to allow for rearrangements, yet only consider minimizing duplications and losses. Alternatively, this can be seen as a variant of the Super-Reconciliation problem, ignoring gene orders. For example, Figure 5(4) reflects a history for the syntenies {A_3, A_4, B_3, B_4} that are not order-consistent, with three duplications and two losses. However, rearrangements are still required on some branches, for example on the branch leading to B_3 or the one leading to v.
Reducing each synteny X to its set Set(X) of genes (i.e., ignoring the order of genes), an unordered evolutionary history of a set of syntenies can be represented as a partially binary tree where each internal node x corresponds to an event e(Set(X)) with X being the synteny at x and e ∈ {Spe, Dup, pLoss} such that if e is:

1. Spe, then x is a binary node with two children corresponding to syntenies Y and Z such that Set(X) = Set(Y) = Set(Z) and s(Y) and s(Z) are the two children of s(X) in S.

2. Dup, then x is a binary node with two children corresponding to syntenies Y and Z such that Set(Y) = Set(X), Set(Z) ⊆ Set(X) and s(X) = s(Y) = s(Z).

3. pLoss, then x is a unary node with a child corresponding to a synteny Y such that Set(Y) ⊂ Set(X) and s(X) = s(Y).
An Unordered Super-Reconciliation (USR) is a labeled synteny tree representing a valid unordered evolutionary history for X. The UNORDERED SUPER-RECONCILIATION problem then consists of inferring the USR of minimum cost. As for the ordered version of the problem, the USR problem reduces to a small phylogeny problem which consists of inferring the gene contents of the internal nodes of a tree T̃ leading to a minimal duplication and loss cost. As duplications are already determined by the node labeling of T̃, only loss events remain to be minimized, which is done by a dynamic programming algorithm running in time O(|V(T̃)||G|).
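The three event conditions of an unordered history can be checked locally at each node. The sketch below is only a validity checker, not the optimization algorithm. It assumes that each synteny is given as a pair (gene set, genome), that `species_children` is a hypothetical dictionary mapping each internal genome of S to its two children, that either child of a duplication may be the full copy, and that the pLoss condition is strict inclusion.

```python
def is_valid_event(event, parent, children, species_children):
    """Check one node of an unordered history against the event rules.

    parent, children : (gene set, genome) pairs
    species_children : dict mapping an internal genome to its two children
    """
    genes, genome = frozenset(parent[0]), parent[1]
    kids = [(frozenset(g), s) for g, s in children]

    if event == "Spe":
        if len(kids) != 2:
            return False
        (gy, sy), (gz, sz) = kids
        expected = species_children.get(genome, ())
        # Same content in both children, located in the two child genomes.
        return (gy == genes and gz == genes
                and sorted((sy, sz)) == sorted(expected))

    if event == "Dup":
        if len(kids) != 2:
            return False
        (gy, sy), (gz, sz) = kids
        same_genome = sy == genome and sz == genome
        # One child keeps the whole content, the other holds the copy.
        one_full = (gy == genes and gz <= genes) or (gz == genes and gy <= genes)
        return same_genome and one_full

    if event == "pLoss":
        if len(kids) != 1:
            return False
        gy, sy = kids[0]
        return gy < genes and sy == genome  # strict subset, same genome

    return False
```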

Minimizing Duplication Episodes
Another strategy to account for multiple gene duplications is to infer duplication scenarios minimizing duplication episodes, i.e., locations in the species tree where a series of duplications may have occurred. This strategy has been used for the purpose of inferring whole genome duplication events, but it can as well be used for inferring a multiple duplication scenario for the gene families of a set of syntenies.
In the literature dedicated to this problem, a multiple gene duplication refers to a set of single duplications occurring at the same location of the species tree. The most general formulation is the following: given a set of gene trees T = {T_1, T_2, ⋯, T_t} and a species tree S, find evolutionary scenarios for the collection of gene trees that yield the minimum number of multiple gene duplication events.
Most methods presented in the literature start by labeling internal nodes of the gene trees as duplication or speciation nodes according to the LCA-mapping. Two problems are then considered: (1) According to an interval model, assign to each duplication node d of a gene tree T ∈ T an interval Int(d) corresponding to the positions in S where the duplication d may have occurred; (2) According to a clustering model, cluster the duplications into a set of minimum episodes (see Figure 6 for an illustration).
In [106], Paszek and Górecki provide an overview of the interval and clustering models considered in the literature.

The Interval Model
Interval models range from the most constrained one, which restricts Int(d) to the node of S corresponding to s̃(d) (the LCA-mapping of d in S as defined above), to the most relaxed one, the FHS-model introduced by Fellows et al. [107], where d can be mapped to any node between s̃(d) and s̃(r(T)). Notice that the FHS-model may lead to converting speciation nodes of T to duplication nodes.
Between these two interval models, the PG-model by Paszek and Górecki [108] is the most relaxed one leading to the minimum number of duplications for each gene tree T, i.e., the most relaxed interval model that does not lead to converting a speciation into a duplication in an lca-reconciliation of T. Except for the FHS-model, all these interval models are examples of the general interval models presented in [110].

The Clustering Model
Given an interval assignment of each duplication node of the set of gene trees, three different duplication clustering models have been proposed in the literature. The Episode Clustering (EC) model allows clustering any two duplications that can be mapped to the same location in the species tree (yellow boxes in Figure 6), while a slightly more constrained model, called the Minimum Episodes (ME) model, excludes cases in which a duplication and an ancestor of this node (from the same gene tree) are clustered together (orange boxes in Figure 6).
The two problems were introduced by Guigó et al. [109] with the interval model being the GMS-model. These problems can be formulated as NP-hard set cover problems [111]. Alternatively, representing them as a Tree Interval Cover (TIC) Problem, polynomial-time algorithms can be designed for the EC and ME models under the GMS interval model [112,113]. Moreover, linear time and space algorithms for the TIC Problem that apply to the EC and ME models have been developed by Luo et al. [114] under every interval model. Finally, Paszek and Górecki [106] proposed a variant of the algorithm for general interval models presented in [110] that runs in linear time for the ME Problem. Solutions to the EC and ME problems for unrooted gene trees and for the PG-model were also studied [108,115].
Finally, in addition to the EC and ME clustering models, the Gene Duplication Clustering (GD) model [107] is similar to EC except that only duplications from different gene trees can be clustered in a single episode.
Notice that the EC Problem for the FHS-model has a trivial outcome with one cluster. On the other hand, the GD and ME problems for the FHS interval model have been shown to be NP-hard [107,116]. The unconstrained ME model has also been extended to gene losses for different costs of duplications and losses [116].
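To make the difference between the EC and ME models concrete, the sketch below counts episodes in the simplest setting only, namely the most constrained interval model where each duplication is already fixed at a single node of S. It is not one of the cited algorithms. In this setting, EC needs one episode per species node hosting a duplication, while ME needs, at each species node, as many episodes as the longest ancestor-descendant chain of duplications placed there within a single gene tree (duplications that are pairwise unrelated can always share an episode). The tree encoding, a pair (location, children) with location set to None for leaves and speciations, is a hypothetical choice.

```python
from collections import defaultdict


def episode_counts(gene_trees):
    """Return (EC, ME) episode counts for duplications with fixed locations.

    Each gene tree node is a pair (location, children): `location` is the
    species node hosting a duplication, or None for a leaf or speciation.
    """
    # species node -> longest chain of nested duplications mapped to it
    longest = defaultdict(int)

    def walk(node, chain):
        location, children = node
        if location is not None:
            chain = dict(chain)  # copy so sibling subtrees stay independent
            chain[location] = chain.get(location, 0) + 1
            longest[location] = max(longest[location], chain[location])
        for child in children:
            walk(child, chain)

    for tree in gene_trees:
        walk(tree, {})

    ec = len(longest)           # one episode per hosting species node
    me = sum(longest.values())  # nested duplications need separate episodes
    return ec, me
```

With relaxed interval models the location of each duplication becomes part of the optimization, which is where the interval cover formulations discussed above are needed.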

Evolution of Tandemly Arrayed Gene Clusters
An important class of syntenic regions is constituted by Tandemly Arrayed Gene clusters (TAGs). Slippage during recombination, a mechanism at the origin of tandem repeats which is favored by the presence of repetitive sequences, induces a chain reaction eventually leading to the creation of large TAGs, i.e., groups of paralogous genes that are adjacent on a chromosome. TAGs account for about one third of the duplicated genes in eukaryotes [117]. In human, they represent about 15% of all genes [118] forming a number of complex gene clusters. Those repeated regions are however extremely difficult to study or even to assemble correctly due to the fact that during evolution, the duplication status of segments is obscured by subsequent deletions, breaks and rearrangements. When the step of determining the linear gene composition of a TAG cluster is completed, inferring an evolutionary scenario for the tandemly repeated genes is further complicated by the fact that the phylogenetic signal is often obscured by gene conversion.
Methods based on a preprocessing of a self-alignment dot-plot of a cluster or the dot-plot of a pairwise-alignment of two clusters have been developed for reconstructing a hypothetical ancestral sequence and a duplication scenario leading to an observed gene cluster [119][120][121]. Although these methods are useful to infer recent evolutionary events, they are less appropriate for longer timescales as alignment of the nonfunctional regions becomes impossible due to mutations continuously affecting each duplicated segment.
Assuming correct gene orders and gene trees have been obtained, the problem of inferring the evolutionary scenario of a set of TAG clusters can be handled using the tandem-duplication model of evolution first introduced by Fitch in 1977 [122]. This model assumes that from a single ancestral gene at a given position in the chromosome, the locus grows through a series of consecutive duplications placing the newly created copy next to the original one. Such tandem duplications may be simple (duplication of a single gene) or multiple (simultaneous duplication of neighboring genes). Based on this idea, several theoretical studies have considered the problem of reconstructing the tandem-duplication history of a single TAG cluster through tandem duplications only (which is not always possible) [123]. The model has been extended to the study of a set of orthologous TAG clusters, with an evolutionary model accounting for losses and rearrangements, in addition to simple tandem duplications [124,125]. Later, Lajoie et al. [126] developed the DILTAG algorithm inferring all most parsimonious evolutionary histories for a single gene cluster, according to a general cost model involving simple and multiple tandem duplications, deletions and inversions. DILTAG proceeds by exploring a "history graph" (search space), where vertices correspond to ordered gene trees, i.e., gene trees with ordered leaves (gene orders), and edges to evolutionary events. The size of the whole search space being exponential, a greedy heuristic was developed that only conserves, in a queue, the most promising partial evolutionary histories obtained after exploring a given depth of the history graph.

Figure 7. A leaf x_i of T denotes a gene copy in genome X, its index i corresponding to its position in the TAG cluster of genome X, as illustrated on the tips of the species tree S on the right.
In Tremblay-Savard et al. [127], DILTAG was then used for inferring the evolution of a set of orthologous TAGs (see Figure 7 for an example). The developed MULTIDILTAG algorithm proceeds in two steps. First, ignoring gene orders, an lca-reconciliation of the gene tree T with the species tree S is computed, leading to a set Γ(x) of ancestral genes at each internal node x of S. Then, reinserting gene order and sign information on the leaves of S, the order and sign of genes at the internal nodes of S are inferred by traversing S bottom-up and applying DILTAG on each branch (x, y) of S, with the exception that instead of reaching a single gene, the algorithm stops when it reaches the expected number of gene copies. All orders leading to the optimal solution are then conserved in a set S_x.
The set S_A of a leaf A is just the TAG cluster corresponding to A. Now, denote by Γ(x) the set of speciation vertices of the lca-reconciliation of T mapping to x. For example, in Figure 7, Γ(D) = {d_1, d_2, d_3}. Let s ∈ {l, r}, denoting the left or the right child of a node. The pre-speciation genome set PG(x_s) is the subset of Γ(x) with a child in the branch (x, x_s), i.e., the genes in Γ(x) that have not been lost after speciation on the branch going to x_s. The set S_x at each internal node x of S is computed by MULTIDILTAG as follows: (1) For each s ∈ {l, r}, DILTAG is executed on each element of S_{x_s}, and stops as soon as the attained gene order contains |PG(x_s)| genes. The set of all ancestral gene orders obtained (output of DILTAG) forms an initial pre-speciation set PS_{x_s} (further refined by removing the elements that do not lead to a minimum cost). During the execution of the algorithm, a solution graph is incremented by adding the appropriate "speciation edges" from the elements of S_x to those of PS_{x_l} and PS_{x_r} giving rise to the minimum distance.
Although the MULTIDILTAG heuristic shows good performance in inferring the total number and size distribution of duplication events on simulated datasets, its limitation lies in dealing with multiple gene deletions, as the algorithm is highly exponential in this case and quickly becomes intractable.

Accounting for Horizontal Gene Transfers
Horizontal gene transfer, largely involved in shaping bacterial gene content, has been included later in the reconciliation analysis of gene families for the purpose of inferring scenarios of duplications, losses and transfers, and of identifying xenologous gene copies in addition to orthologs and paralogs. In this context, the DTL distance is the minimum number of Duplications, Transfers and Losses explaining a gene tree T given a species tree S. The following review of DTL-reconciliation is largely inspired by [17].
For a transfer to happen from a source genome s(x) to a target genome s(y) (x and y being two nodes of the gene tree T), both genomes should have coexisted. Therefore, a time-consistent HGT scenario should allow ordering the internal nodes of the species tree S. The problem of finding a most parsimonious time-consistent (or acyclic) DTL-scenario is NP-hard [128][129][130][131]. However, the DTL-distance problem becomes polynomial if the consistency requirement is relaxed [132,133]. The main idea is to consider all possible mappings of T nodes to S nodes, using a dynamic programming approach.
More precisely, let c(x, s) be the minimum cost of a reconciliation of T_x with S such that x is mapped to s ∈ V(S). The gene tree T is processed bottom-up, with the base case corresponding to leaves x ∈ L(T) treated as follows: c(x, s) = 0 if s = s(x), and c(x, s) = ∞ otherwise. As for an internal node x with children y and z, we must consider the three possibilities of x being labeled as a speciation, duplication or HGT node, with c_s(x, s), c_d(x, s) and c_t(x, s) representing, respectively, those three mutually exclusive cases. Then, c(x, s) = min{c_s(x, s), c_d(x, s), c_t(x, s)}. Finally, the minimum cost of a reconciliation of T with S is min_{s ∈ V(S)} c(r(T), s).
Ignoring losses and considering the cost of a reconciliation as being the number of duplications and HGTs, the following recurrences hold [133]:
c_d(x, s) = min{1 + c(y, t) + c(z, u) for all descendants t, u of s in S}
c_t(x, s) = min{1 + c(y, t) + c(z, u) for all t being a descendant of s in S and all u being incomparable to s}
A straightforward implementation of these recurrences leads to an algorithm running in O(mn^2) time, where m = |V(T)| and n = |V(S)|. This time complexity has been further improved to O(mn) [134].
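The recurrences can be transcribed directly into a memoized function, as in the unoptimized sketch below. Several points are assumptions made for illustration rather than details taken from [133]: leaves cost 0 at their own genome and infinity elsewhere, "descendant of s" includes s itself, a speciation at s sends the two children into the subtrees of the two different children of s, and either child of a transfer node may be the transferred copy. Trees are nested tuples with string leaves, and `dt_cost` and `species_of` are hypothetical names.

```python
import math
from functools import lru_cache


def dt_cost(gene_tree, species_tree, species_of):
    """Minimum number of duplications and transfers (losses ignored)."""
    nodes, children, below = [], {}, {}

    def index(node):
        # Record each species node, its children and its descendants.
        kids = () if isinstance(node, str) else node
        desc = {node}
        for kid in kids:
            desc |= index(kid)
        nodes.append(node)
        children[node] = kids
        below[node] = desc
        return desc

    index(species_tree)

    def incomparable(s):
        return [u for u in nodes if u not in below[s] and s not in below[u]]

    @lru_cache(maxsize=None)
    def c(x, s):
        if isinstance(x, str):  # leaf: must sit in its own genome
            return 0 if species_of[x] == s else math.inf
        y, z = x

        def best(g, targets):
            return min((c(g, t) for t in targets), default=math.inf)

        cost = math.inf
        if children[s]:  # speciation (assumed form, see lead-in)
            left, right = children[s]
            cost = min(cost,
                       best(y, below[left]) + best(z, below[right]),
                       best(y, below[right]) + best(z, below[left]))
        # duplication: both children stay at or below s
        cost = min(cost, 1 + best(y, below[s]) + best(z, below[s]))
        # transfer: one child stays below s, the other leaves to an
        # incomparable node
        away = incomparable(s)
        cost = min(cost,
                   1 + best(y, below[s]) + best(z, away),
                   1 + best(z, below[s]) + best(y, away))
        return cost

    return min(c(gene_tree, s) for s in nodes)
```

Because the two children are minimized independently, each call scans the species nodes a constant number of times; the faster algorithms cited above avoid recomputing these minima.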
The above recurrences may be adapted to handle losses. David and Alm [135] described AnGST, an algorithm for the DTL distance running in O(mn^2), while Bansal et al. [132] described RANGER-DTL, an algorithm for the DTL distance running in O(mn).
When divergence time information, or a temporal ordering of internal nodes, is available for S, then the DTL-scenario must respect this ordering (i.e., HGT events are constrained to occur only between co-existing species). A DTL-scenario respecting a dated tree is called a date-respecting DTL-scenario. Bansal et al. [132] show how the definition of a reconciliation and the above recurrences can be adapted to solve this problem. They give an algorithm with O(mn log n) time complexity. Notice that a date-respecting DTL-scenario is not necessarily time-consistent. In fact, scenarios may be locally consistent (i.e., HGT events occurring between co-existing species), but globally inconsistent. Global consistency may be obtained by subdividing the species tree S into slices and exploring slices one after the other. This strategy was first used in [136], leading to an O(nm^4) algorithm. Later, Doyon et al. [128] improved the computation of a most parsimonious DTL-reconciliation for dated trees to O(mn^2).
Recently, we extended the reconciliation approach to a special case of horizontal gene transfers, namely Endosymbiotic Gene Transfers (EGT), where genes are transferred solely between the mitochondrial and nuclear genomes of the same species [137]. Such transfers from the mitochondrion to the nucleus have been a driving force in the evolution of eukaryotes since the unique ancestral endosymbiotic event integrating an α-proteobacterium into a host eukaryotic cell. The DLE-distance (DLE for Duplication, Loss and EGT) is easier to compute than the DTL-distance, as there is no need to explore a set of source genomes, nor is there a risk of time inconsistency. The linear-time algorithm developed in [137] for an arbitrary cost of operations can be seen as an adaptation of the quadratic-time DTL algorithm for dated trees, which allows transfers between any co-existing species [128].
Extending the DTL-reconciliation to super-reconciliation, i.e., allowing for segmental gene transfers, raises many questions that remain to be explored in depth. Given a synteny tree T̃ and a species tree S, the super-reconciliation method presented in Section 5.3 runs in two steps: (1) infer the event-labeling of the internal nodes of T̃, this labeling being the one minimizing duplication nodes; (2) infer an optimal scenario of losses in agreement with this event-labeling. This two-step methodology has been shown to be exact in the case of the DL-reconciliation, i.e., it has been shown that this method leads to the DL-distance. However, this is no longer true for the DTL distance. In fact, a reconciliation minimizing duplications, transfers and losses is not necessarily a reconciliation minimizing duplications and transfers. Indeed, considering a node as a transfer rather than as a speciation node may decrease the number of required losses. We are presently investigating this problem of generalizing super-reconciliation to handle transfers, in addition to duplications and losses.

Conclusions
Despite the large effort dedicated to the development of methods for inferring the evolution of synteny blocks, a lot remains to be done towards a unifying model allowing consideration of the variety of evolutionary events shaping the genomes. As reviewed in this paper, most of the algorithmic effort has been invested in the genome rearrangement field on one side, and in the reconciliation field for inferring individual gene gain and loss on the other side. Combining order and content information for the purpose of inferring a co-evolutionary history of genes remains a largely under-explored field.
In most studies on gene gain and loss, segmental movements are indirectly inferred from single-gene movements, by combining co-occurring events or concatenating individual adjacencies. For example, although allowing for a wide variety of events (rearrangements, tandem duplications, losses, HGTs, etc.), the DECO collection of algorithms relies on minimizing the gain or breakage of single adjacencies, with no direct link to segmental events. The same holds for algorithms dedicated to minimizing episodes of events, which only indirectly refer to multiple duplications and ignore order information. As for the reviewed algorithms for tandemly arrayed gene clusters, although they account for both rearrangements and segmental duplications and losses, duplications are limited to tandem duplications, and they neither allow for the study of paralogous syntenies nor account for the gain or loss of syntenies. Conversely, the Super-Reconciliation approach only accounts for transposed duplications, ignoring tandem duplications. This latter approach, which is a direct generalization of the reconciliation approach to segmental events, opens the door to the same algorithmic perspectives as the classical reconciliation approach, such as generalization to HGT events. However, its main limitation is the difficulty of generating an accurate synteny tree, though a solution may actually be to choose a tree optimizing such a Super-Reconciliation cost. In any case, a better use of individual gene trees, including the inconsistency among those trees, should be considered, and ways of including tandem duplications should be explored, as well as better solutions to integrate rearrangement inference algorithms.
To summarize, accounting at the same time for sequence divergence (gene trees), gene order and gene content, evolving through point mutations (substitutions and indels), rearrangements and segmental gain and loss events, remains a challenge. Beyond the difficulty of designing an appropriate algorithmic method that overcomes the limitations of existing methods, the problem is to design appropriate scores accounting for the variety of evolutionary events. Designing such a general score has been undertaken in [93], but a lot remains to be done in this field.
Of course, all the algorithmic methods presented in this paper suffer from the standard limitations of parsimony methods [138,139], such as the impossibility of accounting for multiple state changes along a branch of a phylogeny, or for uncertainty in phylogenetic reconstructions. An overall statistical framework for evaluating evolutionary hypotheses [140] is surely better than a heuristic outputting a single solution with no measure of reliability. However, the larger the list of considered evolutionary events, the more difficult it becomes to sample appropriately for such a statistical study. Being able to generate the whole set of most parsimonious scenarios remains an interesting approach for a probabilistic evaluation of the solutions.
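As a small illustration of why a single parsimonious solution can be misleading, the sketch below counts the co-optimal ancestral labelings of one discrete character (for instance, the presence "1" or absence "0" of a gene) on a fixed tree, using a Sankoff-style dynamic programming with unit change costs. It is a toy stand-in for the far richer scenario spaces discussed in this review, not one of the reviewed methods; the function names, the nested-tuple tree encoding and the example tree are assumptions made for this example.

```python
def sankoff_count(tree, states):
    """Return {state: (min_cost, n_optimal_labelings)} for the subtree root.

    Assumptions: a leaf is a string giving its observed state, an internal
    node is a tuple of subtrees, and every state change costs 1.
    """
    if isinstance(tree, str):  # leaf: only the observed state is feasible
        return {s: ((0, 1) if s == tree else (float("inf"), 0)) for s in states}
    table = {s: (0, 1) for s in states}
    for child in tree:
        child_table = sankoff_count(child, states)
        for s in states:
            # Cost and count of each child state t, given parent state s.
            options = [(c + (s != t), n) for t, (c, n) in child_table.items()]
            best = min(c for c, _ in options)
            ways = sum(n for c, n in options if c == best)
            cost, count = table[s]
            table[s] = (cost + best, count * ways)
    return table


def most_parsimonious(tree, states):
    """Return (minimum number of changes, number of optimal labelings)."""
    table = sankoff_count(tree, states)
    best = min(c for c, _ in table.values())
    return best, sum(n for c, n in table.values() if c == best)


# Hypothetical toy tree: two cherries, each with one "1" leaf and one "0" leaf.
tree = (("1", "0"), ("1", "0"))
cost, n_solutions = most_parsimonious(tree, ("0", "1"))
print(cost, "changes;", n_solutions, "co-optimal labelings")
```

On this example, two different ancestral labelings achieve the minimum of two changes, so reporting only one of them would hide half of the optimal solution space.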
Another related problem lies in the possibility of performing appropriate simulations for testing the accuracy of the developed algorithms. This goal is also complicated by the growing number of data types and evolutionary events considered. Moreover, some events remain largely unexplored for this purpose, such as Endosymbiotic Gene Transfer (EGT), a special case of HGT where genes are exchanged only between the mitochondrial and nuclear genomes of the same species. This event is known to have played a major role in the evolution of eukaryotes [141,142]. Although prior work provides useful insights into the parameters influencing such an event [143,144], an appropriate model for simulating EGT evolutionary histories, which could be used to assess the accuracy of our algorithm in [137], remains to be designed.