Reconstructing Phylogeny by Aligning Multiple Metabolic Pathways Using Functional Module Mapping

Comparison of metabolic pathways provides a systematic way for understanding the evolutionary and phylogenetic relationships in systems biology. Although a number of phylogenetic methods have been developed, few efforts have been made to provide a unified phylogenetic framework that sufficiently reflects the metabolic features of organisms. In this paper, we propose a phylogenetic framework that characterizes the metabolic features of organisms by aligning multiple metabolic pathways using functional module mapping. Our method transforms the alignment of multiple metabolic pathways into constructing the union graph of pathways, builds mappings between functional modules of pathways in the union graph, and infers phylogenetic relationships among organisms based on module mappings. Experimental results show that the use of functional module mapping enables us to correctly categorize organisms into main categories with specific metabolic characteristics. Traditional genome-based phylogenetic methods can reconstruct phylogenetic relationships, whereas our method can offer in-depth metabolic analysis for phylogenetic reconstruction, which can add insights into traditional phyletic reconstruction. The results also demonstrate that our phylogenetic trees are closer to the classic classifications in comparison to existing classification methods using metabolic pathway data.


Introduction
Phylogenetic reconstruction uses genetic information to reconstruct phylogenetic relationships among living organisms, which is a means to gain insight into the history of species and to retrospect the evolution of species. Some phylogenetic inference methods reconstructed phylogenetic trees based on the similarity of sequences of genes encoding 16S ribosomal RNAs and other marker genes [1]. In recent years, the quantity and quality of the metabolic pathway data have been greatly increased in biological databases like KEGG [2]. Comparative analysis of this vast quantity of metabolic pathway data provides a systematic way of exploring phylogenetic relationships between organisms, which has been demonstrated to be very effective for classifying organisms [1].
Much effort has been made to reconstruct phylogenetic trees in this way, which can be divided into two types. The first type aligns a single pair of metabolic pathways to compute the distance between organisms based on the similarity of enzymes [3], the similarity of reactions [4], and the topological similarity of pathways [5], and then creates phylogenetic trees using such distances. Since multiple metabolic pathways can provide more phylogenetic information than a single metabolic pathway, comparing multiple pathways among organisms can improve the accuracy of phylogenetic analysis. Following this motivation, the second type reconstructs phylogenetic trees by aligning multiple metabolic pathways among organisms. Mano et al. [6] adapted Meta Pathway Hunter (MPH) [7] to compare multiple metabolic pathways and analyze phylogenetic relationships for different organisms. Similarly, Ma et al. [1] employed IsoRankN [8], which is a global multiple-network alignment tool using spectral clustering methods, to investigate phylogenetic relationships from metabolic pathways. Subsequently, Clemente et al. extended the EC-based classification method to compute the structural similarity between pathways, and built phylogenetic trees by aligning all common metabolic pathways of different organisms [9].
In addition, some methods reconstructed phylogeny in other ways. For instance, Mazurie et al. computed phylogenetic distances by exploiting descriptors of metabolic reactions and obtained the phylogenetic trees that are similar to 16S rRNA-based trees [10]. Borenstein et al. provided a logical framework to infer the seed set of a given network and presented a seed method based on essential metabolites to reconstruct phylogenetic trees [11]. Chang et al. represented each organism as a vector of substrate-product relationships and reconstructed phylogenetic trees by comparing the vectors [12].
Although considerable progress has been achieved in reconstructing phylogeny, few efforts have focused on designing a unified phylogenetic framework that sufficiently reflects the metabolic features of organisms [6]. A functional module is a sub-network in metabolic pathway, which performs a certain metabolic function with specific topology [13]. Biologically, the metabolic features of organisms can be inferred from such functional module. Moreover, the topology and metabolic function of functional modules in common pathways are similar for the organisms from the same domain, whereas they may have certain differences for the organisms from different domains. And such differences may contain useful phylogenetic information. Therefore, comparative analysis of functional module mappings between multiple pathways can offer an effective way for building unified phylogenetic framework and revealing the metabolic features of organisms.
In this paper, we propose a phylogenetic framework called MMAL that can characterize the metabolic features of organisms by aligning multiple metabolic pathways using functional module mapping. Our method transforms the alignment of multiple metabolic pathways into the construction of union graph of pathways using functional module mapping, which differs from the existing metabolic pathway alignment methods that directly compare the compounds and reactions in metabolic pathways. By clustering the nodes in the union graph, MMAL identifies the functional modules of pathways and build the mappings between these modules. Finally, MMAL computes the similarity between pathways by comparing the mapped functional modules in pathways, and infers phylogenetic relationships using the similarity.
We demonstrated the effectiveness of MMAL by comparing resulting phylogenetic trees with the NCBI taxonomy. Note that the goal of our work is to explore the phyletic reconstruction from in-depth metabolic analysis, which cannot be afforded by the traditional classification scheme such as the NCBI taxonomy, and to provide insights into traditional phyletic reconstruction from the metabolism standpoint and offer a useful complement to the traditional phylogenetic methods. On the other hand, the NCBI taxonomy is a classic classification scheme that can be used to measure the quality of the resulting trees, and the aim of comparing resulting trees with the NCBI taxonomy is to evaluate the quality of the trees and represent the similarities and differences between resulting trees and classic classifications.
The experimental results show that the use of functional module mapping enables us to correctly categorize organisms into main categories with specific metabolic characteristics. Traditional genome-based phylogenetic methods can infer phylogenetic relationships, whereas our method can offer in-depth metabolic analysis for the phylogenetic reconstruction, which can add insights into traditional genome-based phyletic reconstruction. For example, by analyzing the resulting phylogenies, we have revealed that the metabolic structure of archaea is different from other species. The results also demonstrate that our classification results are consistent with the metabolic features of organisms and are closer to the classic classifications in comparison to existing classification methods using metabolic pathway data.

Results
The overview of our method is summarized in Figure 1. In this work, the metabolic pathway data were retrieved from the KEGG database [2]. For given k organisms, each organism has p common metabolic pathways. We first perform MMAL to align these k × p metabolic pathways and produce a distance matrix of these k organisms. Then, based on the distance matrix, we build a phylogenetic tree for these organisms using the software tool PHYLIP [14], which is usually used to build the phylogenetic trees, and show the phylogenetic tree by the visualization tool TreeView [15]. The program kitsch in the PHYLIP package with the neighbor-joining algorithm was used to build a pathway-based phylogenetic tree from the distance matrix. To evaluate the quality of the produced trees, following [9], we use the well-known software package COUSINS [16] to compare the similarity between the produced tree and the NCBI taxonomy. This framework compares the trees based on cousin pairs: a sibling is a cousin of degree 0, a niece is a cousin of degree 0.5, a first cousin is a cousin of degree 1, and so on [16]. Then, two trees can be compared based on the set of pairs of each degree [16].
As can be seen in Figure 2, our produced trees and Heymans et al.'s tree are similar to each other, and they are similar to the NCBI taxonomy, although there are some differences. The two mycoplasma mge and mpn are the two closest organisms and they are grouped together. The two archaea afu and mja are also clustered together. Comparing T 1 , T 2 , T 3 , and T 4 , we can find that their differences become smaller as the number of aligned pathways increases, which are consistent with the tendency of similarity values in Figure 3. Interestingly, the two archaea afu and mja always formed a single group in each case of our resulting trees, which suggests that the domain of archaea has particular characteristics in the metabolic pathways. This implies that the metabolic structure of archaea is different from that of other species. It is also interesting to observe from Figure 3 that, in 13 out of 21 cases, the similarities of our trees to the NCBI taxonomy are higher than 0.3, which are larger than the similarity between Heymans et al.'s tree and the NCBI taxonomy. This indicates that, although the overall classification results were similar, our method can produce better classifications for these 16 organisms in comparison to another phyletic method.  The union graph is a graph that is constructed by the composite nodes and does not have edges. In the figure, each of common pathways 1, 2, and 3 has a union graph. In phase I, the union graph is constructed for common pathways 1, 2, and 3 respectively. In phase II, the mapped functional modules (the modules are composed of the nodes circled in blue dashed line in the figure) in the union graph of common pathways 1, 2, and 3 are identified. In phase III, MMAL obtains organism distance matrix from the comparison of the mapped functional modules, and builds the phylogenetic tree based on the matrix, wherein dij is the organism distance between organism Oi and Oj, i = 1, 2, 3, 4, j = 1, 2, 3, 4. The union graph is a graph that is constructed by the composite nodes and does not have edges. In the figure, each of common pathways 1, 2, and 3 has a union graph. In phase I, the union graph is constructed for common pathways 1, 2, and 3 respectively. In phase II, the mapped functional modules (the modules are composed of the nodes circled in blue dashed line in the figure) in the union graph of common pathways 1, 2, and 3 are identified. In phase III, MMAL obtains organism distance matrix from the comparison of the mapped functional modules, and builds the phylogenetic tree based on the matrix, wherein d ij is the organism distance between organism O i and O j , i = 1, 2, 3, 4, j = 1, 2, 3, 4.
On the other hand, as shown in Figure 3, the similarity of our produced tree to the NCBI taxonomy fluctuates when the number of pathways is less than 15. However, with the increasing number of pathways, the reconstructed trees had a tendency toward stable similarity to the NCBI taxonomy. This implies that the quality of the reconstructed tree is directly affected by the number of aligned pathways and remains stable when the number of pathways becomes large.
Although the metabolic pathway data of the KEGG database are probably among the most reliable available datasets of this kind [18], some errors in the metabolic pathway data may occur and could affect the classification results. To investigate the robustness against such errors, we randomly added errors to the edges or nodes of each metabolic pathway by randomly modifying a fraction of nodes or edges (eliminating existing ones and/or making new ones). We then reconstructed the phylogenies for these 16 organisms by aligning the pathways with errors. The robustness check  As can be seen in Figure 2, our produced trees and Heymans et al.'s tree are similar to each other, and they are similar to the NCBI taxonomy, although there are some differences. The two mycoplasma mge and mpn are the two closest organisms and they are grouped together. The two archaea afu and mja are also clustered together. Comparing T1, T2, T3, and T4, we can find that their differences become smaller as the number of aligned pathways increases, which are consistent with the tendency of similarity values in Figure 3. Interestingly, the two archaea afu and mja always formed a single group in each case of our resulting trees, which suggests that the domain of archaea has particular characteristics in the metabolic pathways. This implies that the metabolic structure of archaea is different from that of other species. It is also interesting to observe from Figure 3 that, in 13 out of 21 cases, the similarities of our trees to the NCBI taxonomy are higher than 0.3, which are larger than the similarity between Heymans et al.'s tree and the NCBI taxonomy. This indicates that,   As can be seen in Figure 2, our produced trees and Heymans et al.'s tree are similar to each other, and they are similar to the NCBI taxonomy, although there are some differences. The two mycoplasma mge and mpn are the two closest organisms and they are grouped together. The two archaea afu and mja are also clustered together. Comparing T1, T2, T3, and T4, we can find that their differences become smaller as the number of aligned pathways increases, which are consistent with the tendency of similarity values in Figure 3. Interestingly, the two archaea afu and mja always formed a single group in each case of our resulting trees, which suggests that the domain of archaea has particular characteristics in the metabolic pathways. This implies that the metabolic structure of archaea is different from that of other species. It is also interesting to observe from Figure 3 that, in 13 out of 21 cases, the similarities of our trees to the NCBI taxonomy are higher than 0.3, which are larger than the similarity between Heymans et al.'s tree and the NCBI taxonomy. This indicates that, As shown in Figure 4, the similarity to the original tree decreases as the fraction of added errors increases and the similarity value becomes 0.8~0.85 when the error rate is 2%. This demonstrates that a moderate quantity of errors in the metabolic pathway data of the KEGG database affect the results in deed. Interestingly, with the continuously increasing error ratio, it can be seen that the similarity value falls even more clearly when the node errors goes up, which suggests that the impact of the node errors to the classification results is larger than that of the edge errors. and could affect the classification results. To investigate the robustness against such errors, we randomly added errors to the edges or nodes of each metabolic pathway by randomly modifying a fraction of nodes or edges (eliminating existing ones and/or making new ones). We then reconstructed the phylogenies for these 16 organisms by aligning the pathways with errors. The robustness check was performed by calculating the similarity of the trees based on the pathways with errors to the original tree. We performed the robustness check for each fraction of errors 10 times and showed the average values in Figure 4. As shown in Figure 4, the similarity to the original tree decreases as the fraction of added errors increases and the similarity value becomes 0.8~0.85 when the error rate is 2%. This demonstrates that a moderate quantity of errors in the metabolic pathway data of the KEGG database affect the results in deed. Interestingly, with the continuously increasing error ratio, it can be seen that the similarity value falls even more clearly when the node errors goes up, which suggests that the impact of the node errors to the classification results is larger than that of the edge errors.
Since our phylogenetic trees are built based on the metabolic pathway data, in addition to comparing the resulting trees with the NCBI taxonomy, comparing our phylogenetic trees with the phylogenetic trees produced by other phylogenetic methods using metabolic pathway data can help to evaluate the effectiveness of our method from metabolisms. In the following, we analyze the resulting phylogenetic trees in detail and compare the quality of our produced trees with those constructed by competing methods [1,9,12,19] using metabolic pathway data. Since our phylogenetic trees are built based on the metabolic pathway data, in addition to comparing the resulting trees with the NCBI taxonomy, comparing our phylogenetic trees with the phylogenetic trees produced by other phylogenetic methods using metabolic pathway data can help to evaluate the effectiveness of our method from metabolisms. In the following, we analyze the resulting phylogenetic trees in detail and compare the quality of our produced trees with those constructed by competing methods [1,9,12,19] using metabolic pathway data.

Comparison with the Classification Based on Substrate-Product Relationships and the Classification Based on EC Hierarchy
EC hierarchy is the measure based on the enzyme hierarchy proposed by the Enzyme Commission of the International Union of Biochemistry and Molecular Biology, which can be used to assess the similarity of enzymes. In EC hierarchy, each enzyme is appointed a number, the EC number consisting of four digits, and the similarity of enzymes can be measured by comparing the EC number. Clemente et al. [9] used the EC hierarchy to define pathway similarity and study the relationship among eight photosynthetic bacteria and two photosynthetic eukaryotes by pseudo-alignment of over 60 metabolic pathways.
These photosynthetic bacteria are Anabaena sp. PCC7120 (ana), Gloeobacter violaceus (gvi), Prochlorococcus marinus SS120 (pma), Prochlorococcus marinus MED4 (pmm), Prochlorococcus marinus MIT 9313 (pmt), Synechocystis sp. PCC 6803 (syn), Synechococcus sp. WH8102 (syw), Thermosynechococcus elongatus (tel). The two photosynthetic eukaryotes are Arabidopsis thaliana (ath), Cyanidioschyzon merolae (cme). By aligning 57 common metabolic pathways for these 10 organisms, we reconstructed phylogenetic tree T 1 . For these 10 organisms, Chang et al. [12] represented each organism as a vector of substrate-product relationships and reconstructed phylogenetic tree T 2 by comparing the vectors. Clemente et al. [9] reconstructed phylogenetic tree T 3 by the EC hierarchy. Figure 5 shows our produced tree T 1 , Chang et al.'s tree T 2 , Clemente et al.'s tree T 3 , and the NCBI taxonomy T for these 10 organisms. Table 1 shows the similarity measures of the reconstructed tree to the NCBI taxonomy T for these 10 organisms in Figure 5. Table 1. Similarity of reconstructed tree to the NCBI taxonomy T for the organisms in Figure 5. organisms, we reconstructed phylogenetic tree T1. For these 10 organisms, Chang et al. [12] represented each organism as a vector of substrate-product relationships and reconstructed phylogenetic tree T2 by comparing the vectors. Clemente et al. [9] reconstructed phylogenetic tree T3 by the EC hierarchy. Figure 5 shows our produced tree T1, Chang et al.'s tree T2, Clemente et al.'s tree T3, and the NCBI taxonomy T for these 10 organisms. Table 1 shows the similarity measures of the reconstructed tree to the NCBI taxonomy T for these 10 organisms in Figure 5.  As can be seen from Figure 5, the classifications of T1, T2, and T3 are similar to each other. For instance, ana, gvi, and syn are grouped together to a branch, while pmm, pma, pmt, and syw are grouped together to another branch. We also can see that the classifications of gvi, syn and ana are similar in T1, T2, and T3, whereas the branching of tel in T2 is different from that of T1 and T3. Chang et al.'s classification regards tel and gvi as metabolic out-groups in T2 because Chang et al. [12] indicated that gvi and tel were isolated from rocks and hot springs respectively, while the remaining As can be seen from Figure 5, the classifications of T 1 , T 2 , and T 3 are similar to each other. For instance, ana, gvi, and syn are grouped together to a branch, while pmm, pma, pmt, and syw are grouped together to another branch. We also can see that the classifications of gvi, syn and ana are similar in T 1 , T 2, and T 3 , whereas the branching of tel in T 2 is different from that of T 1 and T 3 . Chang et al.'s classification regards tel and gvi as metabolic out-groups in T 2 because Chang et al. [12] indicated that gvi and tel were isolated from rocks and hot springs respectively, while the remaining six species were isolated from fresh or sea water. On the other hand, the common pathways of gvi, syn, and ana are similar in sequence and are different from the common pathways of tel, and therefore T 1 and T 3 separate tel from gvi, syn, and ana. In order to accurately separate the species with similar pathways, such as gvi, syn, and ana, a possible strategy is to combine genome features to the comparison of pathways, which is left as our future work.

Reconstructed Tree Similarity
Since these eight photosynthetic bacteria share a large number of metabolic pathways with these two photosynthetic eukaryotes, it is extremely difficult to distinguish them by comparing their metabolic pathways alone [12]. However, as shown in Figure 5, by using the functional module mapping in the alignment of multiple pathways, our method constructed a phylogeny that distinguishes the photosynthetic eukaryotes from the photosynthetic bacteria, although it failed to separate the Chroococcales (syn, syw, tel) from the Prochlorales (pmm, pma, pmt), the Nostocales (ana), and the Gloeobacterales (gvi). This indicates that the use of functional module mapping depicts the metabolic characteristics of photosynthetic bacteria and photosynthetic eukaryotes and also contains a substantial quantity of phylogenetic information.
Meanwhile, upon detailed comparison of tree topologies, we can also observe that the classifications {cme, ath} and {pmt, pma} of T 1 , T 2 , and T 3 are the same as T, while our produced tree T 1 is more similar to T because the hierarchy of T 1 is closer to T than T 2 and T 3 . More precisely, as shown in Table 1, comparing T 1 , T 2 , and T 3 to T, the similarity of T 1 to T is 0.38, the similarity of T 2 to T is 0.19, and the similarity of T 3 to T is 0. 16. This demonstrates that our method is capable of reconstructing the phylogenies that are closer to the NCBI taxonomy than another two classification methods for these 10 organisms.

Comparison with the Classification Based on Comparing Single Pair of Metabolic Pathways
MP-Align [19] is a typical classification method based on comparing single pair of metabolic pathways. In this section, we compare the phylogenetic tree produced by MMAL with the tree produced by MP-Align.
Alberich et al. [19] analyzed the phylogenies for the following 8 organisms: Archaeoglobus fulgidus (afu), Clostridium perfringens (cpe), Haemophilus influenzae (hin), Listeria innocua (lin), Methanococcus jannaschii (mja), Mus musculus (mmu), Neisseria meningitidis MC58 (nme), and Rattus norvegicus (rno). The classification of these organisms is Bacteria (cpe, hin, lin, nme), Archaea (mja, afu), and Animals (mmu, rno). Alberich et al. performed the pairwise comparison of these 8 organisms for each common pathway and combined the computed scores to generate distance measures between these organisms and constructed phylogenetic tree T 2 for the above 8 organisms. By aligning 47 common pathways for these 8 organisms, we reconstructed phylogenetic tree T 1 . Figure 6 shows our tree T 1 and Alberich et al.'s tree T 2 (T 1 and T 2 are the same) and the NCBI taxonomy T for these 8 organisms.
The similarities of T 1 and T 2 to T are 0.31.  As can be observed in Figure 6, both MMAL and MP-Align do not recover exactly the phylogeny of the NCBI taxonomy, but they can correctly distinguish Bacteria, Archaea, and Animals and successfully classify the Bacteria into two distinct classifications {cpe,lin} and {hin, nme}, as in the NCBI taxonomy. Note that the two eukaryote rno and mmu are the two closest organisms, our method correctly grouped rno and mmu together and also placed the Bacteria cpe, hin, nme, and lin in a group separated from the other species. This implies that these parasitic bacteria show anomalous metabolism in comparison with other species, this is an interesting result which requires further investigation for its reason.  As can be observed in Figure 6, both MMAL and MP-Align do not recover exactly the phylogeny of the NCBI taxonomy, but they can correctly distinguish Bacteria, Archaea, and Animals and successfully classify the Bacteria into two distinct classifications {cpe, lin} and {hin, nme}, as in the NCBI taxonomy. Note that the two eukaryote rno and mmu are the two closest organisms, our method correctly grouped rno and mmu together and also placed the Bacteria cpe, hin, nme, and lin in a group separated from the other species. This implies that these parasitic bacteria show anomalous metabolism in comparison with other species, this is an interesting result which requires further investigation for its reason.

Green Sulfur and Green Non-Sulfur Bacteria
Green sulfur and green non-sulfur bacteria use two different sources of electrons in photosynthesis. Green sulfur bacteria use sulfide ion as the electron donor, whereas green non-sulfur bacteria do not [1]. Ma et al. selected 14 organisms from green sulfur and green non-sulfur bacteria and reconstructed phylogenetic tree T 2 for these 14 organisms: Anaerolinea thermophila(atm), Caldilinea aerophila (cap), Chloroflexus aurantiacus(cau), Dehalococcoides mccartyi 195 (det), Dehalogenimonas lykanthroporepellens (dly), Herpetosiphon aurantiacus (hau), Roseiflexus sp. RS-1(rrs), Sphaerobacter thermophiles (sti), Thermomicrobium roseum(tro), Chlorobium chlorochromatii(cch), Chlorobaculum tepidum (cte), Chloroherpeton thalassium(cts), Prosthecochloris aestuarii (paa), and Pelodictyon luteolum (plt). By aligning 52 common metabolic pathways for these 14 organisms, we reconstructed phylogenetic tree T 1 . Figure 7 displays our produced tree T 1 , and Ma et al.'s tree T 2 , and the NCBI taxonomy T for these 14 organisms. Table 2 shows the similarity measures of the reconstructed tree to the NCBI taxonomy for these 14 organisms in Figure 7. Table 2. Similarity of reconstructed tree to the NCBI taxonomy T for the organisms in Figure 7.   In Figure 7, MMAL clearly separates these 14 bacteria into two broad metabolic categories: green sulfur bacteria and green non-sulfur bacteria. Green sulfur bacteria appear with the bacteria atm, cap, cau, det, dly, hau, rrs, sti, and tro. Green non-sulfur bacteria appear with the bacteria cch, cte, cts, paa, and plt. This classification result clearly characterizes the metabolic feature of green sulfur bacteria and green non-sulfur bacteria. Compared with T2, our produced tree T1 is more accurate. The reason is that as we can see from Table 2, the similarity of T1 to T is 0.20, whereas the similarity of T2 to T is 0.12. These results illustrate that our method is capable of classifying the organisms with specific metabolic characteristics, and it can obtain more accurate classification result than Ma et al.'s method for these 14 bacteria. In Figure 7, MMAL clearly separates these 14 bacteria into two broad metabolic categories: green sulfur bacteria and green non-sulfur bacteria. Green sulfur bacteria appear with the bacteria atm, cap, cau, det, dly, hau, rrs, sti, and tro. Green non-sulfur bacteria appear with the bacteria cch, cte, cts, paa, and plt. This classification result clearly characterizes the metabolic feature of green sulfur bacteria and green non-sulfur bacteria. Compared with T 2 , our produced tree T 1 is more accurate. The reason is that as we can see from Table 2, the similarity of T 1 to T is 0.20, whereas the similarity of T 2 to T is 0.12. These results illustrate that our method is capable of classifying the organisms with specific metabolic characteristics, and it can obtain more accurate classification result than Ma et al.'s method for these 14 bacteria.

Prochlorococcus and Synechococcus
Ma et al. selected 12 organisms from Prochlorococcus and Synechococcus and reconstructed phylogenetic tree T 2 for these 12 organisms [1]. For Prochlorococcus and Synechococcus, the similarity of their 16S rRNA sequences is greater than 0.96; however, they have different light-harvesting systems [1]. Prochlorococcus is composed of pmc (Prochlorococcus marinus MIT 9515), pmn (Prochlorococcus marinus NATL2A), pma (Prochlorococcus marinus SS120), pmi (Prochlorococcus marinus MIT 9312), and pmm (Prochlorococcus marinus MED4). Synechococcus is composed of syf (Synech ococcuselongatus PCC 7942), syc (Synechococcuselongatus PCC 6301), syx (Synechococcus sp. WH7803), syw (Synechococcus sp. WH8102), syd (Synechococcus sp. CC9605), syr (Synechococcus sp. RCC307), and sye (Synechococcus sp. CC9902). By aligning 64 common metabolic pathways for these 12 organisms, we reconstructed phylogenetic tree T 1 . Figure 8 shows our produced tree T 1 , and Ma et al.'s tree T 2 , and the NCBI taxonomy T for these 12 organisms. Table 3 shows the similarity measures of the reconstructed tree to the NCBI taxonomy T for these 12 organisms in Figure 8. Table 3. Similarity of reconstructed tree to the NCBI taxonomy T for the organisms in Figure 8.  Table 3 shows the similarity measures of the reconstructed tree to the NCBI taxonomy T for these 12 organisms in Figure 8. Table 3. Similarity of reconstructed tree to the NCBI taxonomy T for the organisms in Figure 8.  Concerning the organism pairs (pma, pmc), (syw, syx), and (pma, syx), the distinction between (pma, pmc) and (pma, syx) and the distinction between (syw, syx) and (pma, syx) are not obvious [1], which make it particularly hard to explicitly classify the species with high sequence similarity by the quantitative analysis [1]. As shown in Figure 8, MMAL successfully separated the organism pairs (pma, pmc) and (pma, syx) and separated the organism pairs (syw, syx) and (pma, syx) and correctly divided these 12 organisms into two broad metabolic categories Prochlorococcus and Synechococcus. This classification result clearly reflects specific metabolic characteristics among organisms and agrees well with the NCBI taxonomy. Moreover, the classification result of MMAL is Concerning the organism pairs (pma, pmc), (syw, syx), and (pma, syx), the distinction between (pma, pmc) and (pma, syx) and the distinction between (syw, syx) and (pma, syx) are not obvious [1], which make it particularly hard to explicitly classify the species with high sequence similarity by the quantitative analysis [1]. As shown in Figure 8, MMAL successfully separated the organism pairs (pma, pmc) and (pma, syx) and separated the organism pairs (syw, syx) and (pma, syx) and correctly divided these 12 organisms into two broad metabolic categories Prochlorococcus and Synechococcus. This classification result clearly reflects specific metabolic characteristics among organisms and agrees well with the NCBI taxonomy. Moreover, the classification result of MMAL is more accurate than that of Ma et al.'s method. This is because, as we can observe from Table 3, the similarity of T 1 to T is 0.14 whereas the similarity of T 2 to T is 0.10. This illustrates that MMAL can reconstruct better phylogeny that is consistent with the metabolic features of organisms and close to the NCBI taxonomy for these 12 organisms.

Methods
For given k (k > 2) organisms, we try to determine the distance between the given organisms by aligning metabolic pathways, and build the phylogenetic tree for these k organisms based on the distance. The MMAL method consists of three main phases: Phase I-construct union graph of the common pathways of the given organisms (as detailed in Section 3.1). Phase II-identify the functional modules in the union graph of the common pathways (as detailed in Section 3.2). Phase III-build the phylogenetic tree through the mapped functional modules (as detailed in Section 3.3). The flowchart of our method is shown in Figure 1. In the following, we will describe the details of each phase in our method.

Phase I-Constructing Union Graph
In order to accurately and efficiently align the common pathways of given k organisms, we try to construct the union graph G U of these common pathways to accomplish the pathway alignments. Next, we describe the construction of the union graph G U in detail.
To start with, we introduce some definitions and notations. A directed graph G p =(V(G p ),E(G p )) is used to denote metabolic pathway P, where V(G p ) is the node set of G p and E(G p ) is the directed edge set of G p , and each node in V(G p ) represents a reaction u i in P, i = 1, 2, . . . ,|V(G p )|. If an output compound of reaction u i is an input compound of reaction u j , there is a directed edge from u i to u j , j = 1, 2, . . . ,|V(G p )|. If both u i and u j are reversible, there is also a directed edge from u j to u i . Similarly, we use directed graph G p =(V(G p ),E(G p )) to denote metabolic pathway P .
In the following, we discuss the computation of similarity S(u,v). We adapt the similarity S(u,v), which was used to compute the similarity between nodes in metabolic pathways in [20,21], to compute the similarity between node u in G p and node v in G p .
where u e is the enzyme catalyzing reaction u, v e is the enzyme catalyzing reaction v, Esim(u e ,v e ) is the similarity between enzyme u e and enzyme v e , u ic , and v ic are the input compounds of u and v respectively, and u oc and v oc are the output compounds of u and v respectively. We use the enzyme similarity score [17,20,21] and compound similarity scores [20,21] to calculate the enzyme similarity Esim(u e ,v e ) and the compound similarities Csim(u ic ,v ic ) and Csim(u oc ,v oc ) respectively. Specifically, the EC identifier of an enzyme consists of four digits. Esim(u e ,v e ) equals 1 if all the four digits of the EC identifier of two enzymes are the same, Esim(u e ,v e ) equals 0.75 if the first three digits are the same, Esim(u e ,v e ) equals 0.5 if the first two digits are the same, Esim(u e ,v e ) equals 0.25 if only the first digit is the same, and Esim(u e ,v e ) equals 0 if the first digit is different [17,20,21]. Csim(u ic ,v ic ) is the average compound similarity of u ic and v ic , and Csim(u oc ,v oc ) is the average compound similarity of u oc and v oc . For instance, if C 1 and C 2 are the input compounds of u ic , and C 3 and C 4 are the input compounds of v ic , then Csim(u ic ,v ic ) = {sim(C 1 , C 3 ) + sim(C 1 , C 4 ) + sim(C 2 , C 3 ) + sim(C 2 , C 4 )}/4, where sim(A, B) is the compound similarity between compounds A and B. Similarly, we can compute Csim(u oc ,v oc ). The similarity scores of compounds are obtained from [22]. Parameters α, β and γ are used to control the balance between the weights of Esim(u e ,v e ), Csim(u ic ,v ic ) and Csim(u oc ,v oc ) with the constraint α + β + γ = 1. Here, we use α = 0.4, β = 0.3 and γ = 0.3.
Next, we discuss how to create the union graph G u of two pathways G p and G p . For edge-weighted bipartite graph G b , we regard V(G p ) and V(G p ) as two partitions of G b , respectively. Each edge of G b corresponds to a one-to-one node mapping between G p and G p . According to the similarity S(u, v) between nodes u and v, each edge connecting u and v in G b is assigned to a weight S(u, v).
Then, we employ the maximum-weight bipartite matching algorithm MWBM [23] to extract one-to-one node mappings between G p and G p . Specifically, each time MWBM selects the edge of G b with the maximal weight and extends the resulting set with the corresponding node mapping for this edge. MWBM stops when there are no more edges to be selected. We use each one-to-one node mapping (u, v) in the resulting set of node mappings to produce a composite node c_d = {(u,v)| u∈V(G p ), v∈V(G p )}. Finally, we use these composite nodes to create the union graph G u of G p and G p . Figure 9 shows an example of a union graph created for a pair of sample pathways. Note that there is no edge in the union graph. After constructing the union graph G u of Gp and Gp′, we introduce how to compute the homological similarity between composite nodes in G u and build the homological similarity matrix of G u . Let the node set of G u be V(G u )= {c_d1,c_d2,…,c_di,…,c_dn}, where c_di is a composite node in G u , i = 1, 2,…, n, and n = max{|V(Gp)|,|V(Gp′)|}. We can compute the homological similarity Sm(c_d1,c_d2) between composite nodes c_d1 and c_d2 by the following equation: where u1 and v1 are the nodes in mapping (u1, v1), u2 and v2 are the nodes in mapping (u2, v2), u1,u2V(Gp), v1,v2V(Gp′).
Subsequently, we construct an n × n homological similarity matrix H u by computing homological similarity between the composite nodes in G u , where element H u [c_di,c_dj]∈[0,1] is the homological similarity between composite nodes c_di and c_dj. Now, we elaborate on how to construct the union graph G U and obtain the homological similarity matrix of G U . Let Gi(Vi, Ei) be the common pathway of a given organism, i = 1,2,..,k, and G = {G1(V1, E1), G2(V2, E2), …,Gk(Vk, Ek)} be the set of the common pathways of given k organisms.
We first select the metabolic pathway Gmax from G, where the number of nodes of Gmax is maximal in G. Second, we iteratively use Gmax and each metabolic pathway Gi∈G to create a union graph Gi u as well as its corresponding homological similarity matrix Hi u , where the node set of Gi u is V(Gi u ) = {c_d1i, c_d2i,…,c_dni}, i= 1,…,k, and n=|V(Gmax)|. Every node in a composite node is mapped to a node of Gmax, and therefore any two nodes mapped to the same node of Gmax also construct a one-to-one node mapping. We thus build k-1 union graphs and the mappings between nodes in each pathway by using Gmax.
To this end, we obtain k−1 union graphs and k−1 homological similarity matrices of these union Figure 9. Example of union graph. The upper path is G p and the lower path is G p . u i →u j denotes that the output compound of u i is the input compound of u j , v i →v j denotes that the output compound of v i is the input compound of v j , i, j = 1, 2, 3, 4, 5, 6. V(G p ) and V(G p ) are regarded as two partitions of bipartite graph G b . Solid lines denote the edges of G b . Each edge of G b corresponds to a one-to-one node mapping between G p and G p . The mappings in rectangles are the node mappings selected by the MWBM algorithm. Each selected node mapping constructs a composite node c_d i , i = 1, 2, 3, 4, 5, 6. A union graph of G p and G p is constructed by these six composite nodes.
After constructing the union graph G u of G p and G p , we introduce how to compute the homological similarity between composite nodes in G u and build the homological similarity matrix of G u . Let the node set of G u be V(G u ) = {c_d 1 , c_d 2 , . . . , c_d i , . . . , c_d n }, where c_d i is a composite node in G u , i = 1, 2, . . . , n, and n = max{|V(G p )|,|V(G p )|}. We can compute the homological similarity S m (c_d 1 ,c_d 2 ) between composite nodes c_d 1 and c_d 2 by the following equation: where u 1 and v 1 are the nodes in mapping (u 1 , v 1 ), u 2 and v 2 are the nodes in mapping (u 2 , v 2 ), u 1 ,u 2 ∈ V(G p ), v 1 ,v 2 ∈ V(G p ). Subsequently, we construct an n × n homological similarity matrix H u by computing homological similarity between the composite nodes in G u , where element H u [c_d i ,c_d j ] ∈ [0,1] is the homological similarity between composite nodes c_d i and c_d j . Now, we elaborate on how to construct the union graph G U and obtain the homological similarity matrix of G U . Let G i (V i , E i ) be the common pathway of a given organism, i = 1, 2, . . . , k, and G = {G 1 (V 1 , E 1 ), G 2 (V 2 , E 2 ), . . . , G k (V k , E k )} be the set of the common pathways of given k organisms.
We first select the metabolic pathway G max from G, where the number of nodes of G max is maximal in G. Second, we iteratively use G max and each metabolic pathway G i ∈ G to create a union graph G i u as well as its corresponding homological similarity matrix H i u , where the node set of G i u is V(G i u ) = {c_d 1i , c_d 2i , . . . , c_d ni }, i= 1, . . . , k, and n = |V(G max )|. Every node in a composite node is mapped to a node of G max , and therefore any two nodes mapped to the same node of G max also construct a one-to-one node mapping. We thus build k-1 union graphs and the mappings between nodes in each pathway by using G max .
To this end, we obtain k − 1 union graphs and k − 1 homological similarity matrices of these union graphs. Since each node of pathway G i ∈ G is mapped to a node of G max in each composite node in the union graphs, these k − 1 union graphs can be merged into a resulting union graph by merging the composite nodes which include the same node of G max . In this way, we finally merge these k − 1 union graphs into a resulting union graph G U , where the node set of . . . ., ∪ k i=1 c _d ni , and the homological similarity matrix of G U is H u i . Aligning metabolic pathways is to find the node mappings between pathways. Any two nodes (reactions) in a composite node in G U constitute a one-to-one node mapping between any two pathways in G. We can extract all one-to-one node mappings among the common pathways of these k organisms from G U . We thus transform the alignment of multiple pathways into constructing the resulting union graph G U .

Phase II-Identifying Functionally Conserved Modules
The goal of this phase is to obtain the functional modules and their mappings in the common pathways of given k organisms. A functional module is a sub-network in metabolic pathway, which performs a certain function with specific topology. We can cluster functionally similar composite nodes in the resulting union graph to obtain such a module. The induced sub-graph of the nodes of a resulting cluster in the underlying pathway is a functionally conserved module in the pathway.
In this work, we use affinity propagation (AP) algorithm [24] to cluster the functionally similar composite nodes in the resulting union graph based on the homological similarity matrix produced in Phase I. We use the homological similarity matrix H U as the input matrix to run the AP algorithm, and obtain a set of resulting clusters for the composite nodes of G U . Each of the resulting clusters is a set U M of composite nodes. For a metabolic pathway, each induced sub-graph of the nodes of a resulting cluster in the pathway is a functionally conserved module. For example, for the resulting union graph G U built in phase I, for simplicity, we assume that a resulting . . . ., ∪ k i=1 c _d ni by using the AP algorithm based on the homological similarity matrix We refer the readers to [24] for details on using the AP algorithm.
Meanwhile, since any two nodes of M i and M j in a composite node in U M construct a one-to-one node mapping between M i and M j , M i , and M j construct a one-to-one module mapping, where i, j = 1, 2, . . . , k, and i = j. That is, if all nodes of two functional modules are included in the same resulting cluster, these two modules are mapped together. To this end, by clustering the composite nodes with similar functions in the resulting union graph G U , we identify the conserved modules and their mappings in the common pathways of k organisms.

Phase III-Building Phylogenetic Tree
Now, we describe how to create the distance matrix of k organisms by computing the similarity between the mapped functional modules in the pathways and build the phylogenetic tree based on the distance matrix.
For two metabolic pathways, the Largest Common Connected Sub-graph (LCCS) is the largest connected sub-graph of the first pathway that is isomorphic to a sub-graph of the second pathway [20]. The larger and denser connected sub-graphs are biologically more valuable. Larger numbers of nodes and edges of the LCCS in the mapped modules indicate that the LCCS in the mapped modules is larger and denser. Thus, we can use the number of nodes and edges of the LCCS in the largest mapped functional modules to measure the similarity between two pathways.
Next, we elaborate on how to measure the similarity between two pathways by comparing the similarity between the mapped functional modules in the pathways. For the pathways G i (V i ,E i ) and G j (V j ,E j ), let Mmax i and Mmax j be the mapped functional modules with the maximal number of nodes in G i (V i ,E i ) and G j (V j ,E j ) respectively. We define M_lccs i as the LCCS between G i (V i ,E i ) and G j (V j ,E j ) in Mmax i , where the node and edge sets of M_lccs i are V_lccs i and E_lccs i respectively. Similarly, we define M_lccs j as the LCCS between G i (V i ,E i ) and G j (V j ,E j ) in Mmax j , where the node and edge sets of M_lccs j are V_lccs j and E_lccs j respectively. The similarity score SimScore(G i ,G j ) between G i (V i ,E i ) and G j (V j ,E j ) is computed by equation (3).
In the following, we introduce how to compute the similarity between two organisms by the similarity between their common pathways. The given k organisms are denoted by O 1 , O 2 , . . . , and O k , and p common pathways of O i are represented by G i1 , G i2 , . . . , G ip , i = 1, 2, . . . , k. The similarity between O i and O j is computed by the following equation: SimScore(G it , G jt ) p , i, j = 1, 2, . . . . ., k Hence, by computing the similarity between any two organisms, we obtain an k × k similarity matrix BSim for given k organisms. BSim is a symmetric matrix. BSim[i,j] ∈ [0,1] is the similarity between O i and O j , i = 1, 2, . . . , k, and j = 1, 2, . . . , k. All elements in the diagonal of BSim are 1.
After obtaining similarity matrix BSim, we can compute the distance matrix D for k organisms, where D[i,j] = 1 − BSim[i,j], D[i,j] is the distance between O i and O j , i = 1, 2, . . . , k, and j = 1, 2, . . . , k. Based on the distance matrix D, we can build a phylogenetic tree for these organisms using the software tool PHYLIP [14]. Finally, we can show this tree by the visualization tool TreeView [15].

Conclusions
Although a number of phylogenetic methods have been developed, few efforts were made to provide a unified phylogenetic framework that sufficiently reflects the metabolic features of organisms. We thus propose a three-phase phylogenetic method MMAL that can characterize the metabolic features of organisms by aligning multiple metabolic pathways using functional module mapping.
MMAL distinguishes from other phylogenetic inference methods using metabolic pathway data in the following aspects. First, we transform the alignment of the metabolic pathways among multiple organisms into constructing the union graph of the pathways. Second, we identify the functional modules in the pathways and build the mappings between these modules simultaneously by clustering the nodes in the union graph. Finally, we compute the similarity between metabolic pathways by comparing the mapped functional modules in the pathways and construct the phylogenetic relationships among the organisms according to the similarity.
We have shown the effectiveness of MMAL by comparing the resulting trees with the NCBI taxonomy. The experimental results demonstrate that the use of functional module mapping enables MMAL to categorize correctly organisms into main categories with specific metabolic characteristics and the classification results of MMAL are consistent with the metabolic features of organisms. Traditional phylogenetic methods can infer evolutionary relationships, whereas our method has the capacity to explore in-depth metabolic analysis for the phyletic reconstruction, which can add insights into traditional phyletic reconstruction. The results also show that MMAL is capable of reconstructing better phylogenies in comparison to existing classification method using metabolic pathway data. It is evident that investigating functional module mapping helps to construct better phylogenies. MMAL is thus a useful method for the study of phylogenetic analysis.
The metabolic pathways are highly selected by diverse local environments, and therefore, some pathways of the organisms beyond a certain distance may be similar in sequence, which makes it difficult to distinguish the distant species with similar pathways by comparing metabolic pathways. Combining genome features, such as ribosomal RNAs and oligonucleotide compositions, to our phylogenetic framework offers a possible solution for this problem, and showing the history of the appearances of the functional modules can further improve the inference capability of MMAL, which would be of interest in our future study. Additionally, in this work, we have reconstructed phylogenies by aligning pathways based on one-to-one reaction mappings, and it would be interesting to reconstruct phylogenies by aligning pathways based on one-to-many reaction mappings.
Moreover, the study of phylogenetic relationships is not limited to metabolic networks but may also be applied to other biological networks such as protein interaction networks or transcriptional regulation networks. In the future, in order to understand the functional relations between different biological networks for different organisms, it is also interesting to quantitatively and qualitatively analyze the reconstruction of phylogeny from aligning different biological networks and explore the phenotypic differences between species from such alignments.