Comparative genomics provides a powerful tool for investigating evolutionary changes of genes, pathways and other characteristics along various lineages [
65]. Based on comparative genomic approaches, we may better understand the use of metalloproteins and metal-dependent processes in living organisms. However, identification and quantification of complete metalloproteomes for most metals is currently impossible [
66,
67]. Even so, analysis of the majority of known metalloproteins in genomic databases can still greatly improve our understanding of the utilization and function of metals and their variations across species during evolution. In addition, analysis of genes involved in metal uptake, homeostasis, and metal-containing cofactor biosynthesis may assist in the identification of metal utilization traits (i.e., the ability to use a certain metal) [
67,
68,
69]. A general procedure for comparative genomics of metal utilization is shown in
Figure 1. In the following sections, we mainly focus on metalloproteins and discuss recent progress on comparative genomic analyses of metalloproteins for several important metals.
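The bookkeeping step at the core of such a procedure can be sketched as follows. This is a minimal illustration only, assuming that homology-search hits have already been assigned to metalloprotein families; the genome names, the toy hit list, and the helper functions `count_metalloproteins` and `has_utilization_trait` are hypothetical and are not taken from any of the cited studies. In practice, the utilization trait is also inferred from transporter and cofactor biosynthesis genes, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical input: one (genome, metal, protein_family) tuple per predicted gene.
hits = [
    ("genome_A", "Cu", "COX I"),
    ("genome_A", "Cu", "COX II"),
    ("genome_A", "Mo", "DMSOR"),
    ("genome_B", "Mo", "DMSOR"),
    ("genome_B", "Mo", "DMSOR"),
    ("genome_C", "Cu", "MCO"),
]

def count_metalloproteins(hits):
    """Return {genome: {metal: number of predicted metalloprotein genes}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for genome, metal, _family in hits:
        counts[genome][metal] += 1
    return {genome: dict(per_metal) for genome, per_metal in counts.items()}

def has_utilization_trait(counts, genome, metal):
    """Simplifying assumption: one or more hits implies that the metal is used."""
    return counts.get(genome, {}).get(metal, 0) > 0

counts = count_metalloproteins(hits)
```

Here `counts["genome_A"]["Cu"]` gives the size of the predicted cuproproteome of the toy genome, and `has_utilization_trait` turns the same table into a presence/absence profile that can be compared across lineages.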
3.1. Zinc and Iron
Zn and Fe are the two most commonly used trace metals in all organisms. A great number of proteins have been characterized or predicted to use one of the two metals (see references [
25,
26,
30] and several web resources described above for a compiled list of Zn- or Fe-binding protein families). However, because of the widespread and complex use of the two metals, comparative analyses of the occurrence and evolutionary trends of their utilization remain very challenging [
69].
Zn contributes to numerous biological processes in living systems and is a key component of hundreds of structural proteins, enzymes, transcription factors, and ribosomal proteins [
14]. Previously, Zn proteomes (including both Zn-dependent proteins and some other proteins involved in Zn transport and homeostasis) have been predicted in a limited number of prokaryotic and eukaryotic organisms based on Zn-binding domains and patterns extracted from various databases [
25,
28,
29,
70]. In general, the number of Zn-binding proteins is positively correlated with the proteome size of an organism. Eukaryotes had a higher proportion (8-10%) of Zn-binding proteins than prokaryotes (5-6%). The majority of prokaryotic Zn proteins perform enzymatic catalysis (especially hydrolases), whereas the eukaryotic Zn proteome is mainly involved in both catalysis and transcriptional regulation, suggesting that Zn-binding transcription factors have evolved to regulate more complex and diverse processes in higher organisms [
28,
29]. Another study analyzed two major groups of Zn proteins (Zn finger-containing proteins and Zn hydrolytic enzymes) in more than 800 organisms and revealed correlated changes in the two groups during evolution that were related to environmental change [
71]. In recent years, comparative genomic approaches have been frequently used to investigate the distribution and diversification of certain Zn-dependent protein families, particularly Zn finger-containing transcription factors such as PRDM, Zic, and several other C2H2-Zn finger protein families, which provides a basis for further research on the origin, function, and evolutionary features of these proteins [
72,
73,
74,
75]. With an explosion in genomic resources and the rapidly expanding number of bioinformatic tools in the past decade, a more comprehensive analysis of Zn-dependent proteomes in all kingdoms of life is urgently needed.
Fe is the most abundant transition metal in cells and has a fundamental role in many metabolic processes, such as oxygen transport, electron transfer, nucleic acid synthesis, growth, and many important redox reactions [
7]. Besides Fe ions, many proteins may use Fe in the form of heme or Fe-S clusters [
7,
11]. Due to the complexity and diversity of Fe utilization, it is very hard to identify complete Fe-dependent proteomes; however, several studies have sought to characterize different groups of Fe-dependent proteins in various organisms. An early bioinformatic study investigated the occurrence of putative non-heme Fe-binding proteins in a small number of prokaryotes and eukaryotes and suggested that extant organisms have inherited the majority of their Fe proteome from the last universal common ancestor [
26]. In contrast to the Zn proteome, the Fe proteome constituted a higher fraction of the proteome in archaea (7.1% on average) than in bacteria (3.9%) and eukaryotes (1.1%). Another computational study compared the distribution of Fe-S proteins in more than 400 prokaryotic organisms with different lifestyles and found a strong relationship between environmental dioxygen levels and the usage of different Fe-S clusters [
31]. Very recently, the complete human Fe proteome was systematically analyzed based on different types of Fe-containing cofactors [
30]. About 2% of human genes encode Fe proteins (35%, 48%, and 17% for individual Fe ions, heme, and Fe-S clusters, respectively). Interestingly, genes encoding Fe proteins (especially Fe-S proteins) appeared to be more commonly related to pathologies than all other human genes, suggesting specific features of the physiological role of Fe. In addition, comparative genomic analyses were carried out for investigating Fe metabolism and homeostasis mechanisms in different organisms, such as cytosolic Fe-S cluster assembly machinery [
76], heme biosynthesis and uptake machinery [
77], and several other protein families involved in Fe transport and storage [
78], which provide detailed insights into the composition and evolution of the Fe metabolic network.
3.2. Copper
Cu is an important activator for several key enzymes participating in fundamental biological processes such as respiration, photosynthesis, and oxidative stress responses. A number of Cu-dependent proteins (cuproproteins) have been characterized in both prokaryotes and eukaryotes. The currently known cuproprotein families are shown in
Table 3 (proteins involved in Cu transport and homeostasis are not included).
Comparative genomic studies have been previously carried out to analyze intrinsic features of different cuproproteins or cuproproteomes (the whole set of cuproproteins) in various organisms [
27,
79,
80,
81,
82]. Two early studies combined known Cu-binding domains and Cu-binding patterns to explore the occurrence of Cu proteins (including both cuproproteins and some other proteins involved in Cu transport and homeostasis) in a limited number of sequenced genomes [
27,
79]. The proportion of Cu-binding proteins was small when compared to that of Zn or non-heme Fe proteins. Eukaryotes have expanded the Cu proteome inherited from the last common ancestor of all organisms by evolving new Cu domains and reusing old domains for novel functions.
Some other studies provide more detailed information about cuproproteins in the three domains of life [
67,
80,
81]. Cytochrome c oxidase subunits I (COX I) and II (COX II) are the most widely distributed cuproproteins in prokaryotes (
Figure 2A). Multicopper oxidase (MCO), Cu-Zn superoxide dismutase (Cu-Zn SOD) and plastocyanin families were also found in many prokaryotes, whereas the occurrence of tyrosinase, nitrosocyanin, Cu amine oxidase, and particulate methane monooxygenase seems to be quite limited. Apart from cuproproteins that were exclusively present in individual kingdoms (e.g., azurin in bacteria and rusticyanin in archaea), significant differences in the distribution of most cuproproteins were observed between archaea and bacteria (
Figure 2A). On the other hand, only half of prokaryotic cuproprotein families could be found in eukaryotic organisms, and the latter have the capacity to evolve new cuproproteins such as galactose oxidase, hemocyanin, and plantacyanin. MCO, COX I, COX II, and Cu-Zn SOD were the most abundant cuproprotein families in eukaryotes, while the distribution of some cuproproteins appeared to be phylum-specific, e.g., hemocyanin in arthropods and plantacyanin in land plants. A recent comparative analysis of the presence of hemocyanin in different myriapod species suggests that these proteins have divergent evolutionary patterns in different myriapod taxa [
82]. Further analysis of prokaryotic cuproproteomes revealed that larger cuproproteomes were mainly present in Alphaproteobacteria, Betaproteobacteria and Euryarchaeota/Halobacteriales. The largest bacterial and archaeal cuproproteomes reported to date were detected in several
Sinorhizobium species (
S. medicae and
S. meliloti, 22 cuproprotein genes) and
Haloarcula marismortui (25 cuproprotein genes), respectively [
81]. In eukaryotes, land plants possessed the largest cuproproteomes (especially
Oryza sativa, containing 78 cuproprotein genes). It is interesting that larger cuproproteomes were mainly found in organisms living in oxygen-rich environments, which is consistent with the idea that proteins evolved to use Cu following the oxygenation of the Earth [
80,
81,
82]. Because previous studies relied only on a limited number of organisms, future research is needed to update the distribution and evolution of cuproproteins/cuproproteomes using a much wider range of sequenced genomes belonging to different clades.
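Family-level distributions of the kind summarized above reduce to a simple occurrence calculation. The sketch below is illustrative only: the genome identifiers and family assignments are invented toy data, and `family_occurrence` is a hypothetical helper rather than a function from any published pipeline. It reports, for each cuproprotein family, the fraction of Cu-utilizing genomes (here taken to be those with at least one cuproprotein family) in which the family occurs.

```python
def family_occurrence(genome_families):
    """Fraction of metal-utilizing genomes in which each protein family occurs.

    genome_families maps a genome name to the list of families detected in it.
    Genomes without any detected family are treated as non-utilizing and skipped.
    """
    users = {g: fams for g, fams in genome_families.items() if fams}
    seen_in = {}
    for fams in users.values():
        for fam in set(fams):  # count each family once per genome
            seen_in[fam] = seen_in.get(fam, 0) + 1
    return {fam: n / len(users) for fam, n in seen_in.items()}

# Toy data, not real annotations.
toy_genomes = {
    "bac_1": ["COX I", "COX II", "MCO"],
    "bac_2": ["COX I", "COX II"],
    "bac_3": ["COX I", "plastocyanin"],
    "bac_4": [],  # no cuproproteins detected
}

occurrence = family_occurrence(toy_genomes)
```

Restricting the denominator to utilizing genomes is a design choice; using all sequenced genomes instead would answer a different question, namely how common the family is overall.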
3.3. Molybdenum and Tungsten
Mo is required for the activity of a number of molybdoproteins that catalyze diverse reactions in the metabolism of carbon, nitrogen, and sulfur compounds [
9]. With the exception of Fe-Mo-containing nitrogenase, Mo needs to be bound to a specific pyranopterin moiety to form Moco, the cofactor located at the active site of all other molybdoproteins [
84]. Some prokaryotes (such as hyperthermophilic archaea) use W to replace Mo, which is bound to the same pyranopterin to form tungstoproteins [
85]. A list of known molybdoprotein and tungstoprotein families is shown in
Table 3. Each family may contain a variety of enzymes [
86,
87]. It has been suggested that MOSC (Moco sulfurase C-terminal domain)-containing proteins are new members of the sulfite oxidase (SO) family due to similar structures for Mo-binding domains [
86]; however, the lack of significant sequence similarity between them may challenge such an alternative classification approach [
88].
Several comparative genomic studies have been conducted to explore the distribution and evolution of the Mo utilization trait and molybdoproteins in all domains of life, giving preliminary indications of how this transition element is used by different organisms [
81,
89,
90]. Very recently, the occurrence of all known molybdoprotein families in nearly 6000 sequenced prokaryotes and eukaryotes was analyzed, which presents a much more comprehensive view of the evolutionary trajectories of molybdoproteins in nature [
83]. Dimethylsulfoxide reductase (DMSOR) is the most widespread molybdoprotein family in both archaea and bacteria and was present in more than 90% of Mo-utilizing organisms (
Figure 2B). MOSC-containing protein, xanthine oxidase (XO), and SO families were also widespread in the majority of Mo-utilizing bacteria; however, most sequenced archaea do not have MOSC-containing protein and XO families. Several new domain fusions were detected for different members of DMSOR, SO, and XO in prokaryotes, providing valuable information for the inference of protein interactions and functions. The Fe-Mo-containing nitrogenase was detected only in a small number of bacteria and methanogenic archaea. On the other hand, MOSC-containing protein (also named mARC), SO, and XO are the three eukaryotic molybdoprotein families, all of which were detected in almost all organisms that use Mo, indicating that they are all critical for maintaining the function of Mo in this kingdom. With regard to molybdoproteomes, many organisms in Actinobacteria and several subclasses of Proteobacteria were molybdoprotein-rich organisms (>20 molybdoprotein genes). To date, the largest molybdoproteome in bacteria was found in
Gordonibacter pamelaeae 7-10-1-b (73 molybdoprotein genes, mostly belonging to the DMSOR family) [
83]. In contrast, very few molybdoprotein-rich organisms were observed in archaea and eukaryotes. Further examination of the relationship between environmental factors and molybdoproteins revealed that the majority of molybdoprotein families and large molybdoproteomes are more frequently present in aerobic organisms, implying that oxygen has played a crucial role in the evolution of molybdoprotein genes [
83].
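The comparison between oxygen requirement and molybdoproteome size can be expressed as a thresholded count per group. The following sketch uses the ">20 molybdoprotein genes" definition of a molybdoprotein-rich organism given above; everything else, including the organism labels, the gene counts, and the helper `rich_fraction_by_group`, is a hypothetical example rather than data or code from the cited study.

```python
RICH_THRESHOLD = 20  # "molybdoprotein-rich" is defined in the text as >20 genes

def rich_fraction_by_group(organisms, threshold=RICH_THRESHOLD):
    """Return {group: fraction of organisms whose gene count exceeds threshold}.

    organisms is an iterable of (name, group, n_molybdoprotein_genes) tuples.
    """
    totals, rich = {}, {}
    for _name, group, n_genes in organisms:
        totals[group] = totals.get(group, 0) + 1
        if n_genes > threshold:
            rich[group] = rich.get(group, 0) + 1
    return {group: rich.get(group, 0) / totals[group] for group in totals}

# Invented toy records: (organism, oxygen requirement, molybdoprotein genes).
toy_organisms = [
    ("org_1", "aerobic", 30),
    ("org_2", "aerobic", 12),
    ("org_3", "aerobic", 25),
    ("org_4", "anaerobic", 5),
    ("org_5", "anaerobic", 21),
    ("org_6", "anaerobic", 3),
]

rich_fractions = rich_fraction_by_group(toy_organisms)
```

A real analysis would also need to control for genome size and phylogenetic relatedness before attributing a difference between groups to oxygen.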
Although it is still very difficult to distinguish between Mo and W utilization in different molybdoproteins because of the very similar physicochemical and functional properties of the two metals [
91], several attempts have been made to distinguish tungstoproteins within molybdoprotein families based on recent advances in tungstoprotein research [
83,
92]. The currently known tungstoproteins include nearly all enzymes of the aldehyde:ferredoxin oxidoreductase (AOR) family and certain enzymes of the DMSOR family, including formate dehydrogenase and acetylene hydratase from strictly anaerobic bacteria and formylmethanofuran dehydrogenase from methanogenic archaea [
92,
93,
94]. Preliminary analysis of tungstoproteins in prokaryotes revealed that AOR could be detected in the majority of W-utilizing prokaryotes while W-containing DMSOR proteins were present in most W-utilizing bacteria and a small number of archaea (mainly methanogens) (
Figure 2C) [
83]. These exploratory studies may provide the first global view of W utilization in prokaryotes.
3.4. Nickel and Cobalt
Ni is an essential cofactor for several enzymes that play critical roles in energy and nitrogen metabolism [
15,
95]. Some other Ni-containing proteins, such as glyoxalase I and acireductone dioxygenase, are not strictly Ni-dependent and may bind alternative metals in different organisms or even in the same organism [
95]. Co is mainly used as a key component of cobalamin (also called vitamin B12), which encompasses a group of closely related corrinoid compounds found in enzymes that mediate methyl transfer reactions, isomerase rearrangements, dehalogenation, and some other processes [
96,
97,
98]. Moreover, Co is also found in several non-corrin Co-containing enzymes in certain organisms, although the same enzymes may use other metals (such as Zn and Fe) in place of Co in many other organisms [
99]. In this review, we only discuss strictly Ni-dependent and B12-binding protein families, which are shown in
Table 3.
To our knowledge, only a few comparative genomic studies have been conducted on Ni- or Co-dependent metalloproteins in a wide range of organisms from the three domains of life [
81,
100,
101]. As prokaryotes use similar import systems for Ni and Co uptake [
102,
103], the utilization of the two trace metals could be highly correlated, which was supported by the observation that most prokaryotic organisms use both metals [
81]. In bacteria, urease and methionine synthase (MetH) were the most frequently used Ni- and Co-dependent protein families, respectively (
Figure 2D,E). However, they seemed to be rare or even absent in archaea, in which Ni-Fe hydrogenase and B12-dependent ribonucleotide reductase class II (RNR II) were the most commonly used Ni and Co enzymes. Except for a small number of organisms (such as deltaproteobacteria and Methanosarcina species), most prokaryotes possessed no more than 5 Ni- and/or Co-dependent metalloprotein genes. The largest Ni-dependent proteome was previously reported in
Deltaproteobacterium MLMS-1 (16 Ni-binding protein genes, half were Ni-Fe hydrogenases) and the largest B12-dependent proteome in
Dehalococcoides sp. CBDB1 (35 B12-dependent protein genes, 32 were reductive dehalogenase CprA proteins) [
81,
100]. Another recent study analyzed the distribution of the vitamin B12 production pathway and a variety of B12-dependent enzymes in over 11,000 bacterial species, providing important information on B12 utilization and its evolution across a much wider range of prokaryotes [
104]. Approximately 86% of the examined bacteria contained B12-dependent enzyme families, but most of these organisms lacked the ability to synthesize B12 and have to obtain this cofactor from exogenous sources. Proteobacteria and Bacteroidetes appeared to have larger numbers of B12-dependent enzymes than other bacteria.
In contrast to prokaryotes, the utilization of Ni and Co is quite restricted in eukaryotes, and very few organisms utilize both metals [
81]. Only one Ni-dependent enzyme (urease) and three B12-dependent enzymes (methylmalonyl-CoA mutase, RNR II, and MetH) have been reported in this kingdom (
Figure 2D,E). Urease and MetH were present in all Ni- and Co-utilizing eukaryotes, respectively. Analysis of Ni- and Co-dependent metalloproteomes did not reveal organisms that contained many of these proteins. Interestingly, compared to the majority of unicellular organisms that lack B12-binding proteins,
Dictyostelium discoideum and several
Phytophthora species contained all three known eukaryotic B12-dependent enzymes, implying a more important role of the B12 cofactor in these organisms [
81]. In the future, it is necessary to perform more comprehensive surveys on the two metals using newly generated genomic resources.
3.5. Selenium
Se is a metalloid trace element, which is essential for normal physiological functions in humans, animals, and many other organisms [
5,
105]. It mainly occurs in the form of Sec, which is a key component of selenoproteins involved in numerous biological processes, such as redox homeostasis, thyroid hormone metabolism, anti-inflammatory actions, and reproduction [
106,
107]. The mechanism of Sec biosynthesis and its incorporation into proteins has been elucidated in both prokaryotes and eukaryotes [
21,
22]. So far, a significant number of selenoproteins have been reported in various organisms from bacteria to mammals, many of which were identified using reliable bioinformatic approaches [
50,
51,
52,
57,
108,
109].
Table 4 lists the majority of known and putative selenoproteins. Although the functions of most selenoproteins are not known and could only be inferred by sequence homology, it is very likely that most of them play important roles in antioxidation and detoxification [
106].
Previously, several computational and comparative genomic analyses have been carried out to investigate the distribution and evolution of Se metabolic pathways and selenoproteins in a large number of prokaryotic organisms and selected environmental samples [
23,
81,
110,
111,
112,
113,
114,
115,
116], which provide detailed information on how this element is selectively used by proteins and organisms from different kingdoms. An early work analyzed the Sec biosynthetic pathway and known selenoproteins in several hundred bacterial and archaeal genomes, and found that only one-fourth of the examined organisms have selenoprotein genes. Most selenoprotein-rich organisms belong to Deltaproteobacteria and Clostridia [
81]. Recently, a much more extensive evaluation has been conducted on Se metabolism and selenoproteins in bacteria by analyzing more than 5200 genomes, which generated the largest map of Se utilization in this kingdom [
114]. More than 60 selenoprotein families/subfamilies could be detected in bacteria. Formate dehydrogenase alpha subunit and selenophosphate synthetase were the most widespread bacterial selenoprotein families (
Figure 3). A new selenoprotein-rich phylum Synergistetes and additional selenoprotein-rich organisms have also been identified. The largest bacterial selenoproteome was found in
Syntrophobacter fumaroxidans, a syntrophic propionate-oxidizing deltaproteobacterium containing 39 selenoprotein genes [
81,
114]. Although both aerobic and anaerobic organisms could use Sec, the fact that most selenoprotein-rich organisms (78.3%) are obligate or facultative anaerobes suggests a somewhat stronger correlation between the evolution of selenoprotein genes and low oxygen levels [
114].
In archaea, selenoprotein genes were only detected in a small number of organisms belonging to three lineages: Methanococcales, Methanopyrales, and Lokiarchaeota [
81,
116,
117,
118]. In contrast to bacteria, which contain a variety of known or predicted selenoprotein families, only nine selenoprotein families have been discovered in archaea, most of which are involved in methanogenesis [
117]. The archaeal selenoproteomes show a relatively narrow size distribution (7-12 selenoproteins). Lokiarchaeota, a novel archaeal phylum and the closest archaeal relative to eukaryotes, was reported to have the largest archaeal selenoproteome (at least 12 selenoprotein genes) [
118]. Further analysis of Lokiarchaeota selenoprotein genes suggests that this archaeon may serve as an intermediate form between the typical archaeal and eukaryotic Sec biosynthesis systems, providing new clues for the origin and evolution of the Sec utilization trait.
More efforts have been made to explore the distribution and evolution of selenoproteins in eukaryotes [
81,
119,
120,
121,
122,
123,
124]. Several early comparative studies demonstrated that many selenoprotein families, such as glutathione peroxidases (GPXs), thioredoxin reductases (TXNRDs), and selenophosphate synthase 2 (SEPHS2) are shared between single-cell eukaryotes (such as green algae and many protists) and vertebrates, implying that the majority of eukaryotic selenoproteins originated from the ancestors of current eukaryotes and have been preserved throughout evolution [
81,
119,
120,
121]. However, massive and independent selenoprotein gene loss events (either loss of selenoprotein genes or replacement of Sec with a Cys residue) were observed in different lineages such as fungi, land plants, nematodes, and some other organisms [
119,
120]. The size of eukaryotic selenoproteomes varies greatly between species. With the exception of mammals, aquatic organisms (such as algae and fish) generally have larger selenoproteomes than terrestrial ones (such as insects and nematodes). Although parallel loss of Sec utilization was observed in different groups of algae [
122], the largest eukaryotic selenoproteome was described in the harmful pelagophyte alga
Aureococcus anophagefferens (containing at least 59 selenoprotein genes) [
125]. In animals, amphioxus was found to have the most abundant and diverse selenoproteins (containing 40 selenoprotein genes) [
121]. Further investigation of selenoproteins in sequenced vertebrates defined the ancestral vertebrate (28 selenoproteins) and mammalian (25 selenoproteins) selenoproteomes, and reconstructed their evolutionary history [
120]. For example, mammalian TXNRD1 and TXNRD3 were found to have evolved from an ancestral glutaredoxin-domain-containing enzyme, and selenoprotein V and GPX6 appeared at the root of placental mammals by duplications of selenoprotein W and GPX3, respectively. By evaluating the potential forces for selenoprotein gain or loss and for substitutions between Sec and Cys residues in different vertebrate clades, it was proposed that the strength of natural selection on selenoprotein genes is distinct between land vertebrates and teleost fishes, suggesting that Se availability has shaped the evolution of vertebrates [
124]. In addition, selenoprotein P (SELENOP), the only human selenoprotein with multiple Sec residues, has been suggested to function as a genetic marker of Se utilization in animals (i.e., its number of Sec residues correlates with the selenoproteome size) [
126]. A recent study showed that SELENOP genes are present across metazoan lineages with highly variable numbers of Sec-TGA codons, ranging from a single Sec residue in certain insects to up to 132 in bivalve mollusks, implying a highly dynamic evolutionary process of this selenoprotein [
127]. Very recently, it was also reported that Sec could be encoded by several early-branching fungal phyla, which provides new insights into the evolution of Sec utilization in fungi [
128].
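Because Sec is encoded by in-frame TGA codons, a first-pass count of candidate Sec positions in a coding sequence is easy to compute. The sketch below is a simplification under stated assumptions: the function name `count_inframe_tga` and the toy sequences are hypothetical, the input is assumed to be a coding sequence that starts in frame, and a terminal stop codon is ignored. An in-frame TGA is only a candidate; confirming Sec insertion requires additional evidence such as a SECIS element, which dedicated selenoprotein prediction tools evaluate and this sketch does not.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def count_inframe_tga(cds):
    """Count in-frame TGA codons in a coding sequence, ignoring a terminal stop."""
    seq = cds.upper().replace("U", "T")  # accept RNA input as well
    usable = len(seq) - len(seq) % 3     # drop any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]             # the last codon terminates translation
    return codons.count("TGA")

# Toy sequences (not real genes).
two_candidates = count_inframe_tga("ATGTGAGCTTGATGCTAA")  # ATG TGA GCT TGA TGC TAA
terminal_only = count_inframe_tga("ATGGCTTGA")             # TGA is only the stop
out_of_frame = count_inframe_tga("ATGATGACCTAA")           # TGA spans two codons
```

Applied to SELENOP orthologs, a count of this kind corresponds to the number of Sec-TGA codons discussed above, provided that each in-frame TGA has been validated as a Sec codon.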
Theoretically, comparative genomic approaches could be applied to study the metabolism of all trace metals and to identify the corresponding metalloproteins. However, due to limited knowledge about metal-binding sites and related properties for several other metals such as Mn, Cr, and V, metalloproteins that are strictly dependent on them remain poorly defined. For example, Mn is known to serve as a substitute for some other metals (e.g., Zn and Mg) in the active sites of numerous enzymes, making it difficult to distinguish Mn-dependent proteins from other metalloproteins [
129]. Only a few proteins have been reported to bind Cr or V in certain organisms, including Cr-containing oligopeptide chromodulin and V-containing vanabins and haloperoxidases [
130,
131]; however, it is unclear whether these proteins are strictly dependent on the corresponding metal in other organisms. Therefore, comparative analysis of these metalloproteins remains a difficult task that needs to be addressed in the future.