Open Issues for Protein Function Assignment in Haloferax volcanii and Other Halophilic Archaea

Background: Annotation ambiguities and annotation errors are a general challenge in genomics. While a reliable protein function assignment can be obtained by experimental characterization, this is expensive and time-consuming, and the number of such Gold Standard Proteins (GSP) with experimental support remains very low compared to proteins annotated by sequence homology, usually through automated pipelines. Even a GSP may give a misleading assignment when used as a reference: the homolog may be close enough to support isofunctionality, but the substrate of the GSP is absent from the species being annotated. In such cases, the enzymes cannot be isofunctional. Here, we examined a variety of such issues in halophilic archaea (class Halobacteria), with a strong focus on the model haloarchaeon Haloferax volcanii. Results: Annotated proteins of Hfx. volcanii were identified for which public databases tend to assign a function that is probably incorrect. In some cases, an alternative, probably correct, function can be predicted or inferred from the available evidence, but this has not been adopted by public databases because experimental validation is lacking. In other cases, a probably invalid specific function is predicted by homology, and while there is evidence that this assigned function is unlikely, the true function remains elusive. We listed 50 of those cases, each with detailed background information, so that a conclusion about the most likely biological function can be drawn. For reasons of brevity and comprehension, only the key aspects are listed in the main text, with detailed information being provided in a corresponding section of the Supplementary Materials. Conclusions: Compiling, describing and summarizing these open annotation issues and functional predictions will benefit the scientific community in the general effort to improve the evaluation of protein function assignments and more thoroughly detail them. 
By highlighting the gaps and likely annotation errors currently in the databases, we hope this study will provide a framework for experimentalists to systematically confirm (or disprove) our function predictions or to uncover yet more unexpected functions.

Genome annotations are frequently compromised by annotation errors [11,63–65]. Many of these errors are caused by an invalid annotation transfer between presumed assimilation [90]. However, in Hbt. salinarum, this metabolic process for Fdx reoxidation does not exist.
(c) The nuo cluster of haloarchaea resembles that of E. coli, a type I NADH dehydrogenase, with the genes and gene order highly conserved and just a few domain fissions and fusions. However, haloarchaea lack NuoEFG [91], which is a subcomplex that mediates interaction with NADH [92,93]. Thus, the haloarchaeal nuo complex is unlikely to function as NADH dehydrogenase, despite its annotation as such in KEGG (as of April 2021).
(d) Other catabolic enzymes generate NADH, which must also be reoxidized. Based on inhibitor studies, NADH is not reoxidized by a type I but, rather, by a type II NADH dehydrogenase in Hbt. salinarum [82]. A tentative gene assignment has been made for Natronomonas pharaonis [66]. However, for reasons detailed in Supplementary Text S1 Section S1, this assignment is highly questionable, so this issue calls for an experimental analysis.
(e) About one-third of the haloarchaea, especially the Natrialbales, do not code for a complex III equivalent (the cytochrome bc1 complex encoded by petABC), according to OrthoDB analysis. The bc1 complex is required to transfer electrons from the lipid-embedded two-electron carrier (menaquinone in haloarchaea) to the one-electron carrier associated with terminal oxidases (probably halocyanin). How electrons flow in the absence of a complex III equivalent is currently unresolved.
The haloarchaeal petABC genes resemble those of the chloroplast b6-f complex rather than those of the mitochondrial bc1 complex (see Supplementary Text S1 Section S1 for more details).
(f) A bc cytochrome was purified from Nmn. pharaonis, but with an atypical 1:1 ratio between the b-type and c-type hemes [81]. The complex is heterodimeric, with subunits of 18 kDa and 14 kDa. The 18-kDa subunit carries the covalently attached heme group [81]. An attempt was made to identify the genes coding for these subunits [94] (for details, see Supplementary Text S1 Section S1). Two approaches were used to obtain protein sequence data. One was the N-terminal protein sequencing of the two subunits extracted from an SDS-polyacrylamide gel. In the other, peptides from the purified complex were separated by HPLC, and a peptide that absorbed at 280 nm (protein), as well as at 400 nm (heme), was isolated. Absorption at 400 nm clearly indicates that the isolated peptide contains a covalently attached heme group. The sequences from the two approaches overlapped and resulted in a contiguous sequence of 41 aa, with only the penultimate position remaining undefined [94]. Based on this information, a PCR probe was generated (designated "cyt-C Sonde") that allowed the gene to be identified and sequenced, including its genomic neighborhood. It turned out that the genes coding for the four subunits of succinate dehydrogenase (sdhCDBA) were isolated. The obtained protein sequence corresponds to the N-terminal region of SdhD (with the initiator methionine cleaved off), with only two sequence discrepancies in addition to the unresolved penultimate residue.
In the PhD thesis [94], this unambiguous result was rated a failure (and the data were never formally published). The reason is that SdhD is free of cysteine residues, while standard textbooks state that a pair of cysteines is required for covalent heme attachment [95]. The lack of the required cysteine pair was taken to indicate that the results were incorrect and that the identified genes did not encode the cytochrome bc that the study was seeking [94]. In contrast, we speculate that the results were completely correct, despite being in conflict with the cysteine pair paradigm. In our opinion, a paradigm shift is required. The obtained results call for a yet-unanticipated novel mode of covalent heme attachment, exemplified by the 18-kDa subunit of Natronomonas succinate dehydrogenase, SdhD. It should be noted that translation of the gene revealed three histidine residues within the 41-aa protein sequence, none of which were detected upon Edman degradation.
In Halobacterium, a small c-type cytochrome was purified (cytochrome c552, 14.1 kDa) [96]. Heme staining after SDS-PAGE indicated a covalent heme attachment, but no sequence or composition data were reported, so that it was not possible to identify the protein based on the available information. We speculate that the Halobacterium cytochrome c552 also represents SdhD (as detailed in Supplementary Text S1 Section S1). In that case, the proposed novel type of covalent heme attachment would not be restricted to Nmn. pharaonis but might be a general property of haloarchaea. This would also solve the "Halobacterium paradox" [95].
(g) The haloarchaeal one-electron carrier is the copper protein halocyanin rather than the iron-containing heme protein cytochrome-c. A halocyanin from Nmn. pharaonis (NP_3954A) was characterized, including its redox potential [97–99]. A gene fusion supports the close connection of a halocyanin with a subunit of a terminal oxidase. For further details, see Supplementary Text S1 Section S1.
(h) Terminal oxidases are highly diverse in haloarchaea, and we restricted our analysis to three species (Nmn. pharaonis, Hfx. volcanii and Hbt. salinarum), because in each of these, at least one terminal oxidase has been experimentally studied (Table 1). The details are described in Supplementary Text S1 Section S1 with subunits of all analyzed terminal oxidases listed in Supplementary Table S1.
(i) NAD-dependent oxidative decarboxylation is a canonical reaction to convert pyruvate into acetyl-CoA and α-ketoglutarate into succinyl-CoA. In haloarchaea, the conversion of pyruvate to acetyl-CoA and α-ketoglutarate to succinyl-CoA is dependent on ferredoxin, not on NAD (see above). Nevertheless, most haloarchaeal genomes also code for homologs of enzymes catalyzing NAD-dependent oxidative decarboxylation, such as the E. coli pyruvate dehydrogenase complex. In most cases, the substrates could not be identified, an exception being a paralog involved in isoleucine catabolism [116]. In several cases, the enzymes were found not to show catalytic activity with pyruvate or α-ketoglutarate (see Supplementary Text S1 Section S1 for details). Additionally, a conditional lethal porAB mutant was unable to grow on glucose or pyruvate, thus excluding that alternative enzymes for the conversion of pyruvate to acetyl-CoA exist in Hfx. volcanii [22]. Nonetheless, despite experimental results to the contrary, pyruvate has been assigned as a substrate for some of the homologs of the pyruvate dehydrogenase complex in KEGG (as of April 2021).
Legend to Table 1. The column Section refers to the table listing the protein and to the section in the Results and in Supplementary Text S1. As an example, 2c covers topic (c) from the decimal-numbered Results Section 3.2. Amino Acid Biosynthesis. In Supplementary Text S1, this is covered under Section S2 subsection S2.c. The corresponding proteins are listed in Table 2. For a few proteins, two sections are indicated (e.g., 1a/1b).
The column Code refers to a haloarchaeal protein by its locus tag, which is mainly from Haloferax volcanii (HVO) but, also, from Halobacterium salinarum (OE), Natronomonas pharaonis (NP) and Halohasta litchfieldiae (halTADL). When the reconstruction of a complete pathway is presented, the unassigned genes are indicated as a "pathway gap". In one case, we indicate the absence of a haloarchaeal ortholog by a dash. In the case of a complex, we either list more than one code or we list only one subunit together with the term (complex). All subunits of these complexes are listed groupwise in Table S1. A protein may be shown in more than one row. From the second row onwards, this is indicated by the term (cont.). The column Gene lists the assigned gene or a dash if no gene has been assigned. The assigned gene is only indicated in the first row of a protein.
A set of four columns is used to relate a query protein to an experimentally characterized homolog, a GSP (Gold Standard Protein) (isofunc, %seq_id, Locus tag, UniProt). The column isofunc indicates if the query protein and its Gold Standard Protein homolog are isofunctional. The meanings of the terms used in this column in Tables 1-9 (yes, no, yes/no, probably, possibly, unclear, unknown, prediction, special and "-") are described at the end of this legend. The column %seq_id indicates the protein sequence identity between the query protein and the homologous GSP. The column Locus tag contains the locus tag, if assigned. The column UniProt contains the UniProt accession of the GSP. GSPs are experimentally characterized as described in a publication. The column Reference links to the reference list of the manuscript. The column PMID lists the PubMed ID of the publication, if available. Otherwise, this is indicated as "not in PubMed". Additionally, one PhD thesis is indicated (PhD_Mattar). The column Comment provides various types of additional information.
The terms used in the column isofunc in Tables 1-9 have the following meanings: The term "yes" indicates that we consider the two proteins as isofunctional and annotate the query protein accordingly. The term "no" is used when we conclude that the proteins differ in function. Additional terms are used for more difficult cases. The term "yes/no" is used for GSPs that are multifunctional, and we assign only a subset of these functions to the query protein. The term "probably" is used when we consider it likely that the proteins are isofunctional and annotated the query protein accordingly (with the term probable added to the protein name). The term "possibly" is used when we see a good chance that the proteins are isofunctional but consider it too speculative to annotate the protein accordingly. The term "unclear" is used when we consider it likely that the same overall reaction is catalyzed but when reaction details, e.g., the energy-providing compound, are unresolved. The term "unknown" is used when it is not possible to predict the substrate of the query protein. The term "prediction" is used if a function assignment is based on bioinformatic analyses but not yet on an experimentally characterized homologous protein. The term "special" is used when multiple arguments have to be considered, with the full details provided in the corresponding section of Supplementary Text S1. Finally, a hyphen ("-") is used when isofunctionality does not apply, e.g., when a homologous Gold Standard Protein could not be identified.
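The isofunc vocabulary defined in the table legend can be treated as a small controlled vocabulary when the tables are processed programmatically. The following is a minimal sketch under stated assumptions: the class name `Isofunc` and the function `annotated_name` are hypothetical and not part of the original study, and placing "probable" as a prefix of the protein name is an assumption modeled on names such as "probable NAD kinase". Only the three terms for which the legend states an annotation consequence ("yes", "probably", "possibly") are resolved automatically; all other terms are deferred to manual curation.

```python
from enum import Enum
from typing import Optional


class Isofunc(Enum):
    """Terms of the isofunc column as listed in the table legend."""
    YES = "yes"
    NO = "no"
    YES_NO = "yes/no"
    PROBABLY = "probably"
    POSSIBLY = "possibly"
    UNCLEAR = "unclear"
    UNKNOWN = "unknown"
    PREDICTION = "prediction"
    SPECIAL = "special"
    NOT_APPLICABLE = "-"


def annotated_name(gsp_function: str, term: Isofunc) -> Optional[str]:
    """Return the name transferred from the GSP to the query protein.

    Returns None when the legend does not prescribe a direct transfer,
    so that the case is left to case-by-case curation.
    """
    if term is Isofunc.YES:
        # Considered isofunctional: annotate the query protein accordingly.
        return gsp_function
    if term is Isofunc.PROBABLY:
        # Likely isofunctional: annotate with "probable" added
        # (prefix placement is an assumption of this sketch).
        return f"probable {gsp_function}"
    # "possibly" is stated to be too speculative to annotate; the
    # remaining terms need manual evaluation and are not resolved here.
    return None


if __name__ == "__main__":
    print(annotated_name("NAD kinase", Isofunc("probably")))
```

Parsing the raw table value with `Isofunc(value)` raises a `ValueError` for any term outside the legend, which makes typos in a curated table visible early.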

Amino Acid Metabolism
While most amino acid biosynthesis and degradation pathways can be reliably reconstructed, a few open issues remain, which are discussed below.
(a) The first and last steps of arginine biosynthesis deal with blocking and unblocking of the α-amino group of the substrate (glutamate) and a product intermediate (ornithine). As detailed in Supplementary Text S1 Section S2, it is highly likely that glutamate is attached to the γ-carboxyl group of a carrier protein, and ornithine is released from that carrier protein. This is based on characterized proteins from Thermus thermophilus [124], Thermococcus kodakarensis [125] and Sulfolobus acidocaldarius [126]. The assignment is strongly supported by clustering of the arginine biosynthesis genes. Some of the homologs are bifunctional, being involved in arginine biosynthesis but, also, in lysine biosynthesis via the prokaryotic variant of the α-aminoadipate pathway. This ambiguity is not assumed to occur in haloarchaea, which use the diaminopimelate pathway for lysine biosynthesis [127] (see Supplementary Text S1 Section S2 for further discussion of this issue).
Expanding the above, we provided full details underlying our reconstruction of arginine and lysine biosynthesis in Hfx. volcanii in Table 2.
(b) Archaea use a different precursor for aromatic amino acid biosynthesis than the classical pathway. This has been resolved for Methanocaldococcus jannaschii and for Methanococcus maripaludis [146,156]. However, the initial steps may differ from those reported for Methanocaldococcus in that fructose 1,6-bisphosphate, rather than 6-deoxy-5-ketofructose, might be a substrate [145]. A clean deletion of the corresponding enzymes and confirmation with in vitro assays have not yet been achieved (for details, see Supplementary Text S1 Section S2).
(c) The gene for tryptophanase (tpa) is stringently regulated in Haloferax, which is the basis for using its promoter in the toolbox for regulated gene expression [157]. The shutdown of this gene avoids tryptophan degradation when supplies are scarce. Tryptophanase cleaves tryptophan into indole, pyruvate and ammonia. The fate of indole, however, is still unresolved.
(d) A probable histidine utilization cluster exists, based on the characterized homologs from Bacillus subtilis, but has not yet been experimentally verified.
(e) Among the 16 auxotrophic mutants observed in a Hfx. volcanii transposon insertion library [9], some could grow only in the presence of one (or several) supplied amino acids. In many cases, the affected genes were known to be involved in the corresponding pathway, but the others may lead to novel function assignments. One affected gene resulted in histidine auxotrophy, and the product of this gene (HVO_0431) is an interesting candidate. The InterPro domain assignment (HAD family hydrolase) fits into the only remaining pathway gap in histidine biosynthesis (histidinol-phosphatase). In this context, it should be noted that the enzyme that catalyzes the preceding reaction (encoded by hisC) is part of a highly conserved three-gene operon involved in polar lipid biosynthesis (see below). For details, see Supplementary Text S1 Section S2. One affected gene resulted in isoleucine auxotrophy. The product of this gene (HVO_0644) is currently annotated to catalyze two reactions, one being an early step in isoleucine biosynthesis (EC 2.3.1.182) and the other being the first step after leucine biosynthesis branches off from valine biosynthesis (EC 2.3.3.13) (see below, (f)) (for details, see Supplementary Text S1 Section S2).
(f) Hfx. volcanii codes for two paralogs with an attributed function as 2-isopropylmalate synthase (EC 2.3.3.13). This is the first reaction specific to leucine biosynthesis when the pathway branches off valine biosynthesis. One paralog, HVO_0644, is annotated as bifunctional, also catalyzing a chemically similar reaction that is an early step in isoleucine biosynthesis (EC 2.3.1.182). When the gene encoding HVO_0644 is disrupted by transposon integration, cells cannot grow in the absence of isoleucine. It is unclear if the protein is really bifunctional and is really involved in leucine biosynthesis, catalyzing the reaction of EC 2.3.3.13. The other paralog, HVO_1510, belongs to an ortholog set with major problems concerning the start codon assignment. The ortholog set from the 16 genomes listed in Supplementary Table S2 was analyzed. When only canonical start codons are considered (ATG, GTG and TTG), the orthologs from Haloferax mediterranei, Nmn. pharaonis, Natronomonas moolapensis and Halohasta litchfieldiae either lack a long highly conserved N-terminal region or they are disrupted (pseudogenes), being devoid of a potential start codon. The gene from Hfx. volcanii has a start codon (GTG) that is consistent with that of Haloferax gibbonsii strain LR2-5 (but a GTA in Hfx. gibbonsii strain ARA6). In this region, the gene from Hfx. mediterranei is closely related but has in-frame stop codons. HVO_1510 is considerably longer than the orthologs from Haloquadratum walsbyi, Haloarcula hispanica and Natrialba magadii. The first alternative start codon for HVO_1510 codes for Met-93. This protein was proteomically identified in three ArcPP datasets [2], and peptides upstream of Met-93 were identified. This gene might be translated from an atypical start codon, either an in-frame CTG or an out-of-frame ATG, which would require ribosomal slippage (for details, see Supplementary Text S1 Section S2 and Supplementary Figure S1). 
It is tempting to speculate that translation occurs only when leucine is not available.
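The start codon problem described for HVO_1510 in topic (f) can be illustrated with a minimal sketch that walks upstream from an annotated start, codon by codon, and reports in-frame start codon candidates until an in-frame stop codon is reached. The function name `upstream_inframe_starts` and the toy sequence are hypothetical, and no real HVO_1510 sequence data are used. The sketch covers only in-frame candidates (canonical ATG, GTG and TTG and, optionally, an atypical CTG); the out-of-frame ATG scenario that would require ribosomal slippage is not modeled.

```python
CANONICAL_STARTS = frozenset({"ATG", "GTG", "TTG"})
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def upstream_inframe_starts(dna, annotated_start, starts=CANONICAL_STARTS):
    """List in-frame start codon positions upstream of an annotated start.

    dna             -- coding-strand sequence (string of A/C/G/T)
    annotated_start -- 0-based index of the first base of the annotated start codon
    starts          -- set of codons accepted as start codons

    The search moves upstream in steps of three bases and ends at the
    first in-frame stop codon or at the beginning of the sequence.
    Positions are returned nearest-first.
    """
    dna = dna.upper()
    found = []
    pos = annotated_start - 3
    while pos >= 0:
        codon = dna[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # an upstream in-frame stop terminates the open frame
        if codon in starts:
            found.append(pos)
        pos -= 3
    return found


if __name__ == "__main__":
    # Toy sequence: stop, CTG, GCA, GTG, AAA, annotated ATG, GCC
    toy = "TAA" "CTG" "GCA" "GTG" "AAA" "ATG" "GCC"
    print(upstream_inframe_starts(toy, 15))
    print(upstream_inframe_starts(toy, 15, CANONICAL_STARTS | {"CTG"}))
```

When proteomic peptides map upstream of the annotated start, as reported for the region upstream of Met-93, an empty result for the canonical set combined with a hit for the extended set would point toward an atypical start codon.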

Coenzymes I: Cobalamin and Heme
The classical heme biosynthesis pathway branches off cobalamin biosynthesis at the level of uroporphyrinogen III. A second pathway exists in bacteria (CPD pathway). Haloarchaea use the alternative heme biosynthesis pathway [158], which has an additional common step with cobalamin biosynthesis, the conversion of uroporphyrinogen III to precorrin-2. For heme biosynthesis, precorrin-2 is converted into siroheme. This pathway was reconstructed [159], except for the iron insertion step. For de novo cobalamin biosynthesis, haloarchaea use the cobalt-early pathway. A key reaction in this pathway variant, catalyzed by CbiG, is cobalt-dependent. Thus, cobalt must be inserted early and is present in all intermediates [160]. Several aspects of heme and cobalamin biosynthesis in haloarchaea have yet to be resolved. This is illustrated in Figure 1.
Figure 1. [...] are not displayed. The circle for sirohydrochlorin is highlighted in red, as this is the branchpoint for heme and cobalamin biosynthesis in haloarchaea. Enzymatic reactions are shown by arrows, the EC numbers being provided in rectangular boxes. Rectangles are colored when the enzyme has been reconstructed for haloarchaea (blue: heme biosynthesis; dark yellow: de novo cobalamin biosynthesis; light yellow: late cobaltochelatase, which may be a salvage reaction). Gene names in green are adopted from KEGG and represent those from bacterial model pathways. Consecutive arrowheads indicate reaction series that are not shown in detail for space reasons. Additionally, some enzymes of the heme biosynthesis pathway are omitted for space reasons. For enzymatic reactions that are considered to be open issues, Hfx. volcanii locus tags are provided. For two pathway gaps (white boxes in the cobalt-early pathway), the type of reaction is indicated (oxidoreductase and ~CH3, indicating a methylation reaction). The question mark after HVO_B0058 indicates that this protein, currently co-attributed to EC 2.1.1.272, is a candidate for the yet-unassigned EC 2.1.1.195 reaction. We note that haloarchaea might use a deviating biosynthesis pathway, e.g., by swapping the methylation and oxidoreductase reactions (not illustrated). (B) The major cobalamin cluster, encoded on megaplasmid pHV3. Arrows are used to indicate the coding strand and are roughly drawn to scale. If assigned, the gene name is provided in addition to the Hfx. volcanii locus tag. Locus tags in red indicate genes that are part of the cobalamin cluster.
(a) Hfx. volcanii contains two annotated cbiX genes. For the reasons detailed in Supplementary Text S1 Section S3, we predict that one is a cobaltochelatase, involved in cobalamin biosynthesis, while the other is a ferrochelatase, responsible for the conversion of precorrin-2 to siroheme in the alternative heme biosynthesis pathway.
(b) De novo cobalamin biosynthesis has been extensively reconstructed upon curation of the genome annotation [11]. All enzymes of the pathway and their associated GSPs are listed in Table 3. Only two pathway gaps remained, and because these are consecutive, it is possible that the haloarchaeal pathway is noncanonical and proceeds via a novel biosynthetic intermediate. There are only four genes with yet-unassigned functions in the Hfx. volcanii cobalamin gene cluster, and their synteny is well-conserved in the majority of haloarchaeal genomes. Thus, these genes are obvious candidates for filling the pathway gaps (for details, see Supplementary Text S1 Section S3).
(c) The cobalamin biosynthesis and salvage reactions (those beyond ligand cobyrinate a,c diamide) involve "adenosylation of the corrin ring, attachment of the aminopropanol arm, and assembly of the nucleotide loop that bridges the lower ligand dimethylbenzimidazole and the corrin ring" [161]. The enzymes of these branches of cobalamin biosynthesis and their associated GSPs are listed in Table 3. Only two pathway gaps remain open. For one of these, a candidate was proposed upon a detailed bioinformatic analysis [161] (for further details, see Supplementary Text S1 Section S3).
(d) In the cobalt-late (aerobic) pathway variant, the intermediates are cobalt-free, and cobalt is inserted only late in the pathway. Even though haloarchaea do not use the cobalt-late pathway, so that a late cobaltochelatase is not required, they code for a homolog of the large subunit of a characterized heterotrimeric late cobaltochelatase. The adjacent gene is homologous to small subunits of other chelatases. We speculate that this late cobaltochelatase may be involved in cobalamin salvage. The chelatase has a mosaic subunit structure, as also reported previously [161] (see Supplementary Text S1 Section S3 for details).
(e) In the alternative heme biosynthesis pathway, siroheme is decarboxylated to 12,18-didecarboxysiroheme, which is attributed to the proteins encoded by ahbA and ahbB. These are homologous to each other and are organized as two two-domain proteins. It is unclear if AhbA and AhbB function independently or if they form a complex.
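The notion of a "pathway gap" used in topics (b) and (c), and the observation that two gaps in the de novo pathway are consecutive, can be expressed as a simple check on an ordered list of reaction steps. The sketch below is illustrative only: the type `Step`, the function names and the toy pathway with placeholder gene names (`geneA`, `geneB`) are hypothetical and do not reproduce the actual reconstruction in Table 3.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    reaction: str            # short description of the reaction
    gene: Optional[str]      # assigned gene or locus tag; None marks a pathway gap


def pathway_gaps(steps: List[Step]) -> List[int]:
    """Return the indices of all steps without an assigned gene."""
    return [i for i, step in enumerate(steps) if step.gene is None]


def gaps_are_consecutive(steps: List[Step]) -> bool:
    """True if there are at least two gaps and they directly follow each other."""
    gaps = pathway_gaps(steps)
    return len(gaps) >= 2 and all(b - a == 1 for a, b in zip(gaps, gaps[1:]))


if __name__ == "__main__":
    # Toy pathway with placeholder names; the two unassigned steps mirror
    # the reaction types named in the Figure 1 caption.
    toy_pathway = [
        Step("upstream step", "geneA"),
        Step("oxidoreductase", None),
        Step("methylation", None),
        Step("downstream step", "geneB"),
    ]
    print(pathway_gaps(toy_pathway), gaps_are_consecutive(toy_pathway))
```

Consecutive gaps are worth flagging separately because, as argued in topic (b), they leave open the possibility that the pathway proceeds via a different intermediate rather than simply lacking two identified enzymes.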

Coenzymes II: Coenzyme F420
Even though coenzyme F420 is predominantly associated with methanogenic archaea [190,191], it occurs also in bacteria, and a small amount of this coenzyme has been detected in non-methanogenic archaea, including halophiles [192]. The genes required for the biosynthesis of this coenzyme are encoded in haloarchaeal genomes, but the origin and attachment of the phospholactate moiety are not completely resolved (see below). To the best of our knowledge, only a single coenzyme F420-dependent enzymatic reaction has yet been reported for halophilic archaea [193]. Thus, the importance of this coenzyme in haloarchaeal biology is currently enigmatic and awaits experimental analysis.
(a) The pathway that creates the carbon backbone of this coenzyme has been reconstructed. We list the enzymes with their associated GSPs in Table 4. Coenzyme F420 contains a phospholactate moiety, which was reported to originate from 2-phospholactate [194], but this compound is metabolically not well-connected. As summarized in Supplementary Text S1 Section S4, there are various new insights regarding this pathway from recent studies in other prokaryotes [195,196]. To the best of our knowledge, the haloarchaeal coenzyme F420 biosynthesis pathway has never been experimentally analyzed.
(b) The prediction of coenzyme F420-specific oxidoreductases in Mycobacterium and actinobacteria has been reported [197], leading to patterns and domains that are also found in haloarchaea. Several such enzymes are described in Supplementary Text S1 Section S4.
(d) The precursor for coenzyme F420 may be used by a photolyase involved in DNA repair.
Halophilic and methanogenic archaea use distinct coenzymes as one-carbon carriers (C1 metabolism): tetrahydrofolate in haloarchaea and methanopterin in methanogens [221,222]. Several characterized methanogenic proteins that act on or with methanopterin have comparably close homologs in haloarchaea (Table 5), which results in the misannotation of haloarchaeal proteins (e.g., in SwissProt) as being involved in methanopterin biology. We assume that the haloarchaeal proteins function with the haloarchaeal one-carbon carrier tetrahydrofolate and that this shift in coenzyme specificity is possible due to the structural similarity between methanopterin and tetrahydrofolate (a near-identical core structure consisting of a pterin heterocyclic ring linked via a methylene bridge to a phenyl ring) (Figure 2). A detailed review on the many variants of the tetrahydrofolate biosynthetic pathway is available [223].

Figure 2. The structure of the C1 coenzymes tetrahydrofolate and methanopterin and two enzymes that act on the attached C1 compound. (A) The structures of tetrahydromethanopterin (top) and tetrahydrofolate (bottom) illustrate the similarities and differences between these C1 coenzymes. The common pteridine-based ring system is highlighted in yellow, and the initial biosynthesis step that generates this ring system is catalyzed by homologous enzymes (topic (b)). Two methanopterin-specific methyl groups are outlined by dashed ovals. N5 and N10, which are involved in the binding of the C1 compound, are colored red. (B) Two enzymatic reactions that alter the oxidation level of the C1 compound are illustrated. The methanogenic and haloarchaeal enzymes are homologous, even though they use distinct C1 coenzymes (topic (c)). It should be noted that MTH-1752 uses coenzyme F420 (not illustrated, Section 3.4, topic (c)), and this might also hold true for HVO_1937.
(a) Folate biosynthesis requires aminobenzoate. We proposed candidates for a pathway from chorismate to para-aminobenzoate [66,224] (for details, see Supplementary Text S1 Section S5). However, these predictions have not been adopted by KEGG (accessed April 2021) and are unlikely to be adopted without experimental confirmation.
(b) GTP cyclohydrolase MptA (HVO_2348) catalyzes a reaction in the common part of tetrahydrofolate and methanopterin biosynthesis. The enzymes specific for methanopterin biosynthesis are absent from haloarchaea, and thus, the assignment of HVO_2348 to the methanopterin biosynthesis pathway in UniProt is invalid (accessed March 2021).
The next common pathway step (EC 3.1.4.56) has been resolved in M. jannaschii (MJ0837) but is still a pathway gap in halophilic archaea. MJ0837 is very distantly related to HVO_A0533, which is a promising candidate for experimental analysis.
HVO_2628 shows 30% protein sequence identity with the enzyme catalyzing the first committed step to methanopterin biosynthesis. As detailed in Supplementary Text S1 Section S5, we consider it likely that it does not catalyze that reaction.
(c) Two enzymes that alter the oxidation level of the coenzyme-attached one-carbon compound probably function with tetrahydrofolate, even though their methanogenic homologs function with methanopterin. In contrast to their assignments in KEGG and UniProt (as of March 2021), their probable functions are thus methenyltetrahydrofolate cyclohydrolase (HVO_2573) and 5,10-methylenetetrahydrofolate reductase (HVO_1937) (see Figure 2 and Supplementary Text S1 Section S5).
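Statements such as "30% protein sequence identity" depend on how identity is computed from a pairwise alignment. The following minimal sketch shows one common convention, assumed here for illustration only: identical residues divided by the number of aligned columns in which neither sequence has a gap. The function name `percent_identity` and the toy alignment are hypothetical; the source does not state which tool or denominator was used for the %seq_id values, and other conventions (e.g., dividing by the full alignment length or by the shorter sequence length) give different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over gap-free columns of a pairwise alignment.

    Both arguments are rows of the same alignment (equal length),
    with '-' as the gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue  # columns containing a gap are not counted
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment: 4 identical residues in 5 gap-free columns
    print(percent_identity("MKT-LV", "MRTALV"))
```

A low value on its own does not decide the question of isofunctionality in either direction; in the cases above, it is weighed together with the presence or absence of the rest of the pathway in the organism.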

Coenzymes IV: NAD and FAD (Riboflavin)
(a) The energy source for NAD kinase may be ATP or polyphosphate. This is unresolved for the two paralogs of probable NAD kinase (HVO_2363, nadK1 and HVO_0837, nadK2). These show only 25% protein sequence identity to each other (see Supplementary Text S1 Section S6). Polyphosphate was not found in exponentially growing Hfx. volcanii cells [235], and thus ATP is the more likely energy source.
(b) HVO_0781 is encoded in nearly all haloarchaeal genomes, according to OrthoDB, and shows very strong syntenic coupling with the adjacent gene, HVO_0782, according to SyntTax analysis. Characterized homologs to HVO_0781 cleave S-adenosyl-methionine into methionine and adenosine, a reaction that seems wasteful. If so, then this gene would not be expected to be retained in most species and neither would it maintain a strongly conserved gene clustering (see Supplementary Text S1 Section S6). HVO_0782 is an enzyme involved in NAD biosynthesis, which is encoded in most haloarchaeal and archaeal genomes. Thus, HVO_0781 is also a candidate for being involved in NAD biosynthesis.
(c) We described the reconstruction of riboflavin biosynthesis, which is based on a detailed bioinformatic analysis [236]. The enzymes and their associated GSPs are listed in Table 6. Three pathway gaps remain, with candidate genes predicted for two of these [236] (for details, see Supplementary Text S1 Section S6).
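Several of the arguments above rest on pairwise protein sequence identity values (e.g., the 25% identity between the two NAD kinase paralogs). The following minimal sketch shows one way such a value can be computed from a pairwise alignment that has already been generated. The function name, the choice of denominator (ungapped alignment columns only) and the toy sequences are assumptions made for illustration; they are not the procedure or data used in this study, and other definitions of percent identity use different denominators.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns in which neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment and
    therefore have equal length. Other conventions (e.g., dividing by the
    full alignment length or by the shorter sequence) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns without a gap in either sequence
    identical = 0  # compared columns with the same residue
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a == res_b:
            identical += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Hypothetical toy fragments (not real haloarchaeal sequences).
toy_a = "MKV-LTAGE"
toy_b = "MRVALT-GD"
print(f"{percent_identity(toy_a, toy_b):.1f}% identity")
```

A value of this kind only quantifies similarity; as argued throughout this text, it does not by itself establish that two proteins are isofunctional.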

Biosynthesis of Membrane Lipids, Bacterioruberin and Menaquinone
Archaeal membrane lipids contain ether-linked isoprenoid side chains (see [250] and the references cited therein). The isoprenoid precursor isopentenyl diphosphate is synthesized in haloarchaea by a modified version of the mevalonate pathway [251]. Isoprenoid units are then linearly condensed into the C20 compound geranylgeranyl diphosphate. The haloarchaeal core lipid, archaeol, consists of 2,3-sn-glycerol with two C20 isoprenoid side chains attached by ether linkages. In some archaea, especially alkaliphiles, C25 isoprenoids are also found (see, e.g., [252,253]). Additionally, a number of distinct headgroups are found in polar lipids (phospholipids) (reviewed in [250]) (Figure 3). Even though polar lipids are used as important taxonomic markers [254], their biosynthetic pathways are not completely resolved.
Haloarchaea typically have a red color, which is due to carotenoids, mainly the C50 carotenoid bacterioruberin [255][256][257]. For carotenoid biosynthesis, two molecules of geranylgeranyl diphosphate, a C20 compound, are linked head-to-head to generate phytoene, which is desaturated to lycopene [66,258]. The pathway from lycopene to the C50 compound bacterioruberin has been experimentally characterized [257,259].
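The carbon bookkeeping behind the chain lengths mentioned above (C20, C25 and C50) can be made explicit with a short sketch. It assumes that each isoprenoid unit contributes five carbon atoms and that desaturation changes only the hydrogen count. The reading of the ten-carbon difference between lycopene and bacterioruberin as two further C5 units is general carotenoid knowledge and is not stated in the text above. All names in the code are illustrative.

```python
C5_UNIT = 5  # carbon atoms per isoprenoid unit (e.g., isopentenyl diphosphate)


def linear_condensation(units: int) -> int:
    """Carbon count of a linear chain built from `units` C5 building blocks."""
    return units * C5_UNIT


def head_to_head(chain_a: int, chain_b: int) -> int:
    """Carbon count after head-to-head linkage of two isoprenoid chains."""
    return chain_a + chain_b


ggpp = linear_condensation(4)        # geranylgeranyl diphosphate, C20
gfpp = linear_condensation(5)        # geranylfarnesyl diphosphate, C25
phytoene = head_to_head(ggpp, ggpp)  # two C20 chains linked head-to-head
lycopene = phytoene                  # desaturation leaves the carbon count unchanged
bacterioruberin = 50                 # C50, as stated in the text

# Assumption: the remaining carbons are added as whole C5 units.
extra_units = (bacterioruberin - lycopene) // C5_UNIT

print(f"GGPP: C{ggpp}, GFPP: C{gfpp}, phytoene/lycopene: C{lycopene}")
print(f"C5 units added between lycopene and bacterioruberin: {extra_units}")
```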
(a) We assigned HVO_2725 (idsA1, paralog of NP_3696A) and HVO_0303 (idsA2, paralog of NP_0604A) for the linear isoprenoid condensation reactions, resulting in a C20 isoprenoid (EC 2.5.1.10 and EC 2.5.1.29, short-chain isoprenyl diphosphate synthase) (see, also, Supplementary Text S1 Section S7). Some archaea, mainly haloalkaliphiles, also contain C25 isoprenoid side chains. Geranylfarnesyl diphosphate synthase, the enzyme that generates the C25 isoprenoids, has been purified and enzymatically characterized from Nmn. pharaonis [260], but data required for the assignment to a specific gene have not been collected. Three paralogous genes from Nmn. pharaonis are candidates for this function (NP_0604A, NP_3696A and NP_4556A). Since NP_0604A and NP_3696A have orthologs in Hfx. volcanii, a species devoid of C25 lipids, we assigned the synthesis of C25 isoprenoids (geranylfarnesyl diphosphate synthase activity) to the third paralog, NP_4556A. UniProt assigned C25 biosynthesis activity to NP_3696A for undescribed reasons (as of April 2021), and KEGG does not make this assignment for any of the three paralogs (as of April 2021). Our assignments are supported by an analysis of the key residues that determine the length of the isoprenoid chain [261]. These authors labeled the cluster containing NP_3696A (WP011323557.1) as "C15/C20" and the cluster containing NP_4556A (WP011323984.1) as "C20->C25->C30?".
(b) Typical polar lipids in haloarchaea (Figure 3) are phosphatidylglycerophosphate methyl ester (PGP-Me) and phosphatidylglycerol (PG), as well as phosphatidylglycerosulfate (PGS) [261][262][263]. Other polar lipids are archaetidylserine and its decarboxylation product archaetidylethanolamine, both of which are found in rather low quantities in Haloferax [264]. A third group of polar lipids has a headgroup derived from myo-inositol. The biosynthetic pathway of the headgroup is only partially resolved. One CDP-archaeol 1-archaetidyltransferase that belongs to a highly conserved three-gene operon may attach either glycerol phosphate or myo-inositol phosphate. In Supplementary Text S1 Section S7, we summarize the arguments in favor of each of these candidates, but the true function can only be decided by experimental analysis.
(c) Carotenoid biosynthesis involves the head-to-head condensation of the C20 isoprenoid geranylgeranyl diphosphate to phytoene, which is desaturated to lycopene [66,258]. The crtB gene product (e.g., HVO_2524) catalyzes the head-to-head condensation. It is still uncertain which gene product is responsible for the desaturation of phytoene to lycopene. The further pathway from lycopene to bacterioruberin has been experimentally characterized in Haloarcula japonica [257]. A three-gene cluster (crtD-lyeJ-cruF) codes for the three enzymes of this pathway. The synteny of this three-gene cluster is strongly conserved, according to SyntTax analysis. Several genes that are certainly or possibly involved in carotenoid biosynthesis are encoded in the vicinity of this cluster (for details, see Supplementary Text S1 Section S7).
(d) Halophilic archaea contain menaquinone as a lipid-based two-electron carrier of the respiratory chain [264,265]. We described the reconstruction of the menaquinone biosynthesis pathway (Table 7), with two pathway gaps remaining open (see Supplementary Text S1 Section S7 for details).

[290,291]. Purified RNA polymerase containing the epsilon subunit transcribes native templates efficiently, in contrast to the RNA polymerase devoid of this subunit [291]. The biological relevance of this subunit is enigmatic (see Supplementary Text S1 Section S8).
(b) Two distant paralogs are found for haloarchaeal ribosomal protein S10 (uS10) in nearly all haloarchaeal genomes. It is uncertain whether both occur in the ribosome and, if so, whether they occur together or are mutually exclusive. The latter distribution would result in heterogeneity of the ribosomes. Alternatively, one of the paralogs may exclusively have a non-ribosomal function.
In a subset of archaea, two distant paralogs are found for haloarchaeal ribosomal protein S14 (uS14) (ca. 20% of the genomes, e.g., in Nmn. pharaonis). For more details, see Supplementary Text S1 Section S8.
(c) The ribosomal protein L43e (eL43) shows heterogeneity with respect to the presence of the C2-C2-type zinc finger motif. This zinc finger is found in L43e from all Halobacteriales and in all euryarchaeal proteins outside the class Halobacteria but is not found in Haloferacales and is very rare in Natrialbales. Eukaryotic orthologs (e.g., from rat and yeast) contain this zinc finger, and its biological importance has been experimentally shown for the yeast protein [292] (for details, see Supplementary Text S1 Section S8).
(d) Diphthamide is a complex covalent modification of a histidine residue of translation elongation factor a-EF2. This pathway has been reconstructed (Table 8) based on distant homologs (enzymes encoded by dph2 and dph5) and by a detailed bioinformatic analysis (enzyme encoded by dph6) [293] (for details, see Supplementary Text S1 Section S8). These uncertain function assignments await experimental confirmation.
(e) N-terminal signal sequences target proteins to the secretion machinery. Subsequent to membrane insertion or transmembrane transfer, the signal sequence is cleaved off by a signal peptidase. After cleavage, the signal peptide must be degraded to avoid clogging of the membrane. Degradation is catalyzed by signal peptide peptidase. Candidates for this activity have been predicted from two protein families [294,295] (for details, see Supplementary Text S1 Section S8).

Miscellaneous Metabolic Enzymes and Proteins with Other Functions
Here, we list a few other enzymatic or nonenzymatic functions for which candidate genes have been assigned but without experimental validation.
(a) Ketohexokinase from Haloarcula vallismortis has been experimentally characterized [307]. However, the activity was not assigned to a gene. Detailed bioinformatic analyses have been made [308,309] and point to a small set of orthologs represented by Hmuk_2662, the ortholog of HVO_1812 (for further details, see Supplementary Text S1 Section S9).
(b) The assignment of fructokinase activity to the Hht. litchfieldiae candidate gene hal-TADL_1913 (UniProt:A0A1H6QYL4) is based on a differential proteomic analysis [309] (see Supplementary Text S1 Section S9 for details). Very close homologs are rare in haloarchaea. For this protein family (carbohydrate kinase), it is unclear if more distant homologs (with about 50% protein sequence identity) are isofunctional.
(c) A candidate gene for glucoamylase is HVO_1711 for the reasons described in Supplementary Text S1 Section S9. The enzyme from Halorubrum sodomense has been characterized [310], but the activity has not yet been assigned to a gene.
(f) Haloarchaea may contain an NAD-independent L-lactate dehydrogenase, LudBC (HVO_1692 and HVO_1693). The deletion of this gene pair impairs growth on rhamnose, which is catabolized to pyruvate and lactate [21]. There is a very distant relationship (for details, see Supplementary Text S1 Section S9) to the LldABC subunits of the characterized L-lactate dehydrogenase from Pseudomonas stutzeri A1501 [317] and to the LutABC proteins from B. subtilis, which have been shown to be involved in lactate utilization [316].
(g) Hfx. volcanii may be able to convert urate into allantoin using the gene cluster HVO_B0299-HVO_B0302. This could be part of a complete degradation pathway for purines, but this has to be considered highly speculative (see Supplementary Text S1 Section S9 and Supplementary Figure S2).
(h) Hfx. volcanii may contain an enzyme having a "nickel-pincer cofactor". The biogenesis of this cofactor may be catalyzed by larBCE (as detailed in Supplementary Text S1 Section S9).
(i) Cyclic di-AMP (c-di-AMP) is an important nucleotide signaling molecule in bacteria and archaea. It is generated from two molecules of ATP by diadenylate cyclase (encoded by dacZ) and is degraded to pApA by phosphodiesterases [336]. The level of this signaling molecule is strictly controlled [337,338], thus requiring a sophisticated interplay of cyclase and phosphodiesterase. DacZ from Hfx. volcanii has been characterized, and it was shown that the c-di-AMP levels must be tightly regulated [37]. The degrading enzyme, however, has not yet been identified in Haloferax, but candidates have been proposed [332,336,339] (see Supplementary Text S1 Section S9).
(j) HVO_2763 is distantly related to RNase Z (HVO_0144, rnz). The experimental characterization of HVO_2763 [333] excluded activity as an exonuclease but did not reveal its physiological function. Upon transcriptome analysis, the downregulation of several genes was detected. Several of these were uncharacterized at the time of the experiment but have since been shown to be involved in the minor N-glycosylation pathway that was initially detected under low-salt conditions (see Supplementary Text S1 Section S9 for further details).
(k) A pair of genes (dabAB, HVO_2410 and HVO_2411) is predicted to function as a carbon dioxide transporter, based on the identification of such transporters in Halothiobacillus neapolitanus [335]. Being a member of the proton-conducting membrane transporter family, this protein may be misannotated as a subunit of the nuo or mrp complexes (see Supplementary Text S1 Section S9 for further details).

Conclusions
We described a large number of cases where the protein function cannot be correctly predicted when considerations are restricted to computational analyses and the biological context is not taken into account. One example is the switch from methanopterin to tetrahydrofolate as a C1 carrier in haloarchaea. Homologous enzymes, inherited from the common ancestor, have adapted to the new C1 carrier, rather than being replaced by non-homologous proteins. Function prediction tools may therefore misannotate haloarchaeal proteins as working with methanopterin. Another example is the nuo complex and its misannotation as a type I NADH dehydrogenase. In other cases, even a distant sequence similarity may allow a valid function prediction if additional evidence (e.g., from a gene neighborhood analysis or from a detailed evaluation of the metabolic pathway gaps) is taken into account. Examples include cobalamin cluster proteins, which probably close the two residual pathway gaps, and the predicted degradation pathway for purines. In all these cases, we presented reasonable hypotheses based on the current knowledge, and in many cases, these were so well-supported as to be compelling, but to be certain, experimental data are required. With this overview, we attempted to arouse the curiosity of our colleagues, hoping that they will confirm or disprove our speculations and, thus, advance the knowledge about haloarchaeal biology. Hfx. volcanii is a model species for halophilic archaea, and the more completely and correctly its genome is annotated, the higher will be its value for systems biology analyses (modeling) and for synthetic biology (metabolic engineering) and biotechnology.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/genes12070963/s1: Text S1: Detailed background information for all open annotation issues. Figure S1: Alignment of the 5' end of leuA2 (HVO_1510) from Hfx. volcanii DS2 with homologs from other species of Haloferax. Figure S2: The proposed purine degradation pathway. Table S1: Listing of all proteins mentioned in the text and in Supplementary Text S1. Table S2: Listing of the genomes which are manually curated and kept up to date.
Funding: This research received no specific grants from any funding agency in the public, commercial or not-for-profit sectors.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.