Phylogeny and Structure of Fatty Acid Photodecarboxylases and Glucose-Methanol-Choline Oxidoreductases

Glucose-methanol-choline (GMC) oxidoreductases are a large and diverse family of flavin-binding enzymes found in all kingdoms of life. Recently, a new related family of proteins has been discovered in algae named fatty acid photodecarboxylases (FAPs). These enzymes use the energy of light to convert fatty acids to the corresponding Cn-1 alkanes or alkenes, and hold great potential for biotechnological application. In this work, we aimed at uncovering the natural diversity of FAPs and their relations with other GMC oxidoreductases. We reviewed the available GMC structures, assembled a large dataset of GMC sequences, and found that one active site amino acid, a histidine, is extremely well conserved among the GMC proteins but not among FAPs, where it is replaced with alanine. Using this criterion, we found several new potential FAP genes, both in genomic and metagenomic databases, and showed that related bacterial, archaeal and fungal genes are unlikely to be FAPs. We also identified several uncharacterized clusters of GMC-like proteins as well as subfamilies of proteins that lack the conserved histidine but are not FAPs. Finally, the analysis of the collected dataset of potential photodecarboxylase sequences revealed the key active site residues that are strictly conserved, whereas other residues in the vicinity of the flavin adenine dinucleotide (FAD) cofactor and in the fatty acid-binding pocket are more variable. The identified variants may have different FAP activity and selectivity and consequently may prove useful for new biotechnological applications, thereby fostering the transition from a fossil carbon-based economy to a bio-economy by enabling the sustainable production of hydrocarbon fuels.


Introduction
The last few decades have seen a dramatic increase in research on photocatalysis, as is evident from the growing number of publications in the field [1]. Photocatalysis here refers to a reaction that requires light as an energy source for the conversion of a substrate. Since Fujishima and [...]

Glucose oxidases (GOx)
Oxidation of β-d-glucose at the C1 hydroxyl group, utilizing oxygen as the electron acceptor, with the concomitant production of d-glucono-delta-lactone and hydrogen peroxide [43]. GOx are highly specific for β-d-glucose as a substrate, although some of them can also oxidize other sugars, such as d-galactose, d-mannose or d-xylose [43,44].

FAD-dependent glucose dehydrogenases
Oxidation of glucose at the first hydroxyl group to glucono-1,5-lactone without utilizing oxygen as the electron acceptor.
Found in Gram-negative bacteria, fungi, and in some insects [43,45].

Alcohol oxidases (also known as methanol oxidases)
Oxidation of methanol as well as other short aliphatic alcohols with two to four carbon atoms [46-48] to the corresponding carbonyl compounds, accompanied by a release of hydrogen peroxide.
Mainly found in yeasts and filamentous fungi [48].

Aryl-alcohol oxidases
Oxidation of a plethora of aromatic, and some aliphatic, polyunsaturated alcohols bearing conjugated primary hydroxyl groups [49], accompanied by the formation of hydrogen peroxide at the expense of dioxygen [48].

[...]
The spread appears to be limited to a narrow group of fungi (Agaricaceae) [57].

[...]
Oxidation of formate to carbon dioxide with utilization of oxygen as an electron acceptor. They may also exhibit a low methanol oxidase activity [58].

[...]
Found in numerous wood-degrading fungi, both in basidiomycetes and ascomycetes [69,71].

[...]
Reversible cleavage of cyanohydrins such as (R)-mandelonitrile into the corresponding aldehyde or ketone and hydrogen cyanide.

Hydroxy fatty acid oxidases (HAOx)
Oxidation of long-chain ω-hydroxy fatty acids to ω-oxo fatty acids was ascribed to ACE/HTH [76], a 594 amino acid-long GMC family protein not related to other HAOxs.

Perhaps the most interesting GMC family members at the moment are fatty acid photodecarboxylases (FAPs), a recently discovered class of enzymes initially identified in the green alga Chlorella variabilis [77], which convert long-chain fatty acids to the corresponding Cn-1 alkanes/alkenes in the presence of blue light. FAP activity was also confirmed for proteins from Chlamydomonas reinhardtii [77], Galdieria sulphuraria, Chondrus crispus, Nannochloropsis gaditana and Ectocarpus siliculosus [78]. Overall, FAP genes were identified in more than 30 algal species as well as in metagenomic datasets [78]. The active site of a FAP forms a narrow hydrophobic tunnel, which can accommodate a long-chain fatty acid, with the carboxyl group in the vicinity of the FAD cofactor [77]. Mechanistically, it was suggested that the photoexcited FAD molecule of FAPs abstracts an electron from the fatty acid substrate, yielding a fatty acid radical, which decarboxylates to give an alkyl radical. As suggested in a very recent study, this is likely followed by hydrogen atom transfer from a conserved cysteine residue (C432 in Chlorella variabilis FAP, CvFAP) in the FAP active site to the alkyl radical, yielding the final alkane product [77,79].
Whereas FAPs are the only currently recognized photoenzymes in the GMC family, other proteins have been reported to respond to illumination. AOx from Candida tropicalis is rapidly inactivated by light [80]. Exposure to sunlight generates an ultrastable 8-formyl FAD semiquinone radical in FOx [81]. In COx, illumination leads to the formation of a metastable protein-flavin adduct [82].
In addition to being an interesting new class of photoenzymes, FAPs possess great potential for biotechnological application. Chlorella variabilis FAP (CvFAP) has recently been used to produce pentadecane (79% conversion with 16% yield) in a preparative-scale synthesis, showing high tolerance towards organic solvents such as dimethyl sulfoxide [83]. The carboxylic acid substrate scope and reaction rate of CvFAP were further altered in a protein engineering campaign on the substrate binding channel of CvFAP [84] and by using decoy molecules to fill up the vacant space in the substrate access channel of the enzyme [85]. CvFAP has also been used in cascade reactions with other enzymes in order to produce long-chain aliphatic amines and esters [86]. These studies show the potential of FAPs to be used in industrial biotechnological processes.
In this work, therefore, we set out to assess the diversity of GMC proteins and identify putative FAPs in genomic and metagenomic databases, with the goal of understanding their natural diversity, which could guide the discovery and design of new variants with potentially altered activity and specificity. Consequently, we used the available experimental data to develop criteria that would distinguish FAPs from other GMC family proteins, and applied these criteria to ~150,000 publicly available GMC sequences. We identified several previously unreported putative FAP sequences, and analyzed the variability of the FAP active site and substrate-binding pocket amino acids. The results obtained may be useful for the design of FAPs with improved or even new properties for biotechnological applications, and hence could foster the transition from a carbon-based economy to the sustainable production of hydrocarbon fuels in the framework of bioeconomy.

Fatty Acid Photodecarboxylases (FAP) Domain Annotation and Structure
In order to identify putative FAP genes in genomic and metagenomic databases, we started by reviewing the domain annotation of the best characterized FAP protein, Chlorella variabilis FAP (CvFAP), and its crystallographic structure (Figure 1). The 654 amino acid-long protein is annotated in the Pfam protein families database [87] as having two domains, GMC_oxred_N (PF00732, residues 84 to 383) and GMC_oxred_C (PF05199, residues 492 to 632; Figure 1a), corresponding to the N- and C-terminal parts of glucose-methanol-choline oxidoreductases, respectively. Whereas definitions of protein domains often postulate a compact structure that folds relatively independently of the rest of the protein [88-90], this is not the case for GMC_oxred_N and GMC_oxred_C, as the respective parts of the protein are interspersed and clearly cannot function independently of each other (Figure 1b). Moreover, the FAP substrate, the fatty acid (modeled as a palmitic acid in the structure), is largely coordinated by the residues not belonging to either GMC_oxred_N or GMC_oxred_C (residues 383 to 492). This observation shows that FAP and GMC proteins need to be analyzed as a whole and not as two separate domains.
Figure 1. (a) [...] [87]. (b) Three-dimensional structure of CvFAP [77]. Amino acids corresponding to GMC_oxred_N are colored teal and amino acids corresponding to GMC_oxred_C are colored salmon. The cofactor FAD is colored magenta and the fatty acid (FA) substrate is shown in green. GMC_oxred_N and GMC_oxred_C are interspersed in the structure, and FA is harbored by part of the structure not assigned to any of the two domains (residues 383-492). (c) Structure of the CvFAP active site. All of the protein backbone is colored in yellow. Cys432 and Arg451 are conserved catalytic residues [77-79]. Asn575 and Ala576 are situated atop the FAD cofactor. Gln620 coordinates Arg451.
Our next goal was to determine the residues that could discriminate FAP from other GMC proteins. Residues Cys432 and Arg451 have been shown previously to be critical for catalysis, whereas Tyr466 is important, but can be mutated to phenylalanine without the loss of activity [77,79]. We also want to highlight the residue Ala576, situated atop the FAD cofactor isoalloxazine ring, which, as we will show below, is also characteristic for FAPs, and belongs to the GMC_oxred_C domain, whereas Cys432 and Arg451 are not annotated as belonging either to GMC_oxred_N or GMC_oxred_C.
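The domain assignments discussed above can be checked with a few lines of code. The following is a minimal sketch, assuming only the Pfam boundaries quoted in the text (1-based, inclusive); the function and variable names are hypothetical and introduced purely for illustration.

```python
# Pfam domain boundaries for CvFAP as quoted in the text (1-based, inclusive).
PFAM_DOMAINS = {
    "GMC_oxred_N": (84, 383),
    "GMC_oxred_C": (492, 632),
}

# Residues highlighted in the text, keyed by label (CvFAP numbering).
KEY_RESIDUES = {"Cys432": 432, "Arg451": 451, "Tyr466": 466, "Ala576": 576}


def domains_for_residue(position, domains=PFAM_DOMAINS):
    """Return the names of all annotated domains that contain `position`."""
    return [
        name
        for name, (start, end) in domains.items()
        if start <= position <= end
    ]


for label, position in KEY_RESIDUES.items():
    hits = domains_for_residue(position)
    where = ", ".join(hits) if hits else "outside both annotated domains"
    print(f"{label}: {where}")
```

With these boundaries, Cys432, Arg451 and Tyr466 fall into the unannotated stretch between the two domains, while Ala576 falls within GMC_oxred_C, in line with the statement above.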

Common Features of Known Glucose-Methanol-Choline (GMC) Proteins
Currently, the Pfam database [87] lists experimentally determined structures of 26 different GMC proteins, including CvFAP, available in the protein data bank (PDB). The pyranose 2-oxidase genes, such as that of Trametes multicolor (UniProt ID Q7ZA32, PDB ID 1TT0) are recognized in Pfam as having the GMC_oxred_C domain but not the GMC_oxred_N domain, despite encoding proteins highly similar to other GMC family members. Crystallographic structures are available for all but four currently known enzyme classes (choline dehydrogenases, fructose dehydrogenases, compound K oxidases and hydroxy fatty acid oxidases). For some of the proteins, more than one structure has been determined, and for some of the enzyme classes, more than one protein has been characterized structurally. Thus, for further analysis, we selected one representative structure for each of the enzyme classes (listed in Table 2), preferably the one with the highest-resolution structure showing the interactions of the enzyme with its substrate.
Overall, the structures reveal a well-conserved fold (Figure 2a) and a similar mode of binding of the cofactor FAD. In some cases, FAD can be covalently bound, for example by forming a C8α-His covalent bond with the protein. The substrate-binding pocket is somewhat more variable compared to the rest of the protein.
Closer analysis of the active sites revealed that all of the surveyed proteins have a histidine amino acid atop the FAD cofactor isoalloxazine ring (Figure 2b). Whereas other histidines are often observed in the active sites of some of the proteins but are not strictly conserved, this particular histidine is absolutely conserved among the analyzed structures (shown in blue in Figure 2b). We note that previous studies suggest that this histidine serves as a catalytic base in POx, GOx, AAOx, CDH and PNOx, whereas in ChOx, COx, and AOx, the histidine is conserved, but its role is less clear (reviewed by Wongnate and Chaiyen [106]).

Ligand-Binding Pockets of GMC Family Proteins
Understanding the conserved and variable features of the GMC proteins is required for delineating the differences between the FAPs and other family members. The catalytic properties of an enzyme are defined by the active site amino acids and their geometry. Active site arrangements in representative crystallographic structures of GMC proteins are shown in Figure 3.
The structures reveal several cases where the FAD molecule is covalently bound to the protein or is autocatalytically modified. In pyranose dehydrogenase from Agaricus meleagris, FAD is modified by a covalent mono- or di-atomic species at the C(4a) position (Figure 3f, [96]). In formate oxidase of Aspergillus oryzae, FAD is formylated at the C8α position (Figure 3n, [104]); the modification is autocatalytic and enhances the enzyme activity [107]. Finally, in some GMC family members FAD is covalently attached to the protein via C8α-His bonds (Figure 3a,i,m), similar to VAO-type proteins [31].
All depicted active sites feature a conserved histidine amino acid close to the isoalloxazine ring of FAD. The histidine is surrounded by polar amino acids. The substrates approach FAD from the same direction and are also coordinated by polar amino acids. Overall, while the protein backbone structure is conserved (Figure 2b), the side chains and their positions do vary significantly. Thus, the conserved histidine amino acid is the only truly common characteristic amino acid of non-FAP GMC family members.

Phylogenetic Analysis of GMC Proteins
Having established the main features of FAPs and other GMC oxidoreductases, we performed a phylogenetic analysis of publicly available sequences. To obtain full coverage of the GMC family, we assembled several sets of sequences (Table 3), which were then clustered and analyzed. First, we retrieved the representative GMC_oxred_N and GMC_oxred_C sequences (seed sequences) from Pfam [87]. These sequences were used to perform PSI-BLAST searches against the non-redundant set of sequences in the NCBI database. In total, 147,949 GMC_oxred_N-containing sequences and 150,593 GMC_oxred_C-containing sequences were identified. Of these, 135,174 sequences contained both domains, and 163,368 sequences contained at least one of the two domains. The latter set is the most extensive one and should contain all potential proteins of interest. We note that the Pfam [87] and InterPro [108] databases contained around 30,000 and 100,000 GMC sequences, respectively, at the time of writing. Whereas the obtained set of sequences did not contain duplicates, some of the sequences were highly similar. Analysis of this number of sequences is computationally prohibitive. Consequently, we clustered the sequences so that the clusters of interest could then be re-analyzed in more detail. Clustering at the level of 40% sequence identity produced 5660 clusters. Representative sequences (centroid sequences) from each cluster were used in downstream analyses.
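The sequence counts quoted above are related by the inclusion-exclusion principle. The short sketch below checks that arithmetic; the helper name `union_size` is hypothetical and the numbers are taken directly from the text.

```python
def union_size(n_first, n_second, n_both):
    """Size of the union of two sets by inclusion-exclusion."""
    return n_first + n_second - n_both


n_with_N = 147_949      # sequences containing GMC_oxred_N
n_with_C = 150_593      # sequences containing GMC_oxred_C
n_with_both = 135_174   # sequences containing both domains

n_with_either = union_size(n_with_N, n_with_C, n_with_both)
n_only_N = n_with_N - n_with_both
n_only_C = n_with_C - n_with_both

print(n_with_either)  # 163368, matching the reported total
print(n_only_N)       # 12775 sequences with GMC_oxred_N only
print(n_only_C)       # 15419 sequences with GMC_oxred_C only
```

The counts are therefore internally consistent: roughly 83% of the retrieved sequences carry both Pfam domains.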
Besides the representative sequences of those found in the NCBI database (labeled B1 in Table 3), we also wanted to explicitly include several other sets of sequences in the analysis (Table 3). The set B2 included all of the sequences from the clusters where the representative sequence belonged to algae, was similar to CvFAP, and had an alanine in place of the conserved histidine. The set B3 included the top 500 PSI-BLAST NCBI hits obtained using the putative photodecarboxylase sequences from B2. The sets B4 and B5 were the sequences reported by Sorigue et al. [77] and Moulin et al. [78], respectively. The set B6 included the putative FAP sequences from the organisms Tetrabaena socialis, Chloropicon primus, Porphyridium purpureum, Haematococcus lacustris and Fragilaria radians that were identified in NCBI using BLAST searches and were not previously reported [77,78]. The set B7 contained the metagenomic sequences from Tara oceans [109,110] identified using the MMseqs2 webserver [111,112]. Finally, the set B8 included the sequences of GMC proteins whose crystallographic structures have been determined, as listed in the Pfam database [87]. All sequences were put into the joint dataset B_unique; if duplicate sequences were retrieved from different sources, the corresponding records were merged. The information about the sequences in the datasets A6 and B_unique may be found in Supplementary Datasets 1, 2 and 3.
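The merging of duplicate records into a joint dataset can be illustrated as follows. This is a minimal sketch, not the procedure actually used: it assumes that "duplicate" means an identical amino acid string, and the function name, record layout and toy accessions are all hypothetical.

```python
from collections import defaultdict


def merge_duplicate_records(records):
    """Merge records that share the same sequence.

    `records` is an iterable of (set_label, accession, sequence) tuples.
    Returns a dict mapping each unique sequence to the sorted accessions
    and source sets in which it was found.
    """
    merged = defaultdict(lambda: {"accessions": set(), "sets": set()})
    for set_label, accession, sequence in records:
        key = sequence.upper()  # assumption: identity of the residue string
        merged[key]["accessions"].add(accession)
        merged[key]["sets"].add(set_label)
    return {
        seq: {
            "accessions": sorted(info["accessions"]),
            "sets": sorted(info["sets"]),
        }
        for seq, info in merged.items()
    }


# Toy input: the accessions and sequences are placeholders, not real data.
toy_records = [
    ("B4", "toy_acc_1", "MKTAYIAK"),
    ("B5", "toy_acc_2", "MKTAYIAK"),
    ("B6", "toy_acc_3", "MKTAYLAK"),
]
for seq, info in merge_duplicate_records(toy_records).items():
    print(seq, info["sets"], info["accessions"])
```

Keeping the list of source sets with each merged record preserves the provenance of a sequence that was retrieved by more than one search.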
Next, we prepared a phylogenetic tree showing the relations between the sequences from B_unique (Figure 4; the data may be found in Supplementary Datasets 4 and 5). The tree reveals a number of distinct sequence clusters. Some of them contain sequences that have been characterized experimentally, whereas others do not. Whereas GOx and GDH group together, other enzymes acting on the same substrates (POx and PDH, COx and CHDH) belong to different branches. We have also prepared a similar tree where the sequences are marked according to their host organism (Supplementary Figure S1); sequences from similar sources are often grouped together.
Overall, the phylogenetic tree shows that FAPs form a distinct cluster of sequences. Whereas taxonomic classification for metagenomic sequences is lacking, all of the genomic sequences in the cluster belong to algae. Out of the whole set B_unique, only a few other sequences lacked the conserved histidine. However, the ones where it was replaced with an alanine appear to be unrelated to FAPs: ODM93031.1 from the hexapod Orchesella cincta and PVI01917.1 from the fungus Periconia macrospinosa lack the amino acids homologous to Cys432 and Arg451; WP_106347998.1 from the actinobacterium Antricoccus suffuscus, while having the cysteine, has a deleted loop nearby, and lacks the arginine.
The non-algal genes that have the highest similarity to FAPs are found in bacterial and archaeal genomes, such as the genes with unknown function from Fischerella thermalis (WP_102185191.1) and Haloterrigena thermotolerans (WP_006648055.1). The respective proteins harbor the conserved histidine amino acid and lack the cysteine and arginine amino acids crucial for FAP function, so they are unlikely to be FAPs.
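The discrimination criteria described in this section can be summarized as a simple rule on three alignment positions. The sketch below is a deliberately strict illustration under stated assumptions: it presumes that each candidate has already been aligned to CvFAP so that the residues at positions 432, 451 and 576 are known, and the function name, return labels and toy inputs are hypothetical rather than part of the original analysis.

```python
def classify_gmc_candidate(residues):
    """Classify a GMC-like sequence from residues aligned to CvFAP positions.

    `residues` maps a CvFAP position (432, 451, 576) to the one-letter
    amino acid aligned at that position, or "-" for a gap or deletion.
    """
    at_432 = residues.get(432, "-")
    at_451 = residues.get(451, "-")
    at_576 = residues.get(576, "-")

    # FAP-like signature: Ala replaces the conserved His, and the
    # catalytic Cys and Arg are both present.
    if at_576 == "A" and at_432 == "C" and at_451 == "R":
        return "putative FAP"
    # Typical GMC signature: the conserved active-site histidine.
    if at_576 == "H":
        return "typical GMC (conserved His)"
    return "atypical GMC, not FAP-like"


# Toy inputs; residues other than those named in the text are placeholders.
examples = {
    "CvFAP-like": {432: "C", 451: "R", 576: "A"},
    "His present, Cys/Arg absent": {432: "G", 451: "L", 576: "H"},
    "Ala576 and Cys432, Arg451 missing": {432: "C", 451: "-", 576: "A"},
}
for name, residues in examples.items():
    print(name, "->", classify_gmc_candidate(residues))
```

A strict rule of this kind would reject natural variants such as a serine at position 432, so in practice borderline sequences are better inspected individually than filtered automatically.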
Whereas fungal proteins have been previously studied in detail [38], the examples of unusual GMC proteins in insects led us to study them in more detail. We ran PSI-BLAST searches [113] against the sequences from the genomes of representative species. Surprisingly, we found that many of them possess multiple genes encoding GMC proteins. The Apis mellifera (honeybee) genome encodes around 20 GMC proteins; Musca domestica (house fly) and Anopheles gambiae (African malaria mosquito) around 30; Papilio xuthus (Asian swallowtail butterfly) around 40; Blattella germanica (German cockroach), Plutella xylostella (diamondback moth) and Lygus hesperus (western plant bug) more than 50; Photinus pyralis (common eastern firefly) more than 70. The genome of the soil-dwelling hexapod Folsomia candida encodes more than 150. Although not an insect, the Pacific oyster Crassostrea gigas contains 38 GMC genes in its genome. The common feature between these organisms may be the requirement for significant detoxification capabilities, which is partially fulfilled through the expansion of the GMC family proteins, some of which develop unusual catalytic mechanisms not reliant on the conserved histidine amino acid.

Phylogenetic Analysis of Putative FAP Proteins
As a next step, we analyzed the FAP branch of the overall phylogenetic tree. We built a smaller phylogenetic tree (Figure 5; the data may be found in Supplementary Datasets 6 and 7), now containing only the putative FAP proteins and three outgroup sequences (choline oxidase, pyridoxine 4-oxidase and 5-(hydroxymethyl)furfural oxidase). This tree reveals several sequence clusters (#1-#9), which roughly correspond to evolutionary relations between the host organisms [78].
The most prominent cluster (#1) contains several genomic sequences and is centered around the proteins from Chlorella variabilis and Chlamydomonas reinhardtii. Metagenomic sequences 1, 3 and 7 from the dataset B7 are progressively less similar to the sequences from cluster #1. Cluster #2 contains proteins from Nannochloropsis gaditana, Nannochloropsis salina and Ectocarpus siliculosus, but not, surprisingly, the other heterokonts Aureococcus anophagefferens (outside of recognizable clusters) or Bacillariophyta such as Fragilaria radians, Fragilariopsis cylindrus, Phaeodactylum tricornutum or Pseudo-nitzschia multistriata (cluster #7). Cluster #3 contains sequences from Emiliania huxleyi (Ehu1, XP_005785285.1) and Chrysochromulina tobinii, whereas the second putative FAP from Emiliania huxleyi (Ehu2, XP_005757666.1) clusters with sequence 7 from the dataset B7 in between clusters #1 and #2. Interestingly, clusters #4-#6 feature multiple metagenomic sequences identified by Moulin et al. [78] but lack any currently available genomic sequences. Two metagenomic sequences from cluster #6, 52837172 and 97429747, are annotated as belonging to the Dinophyta Neoceratium fusus and Heterocapsa, respectively, by Moulin et al. [78]. Cluster #7 contains Bacillariophyta proteins as well as a number of metagenomic sequences. Cluster #8 contains Rhodophyta proteins. Finally, cluster #9 contains a number of metagenomic sequences and a single genomic one, that from the recently sequenced tiny marine green alga, Chlorophyta Chloropicon primus [114]. We note that the metagenomic sequences retrieved by us and by Moulin et al. [78] (sets B6 and B7) complement each other: whereas some are highly similar between the two sets, others are found in only one of the two sources.

Natural Diversity of FAP Active Sites
Having assembled a dataset of putative FAP sequences, we set out to analyze the diversity of the active site and other residues in FAPs. We calculated the frequencies of observing a particular amino acid at a particular position in the multiple sequence alignment of FAP sequences (Supplementary Datasets 6 and 7). Some genomic sequences, such as the Phaeodactylum tricornutum sequence with the GenBank ID XP_002178042.1, and most, if not all, metagenomic sequences are partial. Still, these sequences provide important information and were included in the calculations. The results for the active site and fatty acid-binding pocket are shown in Figures 6-8. The multiple sequence alignment is provided as Supplementary Dataset 8. Frequencies of different amino acids are provided as Supplementary Dataset 9. The sequence logo of FAP sequences with residue numbers corresponding to CvFAP is provided as Supplementary Figure S2.
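The per-position frequency calculation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the script used for the analysis: the alignment is given as equal-length strings, gaps are written as "-", and gap characters (which are frequent in partial sequences) are excluded from the denominator. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_frequencies(alignment, column, ignore_gaps=True):
    """Return amino acid frequencies in one column of an alignment.

    `alignment` is a list of equal-length aligned sequences and `column`
    is a 0-based column index. With `ignore_gaps=True`, sequences that
    have a gap in this column do not contribute to the frequencies.
    """
    residues = [seq[column] for seq in alignment]
    if ignore_gaps:
        residues = [res for res in residues if res != "-"]
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {res: count / total for res, count in counts.items()}


# Toy alignment of four sequences and three columns (placeholder data).
toy_alignment = ["ACY", "ACF", "A-F", "ACL"]
for col in range(3):
    print(col, column_frequencies(toy_alignment, col))
```

Mapping alignment columns back to CvFAP residue numbers requires an additional step (counting the non-gap characters of the CvFAP row up to each column), which is omitted here for brevity.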
Catalysts 2020, 10, x FOR PEER REVIEW  14 of 22 with a proline (4.5% of sequences) or an alanine (2.7%). We note that mutation of Gly462 to a bulkier amino acid, in particular tyrosine, has allowed efficient kinetic resolution of α-functionalized carboxylic acids [84]. The other mutations tested in that work, A384K/Q/F/Y and L386K/Q/F/Y, had lower yields or selectivity [84]; Ala384 contacts His572 and Gln486, but not the fatty acid.  Finally, the amino acids surrounding the acyl moiety of the fatty acid are the most variable ( Figure 8). Whereas the acyl moiety is clearly hydrophobic, the surrounding amino acids are often polar. This could likely help in weakening the interactions between the products (alkanes and alkenes) and the enzyme and thus raising the turnaround of the reaction. with a proline (4.5% of sequences) or an alanine (2.7%). We note that mutation of Gly462 to a bulkier amino acid, in particular tyrosine, has allowed efficient kinetic resolution of α-functionalized carboxylic acids [84]. The other mutations tested in that work, A384K/Q/F/Y and L386K/Q/F/Y, had lower yields or selectivity [84]; Ala384 contacts His572 and Gln486, but not the fatty acid.  Finally, the amino acids surrounding the acyl moiety of the fatty acid are the most variable ( Figure 8). Whereas the acyl moiety is clearly hydrophobic, the surrounding amino acids are often polar. This could likely help in weakening the interactions between the products (alkanes and alkenes) and the enzyme and thus raising the turnaround of the reaction.

Structural Analysis
Representative structures of the proteins listed in Pfam [87] as harboring the GMC_oxred_N (PF00732) and GMC_oxred_C (PF05199) domains have been analyzed. The analyzed structures and their details are listed in Table 2. The structures were downloaded from PDB [115] and visualized using PyMOL [116].

Sequence Analysis
The sets of sequences mentioned and analyzed in this work are listed in Table 3. The seed Pfam sequences for GMC_oxred_N (PF00732) and GMC_oxred_C (PF05199) domains have been accessed on 12.04.2020 (Pfam version 32.0). The seed sequences were used to perform PSI-BLAST searches [113] against the non-redundant set of sequences in the NCBI database (all non-redundant GenBank coding sequence translations, PDB, SwissProt, PIR and PRF excluding environmental samples from whole genome sequencing projects), also on 12.04.2020. The cutoff E-value was chosen at 0.001 and 0.003 for GMC_oxred_N and GMC_oxred_C, respectively. The results of the PSI-BLAST searches were combined into a single set of sequences. Full-length sequences were clustered using UCLUST [117] at the 40% identity level. Multiple sequence alignment of the sequences in the dataset comprising the centroid sequences from clustering as well as other sequences of interest has been performed using the MAFFT FFT-NS-2 algorithm [118]. Multiple sequence alignment of the putative FAP sequences has been performed using the MAFFT L-INS-i algorithm [119]. For both tasks, MAFFT v. 7.402 [120] was used. In both alignments, columns containing more than 50% gaps were removed using trimAl [121]. The phylogenetic tree of representative GMC sequences was calculated using FastTree2 [122]. JTT+CAT amino acid substitution model was used, with 20 rate categories, 10 rounds Variation of the amino acids surrounding the isoalloxazine ring of FAD is shown in Figure 6. Besides the absolutely conserved Ala576, which discriminates FAPs from related GMC proteins, two other proximal amino acids are strictly conserved as well and may be important for catalysis: Asn575 and Gln620. Leu173 is also very well conserved. Surprisingly, some other amino acids in direct contact with FAD are quite variable. 
Ala158 is most often replaced with a cysteine, which potentially could form a covalent bond with the C8α atom of FAD, as observed in other flavoproteins [31]. Thr169 is often replaced with isoleucine or leucine; no histidines that could potentially form a covalent bond with the C8α atom of FAD are observed at this position in FAPs, unlike in other GMC proteins (Figure 3a,i,m). Finally, Ala171, which is close both to FAD and the carboxylate moiety of the fatty acid, is very often replaced with a bulkier valine.
Variation of the amino acids surrounding the carboxylate moiety of the fatty acid is shown in Figure 7. As expected, the catalytic amino acids Cys432 and Arg451 [77,79] are well conserved. The only observed sequence with the C432S substitution is of metagenomic origin; the same mutation in CvFAP renders it inactive [79]. Other conserved amino acids in this region are Gln486 (99.5% conserved) and His572 (97% conserved, and replaced with a cysteine in the rest of the sequences). Notably, Tyr466, which could potentially participate in the photodecarboxylation reaction [77,79], is replaced with a phenylalanine in 33% of sequences and with a leucine in 18% of sequences; this corresponds well to the observation that the Y466F variant of CvFAP is still active [79]. The hydrophobic amino acids in this region, Leu386, Val453 and Val463, are relatively variable. Interestingly, Gly431 is sometimes replaced with an alanine (17% of sequences) or even with a bulkier methionine or leucine (5% of sequences in total). On the other hand, Gly462 is only rarely replaced with a proline (4.5% of sequences) or an alanine (2.7%). We note that mutation of Gly462 to a bulkier amino acid, in particular tyrosine, has allowed efficient kinetic resolution of α-functionalized carboxylic acids [84]. The other mutations tested in that work, A384K/Q/F/Y and L386K/Q/F/Y, had lower yields or selectivity [84]; Ala384 contacts His572 and Gln486, but not the fatty acid.
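Conservation percentages of the kind quoted above can be obtained by counting the residues in a single alignment column. The following is a minimal sketch of that calculation; the function name and the toy column are hypothetical, and whether gaps were counted in the reported percentages is not stated here, so the sketch exposes this choice as a parameter.

```python
from collections import Counter


def column_frequencies(sequences, column, ignore_gaps=True):
    """Return residue frequencies (in percent) for one alignment column.

    The result is ordered from the most to the least frequent residue.
    """
    residues = [sequence[column] for sequence in sequences]
    if ignore_gaps:
        residues = [residue for residue in residues if residue != "-"]
    counts = Counter(residues)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: 100.0 * n / total for residue, n in counts.most_common()}


# Toy single-column "alignment" with invented data: ten sequences.
toy_column = ["Y", "Y", "Y", "Y", "Y", "F", "F", "F", "L", "L"]
frequencies = column_frequencies(toy_column, 0)
for residue, percent in frequencies.items():
    print(f"{residue}: {percent:.1f}%")
```

Applied to each column of the alignment of putative FAP sequences, such counts give the same kind of information that a sequence logo displays graphically.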
Finally, the amino acids surrounding the acyl moiety of the fatty acid are the most variable (Figure 8). Whereas the acyl moiety is clearly hydrophobic, the surrounding amino acids are often polar. This may weaken the interactions between the products (alkanes and alkenes) and the enzyme and thus increase the turnover rate of the reaction.

Structural Analysis
Representative structures of the proteins listed in Pfam [87] as harboring the GMC_oxred_N (PF00732) and GMC_oxred_C (PF05199) domains were analyzed. The analyzed structures and their details are listed in Table 2. The structures were downloaded from PDB [115] and visualized using PyMOL [116].

Sequence Analysis
The sets of sequences mentioned and analyzed in this work are listed in Table 3. The seed Pfam sequences for the GMC_oxred_N (PF00732) and GMC_oxred_C (PF05199) domains were accessed on 12.04.2020 (Pfam version 32.0). The seed sequences were used to perform PSI-BLAST searches [113] against the non-redundant set of sequences in the NCBI database (all non-redundant GenBank coding sequence translations, PDB, SwissProt, PIR and PRF, excluding environmental samples from whole genome sequencing projects), also on 12.04.2020. The cutoff E-value was set to 0.001 for GMC_oxred_N and 0.003 for GMC_oxred_C. The results of the PSI-BLAST searches were combined into a single set of sequences. Full-length sequences were clustered using UCLUST [117] at the 40% identity level. Multiple sequence alignment of the dataset comprising the centroid sequences from clustering as well as other sequences of interest was performed using the MAFFT FFT-NS-2 algorithm [118]. Multiple sequence alignment of the putative FAP sequences was performed using the MAFFT L-INS-i algorithm [119]. For both tasks, MAFFT v. 7.402 [120] was used. In both alignments, columns containing more than 50% gaps were removed using trimAl [121]. The phylogenetic tree of representative GMC sequences was calculated using FastTree2 [122]. The JTT+CAT amino acid substitution model was used, with 20 rate categories, 10 rounds of nearest-neighbor interchanges, 2 rounds of optimization, 10 rounds of subtree-prune-regraft moves and a maximum move length of 10. The phylogenetic tree of putative FAP sequences was calculated using RAxML v. 8.2.12 [123]. The CAT amino acid substitution model was used, with the Dayhoff substitution matrix and 25 rate categories. The calculations were performed on the Cyberinfrastructure for Phylogenetic Research (CIPRES) portal [124]. Illustrations of the phylogenetic trees were prepared using FigTree v. 1.4.4 [125].
The sequence logo for FAP proteins was prepared using WebLogo with equiprobable reference amino acid composition [126]. Default parameter sets were used for all the algorithms unless stated otherwise.
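The gap-filtering step described above (removal of alignment columns containing more than 50% gaps) can be sketched in a few lines. This is an illustration of the criterion only and not the trimAl implementation; the function name and the toy alignment are hypothetical.

```python
def drop_gappy_columns(sequences, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds max_gap_fraction.

    Columns with a gap fraction exactly equal to the threshold are kept,
    matching the wording "more than 50% gaps".
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("all aligned sequences must have the same length")
    n_sequences = len(sequences)
    kept_columns = [
        column
        for column in range(length)
        if sum(sequence[column] == "-" for sequence in sequences) / n_sequences
        <= max_gap_fraction
    ]
    return ["".join(sequence[column] for column in kept_columns) for sequence in sequences]


# Toy alignment with invented sequences: the third column has 75% gaps and is
# removed, whereas the second column has exactly 50% gaps and is retained.
toy_alignment = ["AC-G", "A--G", "AT-G", "A-CG"]
trimmed = drop_gappy_columns(toy_alignment)
print(trimmed)
```

Note that residue numbering in a trimmed alignment no longer matches the untrimmed one, so positions such as Ala576 have to be mapped to columns after trimming, or the kept column indices have to be recorded.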

Conclusions
GMC proteins are a large family of FAD-binding enzymes with great biotechnological potential. While some of them are characterized in detail and successfully employed in biotechnology, our work shows that many more may be found in nature. Some groups of putative GMC oxidoreductases are particularly interesting because they have no characterized representatives and, lacking the established catalytic residues, may even employ new catalytic mechanisms.
One recently discovered subfamily, the FAPs, represents a fascinating type of enzyme that uses the energy of light for catalysis. We show that genomic and metagenomic databases harbor more than 200 putative FAP genes, which may encode enzymes that are more stable, more efficient, or have different substrate specificities compared to the already characterized FAPs. The data on sequence variation can also be used to engineer new FAPs with enhanced properties or to guide the selection of new (not yet characterized) FAPs for application. Overall, our study may help in the development of environmentally friendly biocatalytic processes and foster the transition from a fossil carbon-based economy to the sustainable production of hydrocarbon fuels in the framework of the growing global bioeconomy.