A Glance into MTHFR Deficiency at a Molecular Level

MTHFR deficiency still deserves an investigation to associate the phenotype to protein structure variations. To this aim, considering the MTHFR wild type protein structure, with a catalytic and a regulatory domain and taking advantage of state-of-the-art computational tools, we explore the properties of 72 missense variations known to be disease associated. By computing the thermodynamic ΔΔG change according to a consensus method that we recently introduced, we find that 61% of the disease-related variations destabilize the protein, are present both in the catalytic and regulatory domain and correspond to known biochemical deficiencies. The propensity of solvent accessible residues to be involved in protein-protein interaction sites indicates that most of the interacting residues are located in the regulatory domain, and that only three of them, located at the interface of the functional protein homodimer, are both disease-related and destabilizing. Finally, we compute the protein architecture with Hidden Markov Models, one from Pfam for the catalytic domain and the second computed in house for the regulatory domain. We show that patterns of disease-associated, physicochemical variation types, both in the catalytic and regulatory domains, are unique for the MTHFR deficiency when mapped into the protein architecture.


Introduction
The one-carbon metabolism cycle, including the folate and methionine cycles, is a critical pathway for cell survival. The human enzyme methylenetetrahydrofolate reductase (encoded by the gene MTHFR, UniProt code: P42898)) exchanges one-carbon unit from the folate to methionine cycle. This is exclusively used for methionine and S-adenosylmethionine (SAM) synthesis, and MTHFR is the rate-limiting enzyme in the methyl cycle, undergoing allosteric inhibition by its end product SAM (S-Adenosil-Methionine) [1][2][3][4]. The protein is functional in its homodimeric form [5].
MTHFR catalyzes the conversion of 5,10-methylenetetrahydrofolate to 5-methyltetrahydrofolate, a co-substrate for homocysteine re-methylation to methionine (EC number: 1.5.1.20). The recent release of its structure (PDB code: 6FCX, 0.25 nm resolution) highlights the organization of the protein into two flexible domains, one catalytic and one regulatory, with a connecting linker allowing domain-domain interactions, possibly due to a phosphorylation cascade. The structure clarifies the molecular mechanism of the reaction, which requires FAD as a cofactor, NAD(P)H to provide reducing equivalents and homodimerisation for allosteric regulation upon SAM binding at the regulatory domain [6]. The 36-residue N-terminal portion is not resolved in the available PDB file and MobiDB predicts only here a flexible region (https://mobidb.bio.unipd.it/P42898 accessed on 10 October 2021). The PDB contains the homodimeric protein organization.
MTHFR deficiency and upregulation result in various disease states, which have been extensively described in relation to a number of variants characterized in many studies. 109 MTHFR mutations have been reported in 171 families, including 70 missense mutations, 17 that primarily affect splicing, 11 nonsense mutations, seven small deletions, two no-stop mutations, one small duplication, and one large duplication [7]. Two other variants, A222V and E429A, distributed worldwide in the population, are characterized by a reduced enzymatic activity, and are associated to different risk factors [8,9]. Variations, reducing the MTHFR activity to different extents, result in hyperhomocysteinemia and varying severities of disease, including ischemic stroke, folate sensitive neural tube defects and schizophrenia [1]. Evidently, the protein is also an attractive drug target [10]. All known missense variations are distributed in the three-dimensionally resolved catalytic and regulatory domains.
In this study, we are interested in exploiting with computational tools the structural properties of the protein missense variations associated to the disease to highlight possible mechanisms of protein destabilization due to residue change. To this aim we first map disease variations on the protein structure in relation to their solvent accessibility and compute for the accessible variations their likelihood of being involved in protein-protein interactions. We also compute the Gibbs free energy change (∆∆G) for each variation with a consensus method and find that positions 387 (G387D), 506 (Y506D) and 628 (L628T) of the protein homodimeric interface at the level of the two regulatory domains, besides being correctly predicted as interaction sites, are also destabilizing the protein homodimer. This corroborates the relevance of the interaction of the two regulatory domains for the stability of the functional protein. We then grouped all the disease-related variations according to their physicochemical types and mapped them into the computed HMM modelled architecture of the protein. By this we establish a link among protein domains and variation types, which is a unique marker of the MTHFR deficiency.

Results
Application of state-of-the art tools for functional annotation of a protein is common routine in the field of computational biology. Here, having the solved structure of the MTHRF gene, we aim at highlighting possible structural properties of the missense variations associated to the deficiency, and most cases associated to a decreased biochemical efficiency. Our goal is to relate computational properties, such as solvent exposure, being an interaction site and promote protein instability, to their annotation of being diseaseassociated. This highlights some interesting properties of the disease related variations and in turn benchmarks tools in the difficult task of their prediction.

MTHFR and Protein-Protein Interactions
Large scale experiments of interactomics indicate that human MTHFR interacts with many specific partners. BioGRID (https://thebiogrid.org/ accessed on 10 October 2021), the Database of Protein, Genetic and Chemical Interactions, lists 33 physical interactors, 22 of which are also present in IntAct (https://www.ebi.ac.uk/intact/ accessed on 10 October 2021), the other molecular data base collecting data from large scale experiments. It is worth noticing that none of the enzymes involved in the folate and methionine cycles are present among the physical interactors, and that many membrane and nuclear proteins are in the interacting protein pool. Why this is so perhaps deserves more experiments, and it can be interpreted considering the presence of MTHFR in different cell compartments, including its putative interaction with mitochondria, endoplasmic reticulum, and the nucleus [1]. For the time being, we can compute the likelihood of solvent-exposed residues to be in contact with a putative partner. We adopt our ISPRED4 predictor [11], which is based on machine learning, and it is specifically suited to compute the likelihood of an exposed residue to be involved in a protein contact. We compute 44 interacting sites in the protein structure (see Supplementary Material Table S1 for details), 17 of which are at the interface between the two regulatory domains of the homodimer. Many of the interactions reported in the databases are likely to be non-obligate and therefore different interactions can involve the same sites, in different compartments and phases of the cell lifespan. This can be considered the reason why the number of interaction sites as derived from a structure rarely coincides with the number of interactors as derived from large scale experiments. In the following, our interest is on disease-related variations which are in interaction sites and whose functional annotations are already documented (Table 1).
In Figure 1, predicted contacts are represented with hard spheres centered on the C-alpha atom of the specific residue. The color code follows the organization of the protein in the catalytic (yellow) and regulatory (pale blue) domain, inclusive of the linker region [6]. The bound FAD and SAH molecules, present in the protein crystal (6FCX), are also shown for clarity, and their binding sites are relevant for protein catalytic activity. The catalytic and regulatory domains are depicted in yellow and pale blue, respectively. Interaction sites are represented with hard spheres centered on the C-alpha atom of the specific residue. Grey spheres are residues in the homodimeric interface, correctly predicted as interaction sites.
Interestingly enough, we correctly predict the interface region of the homodimer (hard spheres in grey). Other predicted PPI sites are distributed in different regions of the protein surface. These residues are candidates for taking part in the interaction with the 33 proteins reported in the IntAct and BioGRID databases. Only in position 387 (G387D), 506 (Y506D) and 628 (L628T) of the homodimeric interface at the level of the regulatory domain do the predicted interaction sites coincide with missense variations associated with the MTHFR disease. These variations are also predicted as destabilizing (see below). This observation finally highlights the role of the regulatory domain interactions not only in being part of the protein functional stability, but also in playing a role in the disease [5].

MTHFR and Protein Stability
We can investigate whether disease-related missense variations are related to protein instability. To this aim, we adopt a consensus method, computing (with three state-of-theart methods) the Gibbs free energy change (∆∆G) associated with a specific variation in the protein. We select a consensus method, given the variability of the different methods in predicting the ∆∆G values [12], and adopt three of the art methods: INPS-MD [13] is based on machine learning, FoldX [14] on statistical potentials, and PoPMuSiC2 [15] on statistical potentials and machine learning.  [7]. * Variations are described in [8] (A222V) and [9] (E229), respectively. Effects of the variations on the MTHFR enzymatic activity are listed when reported. Bold style indicates variations for which at least two of the three methods adopted for computing ∆∆G (INPS3D, [13]; FoldX [14]; and PoPMuSiC2 [15], compute negative results, lower than −1 kcal/mol, indicating protein destabilization (for details see text). For completeness, we include results (I, Interaction; N, No Interaction) of the Interaction site prediction method (ISPRED4) [11] and values of the relative solvent accessibility (RSA%) (see Materials and Methods for details) (second to last column and right-most column, respectively).
We select as a threshold value |1 kcal/mol|, which takes into account the variability of the experimental thermodynamic data on protein stability adopted for training the predictors. In Table 1, we list the 42 disease-related variations in the catalytic domain and the 30 disease related variations in the regulatory domain. Alongside this, we indicate the corresponding effects on the protein function, the computed ∆∆G according to the three predictors, the prediction of the wild-type residue to be in contact or not and the computed relative solvent accessibility [16]. It appears that 22 variations in the catalytic domain and 20 variations in the regulatory domain decrease protein stability, according to at least two of the three predictors. Among the remaining ones, seven are in the NAD(P)H binding site, and the other seven are in the FAD binding site, respectively. These variations, as reported in Table 1, decrease the binding affinity without perturbing the protein stability, including A222V and E429A. In any case, the available structure 6FCX contains an A in position 429 instead of E, and in this case, we compute ∆∆G of the reverse variation [12]. R325C, in the substrate-binding site, decreases substrate affinity without affecting protein stability.
In the protein catalytic domain, we map 41 disease related variations, 14 of which are exposed and not in interaction sites, 20 are destabilizing and mostly (90%) buried. In the protein regulatory domain, we map the remaining 31 disease related variations, 12 of which are exposed, 21 are destabilizing and 15 of these are buried. Interestingly, positions 387 (G387D), 506 (Y506D) and 628 (L628T) at the protein homodimeric interface, correctly predicted as interaction sites (see above), promote also protein destabilization, supporting a role of the regulatory domain interactions in the stability of the functional protein homodimeric complex. Overall, 61% of the disease related variations are affecting the protein stability and most of them have been experimentally found to promote instability of cofactor binding.
Out of the pool of the MDRFH deficiency variations listed in Table 1, UniProt in the protein file P42898 lists eight other variations with a dbSNP code (https://www.ncbi.nlm.n ih.gov/snp/ accessed on 10 October 2021) not yet associated to disease (likely benign?). Six of these maps into the protein structure and two of them, (G422R, exposed, and G566E, at the homodimer interface) destabilize the protein structure according to our criterion. Results are in line with previous observations highlighting how protein instability is not a necessary condition for being disease-related [17,18], although in this specific case many of the variations are indeed destabilizing the protein organization (Table 1).

MTHFR Deficiency and Its Structural Model
Recently we introduced the concept of mapping disease variation types into associated Pfam structural protein models (https://pfam.xfam.org accessed on 10 October 2021), finding that by this it is possible to establish a relation among genes and maladies [18,19]. Indeed, mapping of variation types into Pfam is unique for a given disease. Here we exploit our strategy with MTHFR, considering Pfam 02219 for the catalytic domain. This Pfam is shared by similar proteins in Eukarya, Bacteria and Archaea. The regulatory domain does not have a Pfam model and is present only in Eukarya. We build a model for the regulatory domain aiming for a structural representation of the complete protein. The model is an HMM of the profile of a multiple alignment of some 50 sequences from Eukarya with a length similar to that of MTHFR. The model includes the linker, and it spans from residue 336 to residue 566 (see Supplementary Material, where the Pfam-like model of the regulatory domain is reported). We then converted the disease related variations of Table 1 into variation types (apolar (G, A, V, I, L, P, M); polar (S, T, C, N, Q, H); aromatic (F, W, Y); charged (D, E, K, R) giving rise to 16 possible variation types. We associated MTHFR related variation types to the protein architecture, as represented by P02219 and our Pfam-like model of the regulatory domain. The frequency of the variation types in each domain is represented in Figure 2.
It appears that the variational pattern is different in the two domains and different from the background variational pattern, obtained considering the pathogenic variations from Humsavar (https://www.uniprot.org/docs/humsavar accessed on 10 October 2021), in 2513 human proteins (22,763 disease related variations). In variation types, labels are as follows: a, apolar; c, charged; p, polar; and r, aromatic (for details see text). Differences between Catalytic and Regulatory sites are significant at 10% when a Chi-square test is applied after adding pseudocounts (with value 0.5) for regularization.

Characterization of Protein Surface and Annotation of Protein-Protein Interaction Sites
The solvent accessibility of residues of PDB entry 6FCX, chain A, [6] was computed with the DSSP program (https://swift.cmbi.umcn.nl/gv/dssp/DSSP_3.html accessed on 10 October 2021) and normalized with respect to the residue-specific maximal accessibility values as previously described [16]. Residues interacting in the homodimer interface are those undergoing a decrease of the absolute solvent accessibility (ASA) ≥ 1 Å 2 in the complex with respect to the isolated monomer.
Protein-protein interaction sites were predicted with ISPRED4 [11], a tool based on support vector machines and grammatical restrained hidden conditional random fields that integrate 46 different features extracted from the monomer sequence, its multiple sequence alignment against the UniProt database and its 3D structure. ISPRED4 has been trained and cross-validated on 151 protein complexes and reaches a per-residue Matthews correlation coefficient of 0.48 and an overall accuracy of 0.85. Similar values are obtained on blind test sets, and therefore ISPRED4 is one of the top-performing tools for the computational annotation of protein-protein interaction sites.

Prediction of ∆∆G Changes upon Single Residue Variation
The possible effect on protein stability induced by single residue variation starting from protein structure has been predicted with three state-of-the-art methods: (i) INPS3D [13], a tool based on a machine-learning approach; (ii) FoldX [14], that estimated energy changes on the basis of a knowledge-based potential; and (iii) PoPMuSic2 [15], a method implementing a combination of statistical potentials optimized with a neural network. The following convention has been adopted for the definition of the ∆∆G sign: where ∆G wt and ∆G mut are the folding free energy of the wild-type and mutated proteins, respectively. Negative values of ∆∆G mean that the mutated form is less stable than the wild-type. We considered as destabilizing the variations for which at least two methods predict ∆∆G ≤ −1 kcal/mol. Since the structure 6FCX carries the mutant allele (A) in position 429, the thermodynamic effect of variation E429A was estimated by computing the ∆∆G of variation A429E on the crystal and applying the antisymmetric principle.

Pfam-Like Model of the Regulatory Domain
From the UniRef50 cluster UniRef50_P42898 (https://www.uniprot.org/uniref/Uni Ref50_P42898 accessed on 10 October 2021), we collected 150 complete protein sequences from Eukarya and with length ranging between 640 and 670 residues. These sequences cover both domains of human MTHFR protein and share more than 50% sequence identity with it. We aligned the sequences with ClustalOmega (https://www.ebi.ac.uk/Tools/msa/ clustalo/ accessed on 10 October 2021) and extracted the multiple sequence alignment of the regulatory domain, spanning from position 336 to 566 in the human sequence. We then trained a HMM, with HMMER 3.3.2 (http://hmmer.org/ accessed on 10 October 2021). The trained model is available in the Supplementary Materials.

Conclusions
In this paper we exploit different computational methods for refining the annotation of the disease related variants of MTHFR, promoting MTHFR deficiency. Due to the numerous biological processes in which the protein is directly and/or indirectly involved, MTHFR is of particular interest, since its partial or total disfunction may have a range of effects on human health, spanning from mild to lethal ones. Among the known 72 disease related variations, we characterize those that are at the protein surface, participate into proteinprotein contacts and are at the homodimer interface which involves the protein regulatory domain. We also highlight other properties of the protein, like the exposed residues that eventually participate in the protein-protein interaction (Table S1). Furthermore, we show that 61% of the disease related variants are destabilizing the protein, highlighting a possible source of structural destabilization causing the decreased binding affinity of the protein cofactors when documented. Noteworthy is that positions 387 (G387D), 506 (Y506D), and 628 (L628T) in the interface of the two regulatory domains of the homodimeric protein, besides being disease associated, are correctly predicted as interaction sites, and predicted also as destabilizing. This confirms the role of the regulatory domains interaction in supporting the homodimeric functional unit [5].
Finally, we propose a structural variational model for MTHFR deficiency by associating variation types to the protein architecture, as modelled with HMMs representing the catalytic and regulatory domain, respectively.