SARS-CoV-2 Membrane Protein: From Genomic Data to Structural New Insights

Severe Acute Respiratory Syndrome CoronaVirus-2 (SARS-CoV-2) is composed of four structural proteins and several accessory non-structural proteins. SARS-CoV-2’s most abundant structural protein, Membrane (M) protein, has a pivotal role both during viral infection cycle and host interferon antagonism. This is a highly conserved viral protein, thus an interesting and suitable target for drug discovery. In this paper, we explain the structural nature of M protein homodimer. To do so, we developed and applied a detailed and robust in silico workflow to predict M protein dimeric structure, membrane orientation, and interface characterization. Single Nucleotide Polymorphisms (SNPs) in M protein were retrieved from over 1.2 M SARS-CoV-2 genomes and proteins from the Global Initiative on Sharing All Influenza Data (GISAID) database, 91 of which were located at the predicted dimer interface. Among those, we identified SNPs in Variants of Concern (VOC) and Variants of Interest (VOI). Binding free energy differences were evaluated for dimer interfacial SNPs to infer mutant protein stabilities. A few high-prevalent mutated residues were found to be especially relevant in VOC and VOI. This realization may be a game-changer to structure-driven formulation of new therapeutics for SARS-CoV-2.


Introduction
COronaVIrus Disease 2019 (COVID-19) is currently a worldwide pandemic that was first reported in December 2019 in Wuhan, China, and, since then, led to more than 446 M infected people and over 6.00 M deaths [1] (as of 7 March 2022). COVID-19 is caused by Severe Acute Respiratory Syndrome CoronaVirus-2 (SARS-CoV-2), which is a Coronaviridae family, positive single-stranded RiboNucleic Acid (ssRNA) virus [2,3]. Since the beginning of this pandemic, SARS-CoV-2 has mutated over time, leading to the identification of several variants that, based on phylogeny [4], have been organized into clades named L, S, V, G, GH, GR, GV, GRY, and O (clade based on exclusion encompassing sequences that do not fit into other clades) [5,6].
According to the World Health Organization (WHO), there are Variants Of Interest (VOI), variants that have been recognized as being able to acquire community transmission causing clusters and being further identified in several countries, or assessed as a VOI SARS-CoV-2 M protein residues C64, C86, and C159 may also not be involved in M dimer interface [13].
Bioinformatic tools are well established methodologies that allow to attain a structural and functional characterization of relevant biomedical targets for viral infections and to predict new virus-host interactions [29,30]. In this work, through an inhouse developed in silico approach (Figure 1), we modelled the M protein monomer and dimer three-dimensional (3D)-structures along with predictions for their membrane orientation and homodimeric interface. We also predicted the impact of mutations in the predicted homodimeric interface, and the best region for the binding of new drugs/peptides, paving the way to structure-driven formulation of new therapeutic solutions. Figure 1. Project Pipeline. M protein structure was predicted by AlphaFold [31]. (1) Membrane orientation was predicted with Orientations of Proteins in Membranes (OPM) [32], prediction of Transmembrane Helices (TMpred) [33], TransMembrane prediction using cyclic Hidden Markov Model (TMHMM) [34,35], Prediction of secondary structure (PSIPRED) [36,37], Consensus Constrained TOPology prediction (CCTOP) [38,39], and Sequence Analysis and Consulting Service MEMbrane protein Structure And Topology (SACSMEMSAT) [40]. Protein-membrane systems were constructed with Chemistry at HARvard Macromolecular Mechanics Graphical User Interface (CHARMM-GUI) [41] and minimization and equilibration were conducted using GROningen MAchine for Chemical Simulations (GROMACS) [42,43]. (2) M protein dimer was predicted with High Ambiguity Driven protein-protein DOCKing (HADDOCK) [44] and results were compared to SARS-CoV experimental data. (3) Gene and protein mutations were analyzed with Microbial Genomics Mutation Tracker (MicroGMT) [45] and Rahman et al. [46] programs and energy variation of mutations in dimer interaction residues were calculated with FoldX [47]. (4) Druggable residues in SARS-CoV-2 Membrane protein dimer were predicted through FTMap [48] hotspot clusters.

Results
Due to the lack of an M-protein experimental structure, we have used the AlphaFold predicted structure and subjected it to a MD simulation protocol. AlphaFold [31] algorithm was shown to predict protein structures, attaining experimental resolution, as already demonstrated in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14). Furthermore, Leo et al. [49] showed the potentiality of using 3D structures of proteins attained by machine learning algorithms, and further refined them by a physics-based protocol, as followed in our study. Figure 1. Project Pipeline. M protein structure was predicted by AlphaFold [31]. (1) Membrane orientation was predicted with Orientations of Proteins in Membranes (OPM) [32], prediction of Transmembrane Helices (TMpred) [33], TransMembrane prediction using cyclic Hidden Markov Model (TMHMM) [34,35], Prediction of secondary structure (PSIPRED) [36,37], Consensus Constrained TOPology prediction (CCTOP) [38,39], and Sequence Analysis and Consulting Service MEMbrane protein Structure And Topology (SACSMEMSAT) [40]. Protein-membrane systems were constructed with Chemistry at HARvard Macromolecular Mechanics Graphical User Interface (CHARMM-GUI) [41] and minimization and equilibration were conducted using GROningen MAchine for Chemical Simulations (GROMACS) [42,43]. (2) M protein dimer was predicted with High Ambiguity Driven protein-protein DOCKing (HADDOCK) [44] and results were compared to SARS-CoV experimental data. (3) Gene and protein mutations were analyzed with Microbial Genomics Mutation Tracker (MicroGMT) [45] and Rahman et al. [46] programs and energy variation of mutations in dimer interaction residues were calculated with FoldX [47]. (4) Druggable residues in SARS-CoV-2 Membrane protein dimer were predicted through FTMap [48] hotspot clusters.

Results
Due to the lack of an M-protein experimental structure, we have used the AlphaFold predicted structure and subjected it to a MD simulation protocol. AlphaFold [31] algorithm was shown to predict protein structures, attaining experimental resolution, as already demonstrated in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14). Furthermore, Leo et al. [49] showed the potentiality of using 3D structures of proteins attained by machine learning algorithms, and further refined them by a physicsbased protocol, as followed in our study.

M Protein Monomer Structure and Membrane Orientation
M protein is a membrane protein and the determination of its correct orientation in the lipid bilayer membrane is needed to understand its main interactions, and, therefore, its biological function. To this end, six different web-based resources for membrane orientation prediction were used: OPM [32], TMpred [33], TMHMM [34,35], PSIPRED [36,37], CCTOP [38,39], and SACSMEMSAT [40].

M Protein Monomer Structure and Membrane Orientation
M protein is a membrane protein and the determination of its correct orientation in the lipid bilayer membrane is needed to understand its main interactions, and, therefore, its biological function. To this end, six different web-based resources for membrane orientation prediction were used: OPM [32], TMpred [33], TMHMM [34,35], PSIPRED [36,37], CCTOP [38,39], and SACSMEMSAT [40].
M protein Root-Mean-Square-Deviation (RMSD) results were obtained considering residues from the whole protein (monomer RMSD), TMH1, TMH2, C-terminal, and Nterminal, using the AlphaFold monomer structure as reference. M protein monomer predicted residue domains, after system equilibration by Molecular Dynamic (MD) simulations, were very similar for all membrane orientation predictions. Monomer RMSD values were 1. 61 Figures S1 and S2). For the following dimer prediction study, PSIPRED results were not used as RMSD values were higher for both monomer and transmembrane helices 1 and 2 RMSDs. Despite SACSMEMSAT and CCTOP having comparable values to the other predictors, they showed an arched TMH1 after an equilibration MD simulation that could influence dimer stability (Supplementary Figure  S2). Hence, out of the six membrane predictors used initially, OPM, TMHMM, and TMpred M protein monomers were chosen for further analysis.
OPM, TMpred, and TMHMM M monomers from the previous step were used to model dimer 3D structures using a well-established protein-protein docking software: HADDOCK [44]. From 3000 proposed docking decoys, 1000 for each membrane orientation, 20 dimer structures that respected the membrane orientation prediction were selected: 11 from OPM, 4 from TMpred, and 5 from TMHMM. From these 20 dimers, two structures from the TMHMM membrane predictor were chosen based on their similarity with SARS-CoV experimental detected interactions, namely in TMH2 (P59) and TMH3 (W92, L93, F96) regions [13]. From these two TMHMM M protein dimers, the final choice was based on PROtein binDIng enerGY (PRODIGY)'s metrics of biological probability and predicted binding affinity. Hence, the M protein dimer structure chosen for the proceeding studies showed 85.6% biological probability and a predicted binding affinity of −6.3 kcal/mol in comparison to 74.8% biological probability and −5.9 kcal/mol binding affinity results from the other chosen structure. Regarding the TMHMM monomer membrane prediction that served as template for the final chosen dimer, M protein monomer residues 11-19 were shown to stably belong to N-terminal domain, residues 100-203 to C-terminal domain, residues 20-38 to TMH1, residues 46-70 to TMH2, and residues 76-100 to TMH3 ( Figure 2).

Dimer Prediction
The final dimer 3D structure ( Figure 3) was subjected to three independent dimer system MD replicas of 0.5 µs. Minimum distance, radius of gyration and RMSD results in function of MD simulation time are further described in Supplementary Figures S3-S5, respectively. After equilibration, polar contacts between M protein monomer and membrane lipids occurred in M monomer residues K14, Y39, R42, N43, R44, F45, Y71, R72, W75, S94,  R101, R107, W110, S173, and R174. Transmembrane regions were within membrane lipids throughout the entire equilibration and several M protein residues were able to establish polar contacts with membrane lipids, supporting our transmembrane region assessment ( Figure 3). equilibration in membrane with ER membrane composition. (c) M protein structure with domains highlighted.

Dimer Prediction
The final dimer 3D structure (Figure 3) was subjected to three independent dimer system MD replicas of 0.5 μs. Minimum distance, radius of gyration and RMSD results in function of MD simulation time are further described in Supplementary Figures S3-S5,  respectively. After equilibration, polar contacts between M protein monomer and  membrane lipids occurred in M monomer residues K14, Y39, R42, N43, R44, F45, Y71, R72,  W75, S94, R101, R107, W110, S173, and R174. Transmembrane regions were within membrane lipids throughout the entire equilibration and several M protein residues were able to establish polar contacts with membrane lipids, supporting our transmembrane region assessment (Figure 3).  Figure S6) showed that monomer A and monomer B behaved differently throughout the MD simulation. In monomer A, TMH3 domain was the most stable region. Monomer A TMH2 domain interacted with monomer B and was a bit more unstable when compared with TMH1 domain (Supplementary Figure S6a). In monomer B, TMH domains were also very stable, presenting however a much higher deviation and lower stability of the N-and C-terminus compared with other domains (Supplementary Figure S6b). Root-Mean-Square-Fluctuation (RMSF) results (Supplementary Figure S7) for monomer A and monomer B were very similar. As expected, TMH residues, in large majority α-helices, showed low fluctuation, whilst Cterminus residues, present in a random coil, presented higher fluctuation. Cross-Correlation Analysis (CCA) results (Supplementary Figure S8) showed that within both monomers, TMH2 is highly positively correlated (moves in the same direction) with TMH1 and TMH3 within the same protein. On the contrary, between monomers, TMH1 and TMH2 showed a negative correlation (moving in opposite directions) with remaining helices of the opposite monomer. Average ΔSolvent accessible surface area (SASA) for the  Figure S6) showed that monomer A and monomer B behaved differently throughout the MD simulation. In monomer A, TMH3 domain was the most stable region. Monomer A TMH2 domain interacted with monomer B and was a bit more unstable when compared with TMH1 domain (Supplementary Figure S6a). In monomer B, TMH domains were also very stable, presenting however a much higher deviation and lower stability of the N-and C-terminus compared with other domains (Supplementary Figure S6b). Root-Mean-Square-Fluctuation (RMSF) results (Supplementary Figure S7) for monomer A and monomer B were very similar. As expected, TMH residues, in large majority α-helices, showed low fluctuation, whilst C-terminus residues, present in a random coil, presented higher fluctuation. Cross-Correlation Analysis (CCA) results (Supplementary Figure S8) showed that within both monomers, TMH2 is highly positively correlated (moves in the same direction) with TMH1 and TMH3 within the same protein. On the contrary, between monomers, TMH1 and TMH2 showed a negative correlation (moving in opposite directions) with remaining helices of the opposite monomer. Average ∆Solvent accessible surface area (SASA) for the interfacial residues only showed small variations, further strengthening the stability of the established homodimer interface (Supplementary Figure S9).

M Protein Mutation Analysis
We retrieved 1271550 M protein sequences, submitted between 10 January 2020 and 3 May 2021 from 180 countries, from the Global Initiative on Sharing All Influenza Data (GISAID) [50,51] database. Genomic sequences were obtained from human hosts, with more than 29,000 bases per sequence, and less than 5% missing values. The sequence distribution retrieved across GISAID clades and across the world can be observed in Supplementary Figure S9. Clades S, G, GH, and GR encompass sequences that are most prevalent in North America. The latter clade is also well represented in the Oceania region. Clades GV and GRY are most prevalent in Europe and clades O and L are sparse across the world (Supplementary Figure S10). Within the M protein interfacial residues from analyzed sequences, 91 Single Nucleotide Polymorphisms (SNPs) were retrieved from 21868 sequences. FoldX was used to assess the binding free energy differences between mutated and Wild-Type (WT) proteins ΔΔGbinding) and the respective values by physico-chemical character of the analyzed mutation are illustrated in Figure 6 and with higher detail in Supplementary Figure S11.  [44] prediction using TMHMM [34,35] based monomers with interfacial residues represented as sticks, and (b) interface zoom-in featuring interfacial residues identified with the color code of teal for Monomer A and garnet for Monomer B.

M Protein Mutation Analysis
We retrieved 1271550 M protein sequences, submitted between 10 January 2020 and 3 May 2021 from 180 countries, from the Global Initiative on Sharing All Influenza Data (GISAID) [50,51] database. Genomic sequences were obtained from human hosts, with more than 29,000 bases per sequence, and less than 5% missing values. The sequence distribution retrieved across GISAID clades and across the world can be observed in Supplementary Figure S9. Clades S, G, GH, and GR encompass sequences that are most prevalent in North America. The latter clade is also well represented in the Oceania region. Clades GV and GRY are most prevalent in Europe and clades O and L are sparse across the world (Supplementary Figure S10). Within the M protein interfacial residues from analyzed sequences, 91 Single Nucleotide Polymorphisms (SNPs) were retrieved from 21868 sequences. FoldX was used to assess the binding free energy differences between mutated and Wild-Type (WT) proteins ∆∆G binding ) and the respective values by physicochemical character of the analyzed mutation are illustrated in Figure 6 and with higher detail in Supplementary Figure S11. . ΔΔGbinding values of predicted interfacial residues that possessed mutations capable of causing major impact in protein stability. Color represents the alteration from aromatic to nonaromatic (teal), non-aromatic to aromatic (yellow), and non-aromatic to non-aromatic (garnet) (all the presented results are mean values ± standard deviation).

Single Mutation Analysis
To assess the mutation effect at the dimer complex, we have focused on the reported mutations at the predicted interface. In these considered regions the vast majority of the mutations did not impact protein stability in a significant manner (Supplementary Figure  S11, Table 2). Most mutations were found in non-polar residues and did not significantly impact protein stability. As expected, changes in polarity, which could have higher consequences in the microenvironment around them, led to more significant alterations of dimer stability. This was particularly true when changing from a polar to a non-polar residue (Supplementary Figure S11, Table 3). Aromaticity changes also seem key in this Figure 6. ∆∆G binding values of predicted interfacial residues that possessed mutations capable of causing major impact in protein stability. Color represents the alteration from aromatic to nonaromatic (teal), non-aromatic to aromatic (yellow), and non-aromatic to non-aromatic (garnet) (all the presented results are mean values ± standard deviation).

Single Mutation Analysis
To assess the mutation effect at the dimer complex, we have focused on the reported mutations at the predicted interface. In these considered regions the vast majority of the mutations did not impact protein stability in a significant manner (Supplementary Figure S11, Table 2). Most mutations were found in non-polar residues and did not significantly impact protein stability. As expected, changes in polarity, which could have higher consequences in the microenvironment around them, led to more significant alterations of dimer stability. This was particularly true when changing from a polar to a non-polar residue (Supplementary Figure S11, Table 3). Aromaticity changes also seem key in this dimer interface as mutations from non-aromatic residues towards aromatic represent the predominant change. None of the possible aromatic changes led to a significant change in protein stability. (Supplementary Figure S11, Table 4).  SNP I82T, located at the TMH3 domain, was the most common SNP detected. This mutation led to the residue's polarity modification from a non-polar residue into a polar one and occurred in 6316 (28.88%) sequences from our dataset. The second most frequent SNP was V70L, at the end of the TMH2 domain. This mutation did not change the type of polarity at that specific position and was detected in 6303 (28.82%) sequences. Physicochemical properties of valine and leucine are similar, and V70L mutation was shown to be inconsequential for dimer stability. However, this is not true for isoleucine and threonine, two amino acids with different polarity, and I82T leads to a stability gain. Interestingly, these two residues, I82 and V70, are interacting strongly throughout the entire MDs simulation (Table 1, Figure 5). These were by far the most common SNPs, with the third most common one occurring in only 1455 sequences (more details in Supplementary Table S1).

Mutation Distribution in Variants
We also analyzed the type of mutation found in each known clade (Supplementary Figure S11, Supplementary Table S1-single mutations and Table S2-co-occurring mutations). The most common mutated clade was GRY, and the most frequent mutation found in this clade was V70L (Table 5). This mutation co-occurred in GRY with M109L (8 cases), A104V (2 cases), and A69F (1 case) without any major identifiable energetic advantage (∆∆G binding around 0 kcal/mol) (Supplementary Table S2). The second most frequent mutated clade, where VOCs are also located, was GH and the most frequent mutation in this clade was I82T (Table 5). A few mutations also co-occurred with I82T but in low frequency. From these, A85S induced a higher stabilization of the dimer interface (∆∆G binding value of −1.47 ± 0.47 kcal/mol) (Supplementary Table S2). G clade was the third most mutated clade, and the most frequent one was I82T (Table 5). A few double mutations of interfacial residues were also found, in particular I82T-R107L (4 cases), I82T-V70F (2 cases), I82T-M109I (2 cases), I82T-V66M (2 cases), I82T-A85S (2 cases), and I82T-R107H (2 cases) but none led to higher changes in the binding free energy (Supplementary Table S2). GR clade was the following most mutated and the most common mutation in this clade was V70F (Table 5). A few mutations were found in association, such as A85S (3 cases, ∆∆G binding = −0.72 ± 0.64 kcal/mol) and A104V (1 case, ∆∆G binding = 0.10 ± 0.54 kcal/mol) (Supplementary Table S2). The remaining clades were much less populated with mutated sequences.   Table S2). The remaining clades were much less populated with mutated sequences.

Druggable Hot Spots
Solvent occlusion has already been demonstrated as a key aspect of PPIs, as main interfacial residues SASA values are considerably more diminished upon complex formation compared to other interfacial residues [52][53][54][55][56][57]. Two metrics are particularly relevant, ∆SASA and relSASA, as the first one allows us to quantify occlusion upon complex formation and the second, which represents the quotient between ∆SASA and SASA monomer , allows us to distinguish between residues with the same ∆SASA but with different solvent accessibility in the monomer. The most mutated residues such as I82, V70, and A69 showed higher ∆SASA and relSASA values, which indicates occlusion of these residues upon complex formation, with SASA complex values tending to zero (higher ∆SASA and rel SASA closer to 1) ( Table 6, Supplementary Table S1). Other frequently mutated residues lose accessibility to the solvent but remain attainable in the complex form: e.g., M109, A104, R107, and W75. By preventing bulk water to approximate these interfacial residues, the number and force of interaction established increases and the PPI is strengthened. Residues V70, M109, and I82 established a high number of dimer interactions: 6, 5, and 4, respectively. On the other hand, residues A69, R107, and W75 established two interactions each and residue A104 established only one interaction (Table 6, Supplementary Table S1). In order to understand if any of the previously mentioned mutation points were druggable residues, the production MD phase was clustered and subjected to FTMap [48,[58][59][60], to further characterize potential ligand hotspot positions for new drugs. FTMap [48,[58][59][60] probes were shown to cluster into two different positions: one on the transmembrane zone near to C-terminal (teal) and another between TMH2 and TMH3 (garnet), deeper into the membrane (Figure 8). The teal cluster displayed the highest FTMap [48,[58][59][60] score and the attained probes interacted with M protein dimer predicted interface residues F96, I97, F100, and F103 in Monomer A and with residues W55, S108, M109, S111, and F112 in Monomer B. The bulk of mutations for the residues that interact with this probe are mutations that don't impact protein stability or have a positive effect on stability (Table 7). Probes from this cluster also interacted with Monomer B residues Y47, L51, I52, Y95, W110, N113, P114, and G115, but these are not predicted homodimer interacting residues.

M Protein Monomer Structure and Membrane Orientation
At the moment, a few predictions for the M protein monomer were made as is the case for Heo and Feig [49], Zhang et al. [61], and AlphaFold [62]. In this work our starting point 3D model was the one developed by a state-of-the-art methodology, AlphaFold's, which was also determined as adequate and in consensus with other predictions [49]. We predicted its membrane orientation using six different membrane orientation softwares. After minimization and MD equilibration, we chose TMHMM M protein monomer membrane orientation prediction for the following studies since it showed a higher stability, with low RMSD values upon comparison with the initial AlphaFold's structure, and without any major conformational change. SARS-CoV M protein monomer domains were previously predicted in experimental research that elucidated M protein dimer interactions [13]. In that experiment, residues 15-37 were shown to belong to TMH1, residues 50-72 to TMH2 and residues 77-99 to TMH3 [13]. Herein, for the first time, a detailed SARS-CoV-2 M protein membrane orientation was proposed, showing that residues 20-38 belong to TMH1, residues 46-70 to TMH2, and residues 76-100 to TMH3, results in agreement to the above-mentioned SARS-CoV experimental results. We also show a comparison between our equilibrated model structure against two other relevant predicted structures (Supplementary Figure S13). The overall conformation is similar, and the highest differences were found at TMH2 and TMH1, in particular in the length and linearity of TMH1. Supplementary Figures S2 and S6 clearly show that these regions are very stable through the MD simulation, further strengthening our chosen model 3D structure.

M Protein Dimer and Interface Prediction
Despite the M protein dimer being crucial for various biological functions such as SARS-CoV-2 virion assembly and shape formation [63], the type of interactions established in its homodimer form are still poorly understood [9,12]. Experimental SARS-CoV M protein dimer data demonstrated that residues W19, W57, P58, W91, L92, Y94, F95, and C158 were relevant, suggesting that homologous residues W20 (TMH1 domain), W58, and P59 (TMH2 domain), and W92, L93 Y95, and F96 (TMH3 domain) of SARS-CoV-2 may also be important for M dimer interaction and stabilization [13]. Authors also hypothesized that SARS-CoV residues C63, C85, and C158 mutations did not interfere  Table 7. Summary for predicted interfacial residues that interact with the first probe. In the second cluster (garnet), probes interacted with predicted M protein dimer interface residues P59, V66, W92, L62, L67, A85, and L93 from Monomer A and B and I82 from Monomer B. Similarly, to the highest scoring cluster, the mutations on residues that interact with the second cluster are also mainly not impactful on protein stability or impact it in a positive manner. Even though some mutations are more frequent in these residues, all in all they are still very conserved (Table 8). Probes from this cluster also interacted with residues V60, A63, A81, A83, C86, G89, L90, M91, and S94 from Monomer A and V60, A63, C86, G89, and L90 from Monomer B, but these are not predicted homodimer interacting residues. Other clusters from FTMap [43,[53][54][55] were not considered as their probes were not interacting in the predicted interface region. Table 8. Summary for predicted interfacial residues that interact with the second probe. Overall, the highest score cluster interacted with very conserved interfacial residues whereas the second highest cluster was in a region constituted by residues more prone to stabilizing mutations (especially mutations that alter the chemical character of the interface).

M Protein Monomer Structure and Membrane Orientation
At the moment, a few predictions for the M protein monomer were made as is the case for Heo and Feig [49], Zhang et al. [61], and AlphaFold [62]. In this work our starting point 3D model was the one developed by a state-of-the-art methodology, AlphaFold's, which was also determined as adequate and in consensus with other predictions [49]. We predicted its membrane orientation using six different membrane orientation softwares. After minimization and MD equilibration, we chose TMHMM M protein monomer membrane orientation prediction for the following studies since it showed a higher stability, with low RMSD values upon comparison with the initial AlphaFold's structure, and without any major conformational change. SARS-CoV M protein monomer domains were previously predicted in experimental research that elucidated M protein dimer interactions [13]. In that experiment, residues 15-37 were shown to belong to TMH1, residues 50-72 to TMH2 and residues 77-99 to TMH3 [13]. Herein, for the first time, a detailed SARS-CoV-2 M protein membrane orientation was proposed, showing that residues 20-38 belong to TMH1, residues 46-70 to TMH2, and residues 76-100 to TMH3, results in agreement to the abovementioned SARS-CoV experimental results. We also show a comparison between our equilibrated model structure against two other relevant predicted structures (Supplementary Figure S13). The overall conformation is similar, and the highest differences were found at TMH2 and TMH1, in particular in the length and linearity of TMH1. Supplementary Figures S2 and S6 clearly show that these regions are very stable through the MD simulation, further strengthening our chosen model 3D structure.

M Protein Dimer and Interface Prediction
Despite the M protein dimer being crucial for various biological functions such as SARS-CoV-2 virion assembly and shape formation [63], the type of interactions established in its homodimer form are still poorly understood [9,12]. Experimental SARS-CoV M protein dimer data demonstrated that residues W19, W57, P58, W91, L92, Y94, F95, and C158 were relevant, suggesting that homologous residues W20 (TMH1 domain), W58, and P59 (TMH2 domain), and W92, L93 Y95, and F96 (TMH3 domain) of SARS-CoV-2 may also be important for M dimer interaction and stabilization [13]. Authors also hypothesized that SARS-CoV residues C63, C85, and C158 mutations did not interfere with M dimer formation, suggesting that homologous SARS-CoV-2 M protein residues C64, C86, and C159 may also not be involved in M dimer interface [13]. A previous in silico approach proposed a M protein dimer structure based on four different templates (PDB IDs: 3A7K_A [64], 5UTT_A [65], 6SPB_V [66], and 6XDC [67]), and authors hypothesized the interaction was established between TMH1 and TMH2 [68]. These results differ from the experimental results on SARS-CoV that showed dimer interactions between monomers TMH2 and TMH3, as well as from other in silico studies suggesting that these regions interact [69]. However, other authors also attained agreement with experimental results with SARS-CoV dimer interactions [69]. Herein, SARS-CoV experimental information was used as cue for various docking experiments as already detailed in the Results section. A high confidence docking decoy based on the TMHMM monomer was subjected to further studies due to its proper membrane orientation regarding previous analysis. It was subjected to 1.5 µs MD, which showed that overall conformational stability for monomer A and monomer B was slightly different (e.g., dissimilar RMSDs), whereas RMSF results were alike, especially in TMH domains. TMHs showed low fluctuations, which allowed the establishment of highly prevalent and meaningful interactions between the two monomers. We identified 34 main interactions responsible for the M protein dimer 3D structure stabilization, between 17 residues from monomer A and 21 residues from monomer B. From these interactions, 73.53% occurred between transmembrane residues, which was expected as the M protein is a transmembrane dimeric system. From these interactions, 12 were conserved throughout the entire MD simulation time, including interactions between W92-W92, L93-P59, and F96-F96, homologous residues from the ones detected to SARS-CoV [13]. This suggests that these three interactions are pivotal towards M protein dimer stabilization. Other interacting residues were present in lasting interactions throughout the MDs simulations, and thus important residues to further study and validate were W55, V66, A69, V70, Y71, W75, I82, L93, F103, and M109.
Feig's laboratory has also proposed two possible configuration arrangements for the M protein homodimer [70], which were named by Monje-Galvan et al. as "open" and "closed" conformations [69]. The dimer structure configuration obtained in our study is similar to the "open" one (Supplementary Figure S14), and it is also supported by other authors, such as Cao et al. [27]. Monje-Galvan et al. [69] also described the interface region of the homodimer in the TMH2 and TMH3 regions [69], which also comes into agreement with our study.

M Protein Mutation Analysis
Regarding mutation analysis, from the 127,1550 genomes analyzed, 21,868 sequences carried SNPs at M protein dimer predicted interaction residues. This represents only 1.7% of all retrieved genomes suggesting that the predicted interfacial region is extremely conserved [71]. We identified 91 unique SNPs in this predicted interface. From these, 2.77% had a ∆∆G binding higher than 0.50 kcal/mol, which means that these mutations can have a negative impact in the M protein dimer stability and 12.27% had a ∆∆G binding lower than −0.50 kcal/mol, and, hence, could have a favorable impact in M protein dimer stability. Most mutations did not appear to influence M protein dimer interfacial stabilization, since about 85% showed ∆∆G binding values between −0.50 kcal/mol and 0.50 kcal/mol. The ones that seem to lead to a gain of stabilization are presented in Table 9. We included here I82T as it is very close to our established threshold and is the most prevalent detected mutation. Most SNPs remained as non-polar residues (55.53%) or transitioned from nonpolar to polar residues (41.68%) and most continued as non-aromatic residues. Since the M protein is a membrane protein, many non-polar residues were found within the membrane region, and, as such, most predicted interactions involved non-polar residues. However, mutations from non-polar to polar residues may confer a gain in conformation stability as they may establish hydrogen bonds. In our work, 9057 (99.36%) of non-polar to polar SNPs had ∆∆G binding negative values, which endorses the maintenance or increase in stability as proposed. Mutations in homologous SARS-CoV experimentally interacting residues P59, W92, L93, and F96 were sparse and showed ∆∆G binding values close to zero. Three exceptions were exposed: L93S and W92Q with ∆∆G binding values lower than −0.5 kcal/mol, suggesting that these residues were also extremely important for M protein dimer interaction; and L93P ∆∆G binding = 2.29 kcal/mol) value, the second highest, probably due to the destabilization caused by Proline in the TMH3 α-helix. Table 9. M protein mutations that resulted in a stabilization gain of the homodimer structure.

Mutation
∆∆G binding (kcal/mol) The most common mutations were I82T (28.88%) and V70L (28.82%), key residues for M monomers interaction as I82 and V70 interaction was conserved throughout the entire MDs simulation with a mean distance of 8.62 ± 0.65 Å for I82-V70 and 9.08 ± 0.66 Å for V70-I82 interactions (monomer A-monomer B). Both these residues (V70 and I82) had low RMSF values and were occluded from solvent upon complex formation (∆SASA values between 45-88 Å2), which protects the established interactions. I82T and V70L, showed ∆∆G binding values of −0.49 ± 0.38 kcal/mol and −0.02 ± 0.22 kcal/mol, suggesting that I82T is the most favorable, high-prevalent mutation and should be further studied. A previous work that established structure changes caused by M protein mutations, suggested that I82T and V70L mutations will not lead to a significant impact in M protein monomer secondary structure [72].
Overall, most represented clades in our mutation study were GRY (36.69%), containing VOC and GH (21.25%), G (19.06%), and GR (17.27%), containing VOC and VOI. This could mean that SNPs in the interface region may impact SARS-CoV-2 life cycle, specifically regarding the M protein functions. Furthermore, these mutations are intrinsically related to known VOC and VOIs. For instance, V70L and I82T mutations appeared in 99.5 and 97.64% of clades sequences that contain VOC and VOI. The most common mutation in VOC was V70L, detected in 6137 VOC genomes, and 97.35% of the time this mutation was detected, it appeared in pango lineage B.1.1.7, a VOC in clade GRY.
There were 25 co-occurring mutations on the GISAID data, 12 of which on interfacial residues involved in PPIs present throughout the entire MDs simulation. Even though SNP V70L only co-occurred with other mutations in nine cases, these sequences were from clade GRY, which contains several VOC. Overall, clades G (27.45%), GRY (23.53%), GH (23.53%), and GR (19.61%) were the most represented in our co-occurrence results, all containing VOC. V70L does not seem to be by itself relevant for homodimer formation but seems to be a catalyzer if co-occurring with other interfacial mutations as found in various VOCs. Clades GV (3.92%) and S (1.96%) also contained sequences with co-occurring mutations, and the remaining ones did not show any co-occurring mutations. It is possible to conclude that most co-occurring mutations were indeed in VOC and VOI containing clades.
One of the new strategies in drug development has been to develop peptides to interrupt transmembrane interactions in dimers [73]. As such, to predict druggability, regions of interest to the design of new drugs/peptides capable of inhibiting the formation of the M protein homodimer, we subjected 16 structures representative of the 16 clusters of the MD production phase to the well establish tool, FTMap [48,[58][59][60]. The highest-ranking cluster from FTMap showed key interactions with predicted interfacial residues. Probes on this cluster interacted with residues F96, F103, S108, S111, and F112, all highly conserved residues. In fact, this cluster only established interaction with three residues with a higher number of mutations: F100 (98), M109 (1088), and I97 (228). I97 and F100 were the only ones for which 50 to 30% of the known mutations led to a stabilization of the homodimer. Our results point to some crucial interactions established by these residues: F96-F96, F100-F96, F100-F112, F103-F103, F103-S108, F103-S111, and F103-F112 (Table 1, Figure 4). As such, this region seems to be the best candidate area for the development of a new drug/peptide to inhibit SARS-CoV-2 M protein dimer formation. This zone was also a promising target in another approach that searched for druggable targets in the homodimeric structure [63].
A computational docking approach recently proposed the M protein heterodimer interactions with E and S protein [16]. In this study, residues W55, F96, and F103 were predicted as interacting residues in M-E PPI and Y71, and Y75 as interacting residues in M-S PPI [16]. In our work, these residues were also shown as interacting residues in the M protein homodimer, F96 as well as F103 are also present in the best drug target candidate region. This promiscuous region serves as a good candidate for experimental drug/peptide validation not only for the M protein homodimer, but also for M-E and M-S protein heterodimers, which further promotes its importance for the SARS-CoV-2 virus formation.

Materials and Methods
This work can be split into three main steps: M protein monomer membrane orientation prediction, M protein dimer 3D structure prediction, and mutation effect assessment in the homodimer interface. The overall workflow to accomplish these goals is illustrated in Figure 1.

M Protein Monomer Structure and Membrane Orientation
As there are no experimentally resolved structures for SARS-CoV-2 M protein dimer or monomer, and protein homology to other known 3D structures is reduced, we used AlphaFold's [31] team proposed monomeric structure from YP_009724393.1 sequence. AlphaFold is a state-of-the-art Neural Network (NN)-based algorithm that predicts protein 3D structures from their sequence with a mean accuracy of 2.1 Å [74]. From all the 223 amino acids in the M protein, AlphaFold was able to confidently predict a structure encompassing residues 11 to 203, which were the ones studied and the results presented henceforth. M proteins can suffer glycosylation in order to regulate protein function [75,76], but this process has not yet been studied in detail [77]. To the best of our knowledge there is only one available in silico prediction of N5, N21, N41, N43, N117, N212, N203, and N216 as the N-glycosylation sites of the M protein [77]. However, other studies open the possibility that the M protein is not N-but instead O-glycosylated [78]. Due to the lack of confident experimental data and considering that the predicted residues are far away from the binding homodimeric interface for which we aimed to analyze potential gain/loss of stability, we decided to reduce the modeling uncertainty and neglect glycosylation at this stage. Six different web-based resources for membrane orientation prediction were used: OPM [32], TMpred [33], TMHM [34,35], PSIPRED [36,37], CCTOP [38,39], and SAC-SMEMSAT [40]. OPM database can predict protein structure within the lipid bilayer, and it optimizes position considering protein-membrane interactions [32]. TMpred predicts membrane-spanning regions and orientations from naturally occurring membrane proteins [33]. TMHMM correctly predicts membrane proteins' α-helices positions with an accuracy of 77%, differentiating between soluble and membrane proteins [34,35]. PSIPRED predicts membrane protein secondary structure based on position-specific scoring matrices [36,37]. CCTOP predicts transmembrane topology using known experimental and computational membrane topologies [38,39]. SACSMEMSAT can predict protein secondary structure and membrane protein topology from well-defined membrane protein data [40].
We used MD simulations for the M monomer initial minimization considering each membrane orientation obtained via OPM, TMpred and TMHMM, PSIPRED, CCTOP, and SACSMEMSAT. MDs were performed using GROMACS [42,43] and the CHARMM36 force field [79]. Each system was built with CHARMM-GUI [41] membrane builder with TIP3 waters, 0.9 M Na + and Cl − ions and a bilayer membrane with POPC:POPE:PI:POPS:PSM:Cholesterol, in order to replicate human ER membrane [80], as M protein is translated and virus is assembled in this organelle. System size, water molecules, ion numbers, and lipid composition are described in Supplementary Table S3. Systems initial minimization was performed to remove bad contacts using the steepest descent algorithm. In this step, systems were heated with a Berendsen thermostat at 310 K in the canonical ensemble (NVT) over 7 ns, an adequate temperature to use in SARS-CoV-2 M protein MD simulations [81]. Pressure was kept constant at one bar with an isothermal-isobaric ensemble (NPT) for 20 ns with a semi-isotropic pressure coupling algorithm [82]. Long-range electrostatic interactions were treated by the fast smooth Particle-Mesh Ewald (PME) method [83]. RMSD analysis was conducted in Pymol, version 1.2r3pre with protein and transmembrane Cα residues to establish structural differences between initial MD structure (AlphaFold M protein prediction) and membrane orientation equilibrated results.

M Protein Dimer and Interface Prediction
OPM, TMpred, and TMHMM protein monomers were selected from system equilibration results and subjected to M protein dimer prediction. To guide the protein-protein docking we used known information on SARS-CoV M protein that has a 90.5% sequence identity and 90% homology with SARS-CoV-2 M protein [28]. Two equilibrated M protein monomers from each membrane orientation were used for dimer prediction using the docking tool HADDOCK [44], version 2.4, a protein quaternary structure predictor based on experimental data. Since M protein is a membrane protein and most homodimers are symmetric [84], water docking results were not considered and docking results with TMH2 and TMH3 non-crystallographic symmetry restraints were generated. To determine M protein monomer's active residues, CPORT [85], a protein-protein residue interaction predictor at an atomic level, was used and only transmembrane residues predicted by this tool were considered for downstream steps. For each membrane predictor, 5000 dimer structures were generated in rigid body docking phase (it0) and 1000 structures for the semi-flexible refinement phase (it1). Monomer structures at the dimer docking decoys were superimposed with initial monomer membrane orientation prediction, and the 3D structures for which the angle between both superimposed monomer membranes was inferior to 1 • , overlaid membranes, were selected for further analysis. Upon the selection of the most 20 promising HADDOCK dimers 3D structures, we extended our work towards interface interacting residues prediction. Protein Interfaces, Surfaces and Assemblies (PISA) [86], a web-based tool that resorts to chemical-physical principles for analyzing and modeling of macromolecular interactions, was used as a first predictor for dimer interface residues on all 20 dimer structures. Two dimers were chosen based on PISA results and their comparison with SARS-CoV's M protein dimer experimental results, highlighted homologous SARS-CoV-2 residues W20, W58, P59, W92, Y95, F96, and C159 as important residues for dimer stabilization. Selected structures were further subjected to PRODIGY [87,88]. PRODIGY not only predicts dimer interacting residues, but also helps to determine if a protein interface is crystallographic or biological, the latter meaning that the predicted dimer is biologically relevant. The final dimer system was built in a similar way as above-mentioned for M protein monomer MD simulations [80] (Supplementary Table S3). Three independent dimer system replicas of 0.5 µs MD simulations were produced with GROMACS (production phase). M protein dimer equilibration was performed as described in the previous section. MD simulations were performed with an isothermal-isobaric ensemble. Temperature coupling was done using a Nose-Hoover thermostat with a time constant of 1 ps. To maintain a constant pressure, a semi-isotropic Parrinello-Rahman barostat was used with a time constant of 5 ps and compressibility of 4.5 × 10 -5 bar −1 . Electrostatic interactions were performed with fast smooth Particle-Mesh Ewald, with a cutoff of 1.2 nm and Hydrogen bonds were constrained using the linear constraint solver.
Dimer system RMSD (between initial MD structure from HADDOCK TMHMM model and membrane orientation equilibrated results) and RMSF calculations were performed using Cα atoms with GROMACS package. CCA, which tracks the movements of two or more sets of time series data relative to one another, was performed using the Bio3D R package [89] based on the Cα atoms. SASA analysis for each residue was performed with the GROMACS package SASA analyses were performed for the dimer complex (SASA complex ) and each monomer separately (SASA monomerA and SASA monomerB ), and ∆SASA was calculated for each residue as SASA complex − (SASA monomerA + SASA monomerB ). ∆SASA values provide another quantitative measure of conformational change upon protein coupling. To further understand the behavior upon complex formation, we also calculated rel SASA for each residue that comes from the quotient between ∆SASA and SASA monomer . To detect possible interacting residues, a structure was retrieved every 2 ns, totaling 100 structures from 300 ns until 500 ns, for each replica as further explored in Supplementary Figures S2-S4. These structures were then submitted to an in-house script that detected residues for which side chains were within 5 Å of each other, using a 90% prevalence time as a cut-off.

M protein Mutation Analysis
Genome and protein sequences for this study were obtained from the GISAID [51] database (Accession Numbers are listed at Supplementary Information) and are available upon request at https://www.gisaid.org. MicroGMT [45], a python package, was developed, optimized, and used for SARS-CoV-2 M gene mutation analysis, to track indels and SNPs. This software requires raw or assembled genome sequences and works through database comparison to detect genomic mutations. Only non-synonymous SNPs at the M gene region for predicted interacting residues were considered for further studies. For M protein sequence mutation analysis, we used the Rahman et al. approach that works through pairwise analysis and comparison [46]. This method uses Multiple Sequence Alignment (MSA) and pairwise alignments to detect mutations in large datasets in a fast and accurate manner and has also been used in other studies regarding different SARS-CoV-2 proteins. Both tools were used with default parameters and all available sequences were compared against a reference, the first SARS-CoV-2 genome sequenced (NC_045512.2).
To determine the impact of mutations in M protein dimer stability, Gibbs energy difference was calculated using FoldX [47], an empirical force field. This approach evaluates the impact of mutations in protein stability through free energy variation )(∆∆G binding = ∆G mutant − ∆G WT ) between mutant protein and reference protein, considering contributions from hydrophobic, polar, Van der Waals, hydrogen bonds, and electrostatic interactions [47]. To avoid considering mean ∆∆G binding values close to zero as relevant for protein stability, we established a low (below −0.5 kcal/mol) and high cut-off off (above 0.5 kcal/mol). Results for this step were analyzed considering residue polarities, both for the WT (protein sequence YP_009724393.1) and mutated proteins, as well as splitting residues by aromaticity, as both these characteristics have a major impact on protein-protein interactions. Residues considered as polar were R, N, D, C, E, N, H, K, S, T, Q, and Y; residues considered as non-polar were A, G, I, L, M, F, P, W, and V. Residues F, W, and Y were considered as aromatic.
To further identify ligand binding hotspots on SARS-CoV-2 M protein homodimer, an ensemble of representative structures was attained by clustering the production phase of the MD simulation. This clustering was performed by concatenating the trajectories and clustering with GROMACS using the gromos method with a cutoff of 0.25 nm. The 16 clusters were subjected to the FTMap [48,[58][59][60] tool using the default parameters. FTMap [48,[58][59][60] uses 16 small organic molecules as probes and samples/scores billions of positions to identify positions of interest to the development of new drugs.

Conclusions
As M protein dimer has several important functions during SARS-CoV-2 life cycle, it was fundamental to understand its structure-function relationship. Herein, upon establishing a comprehensive and well detailed computational pipeline, we were able not only to assess mutation effects at this interface but also to understand the specificities of the behavior of this region and establish the consequences for dimer stability. This was the first time that SARS-CoV-2 M protein dimer structure, interactions and mutational effects were proposed and thoroughly studied either computationally or experimentally. M protein is overall well conserved, showing that key-residues F96, F103, S108, S111, and F112 are preserved and able to form important interactions in the dimer. These residues can now be assessed as regions of interest for new therapeutic solutions regarding SARS-CoV-2.

Data Availability Statement:
The genomic datasets analyzed during the current study are freely available in the GISAID repository, at https://www.gisaid.org/ (accessed on 3 May 2021), and Accessions Numbers are available at Supplementary Information. GISAID has an application procedure for obtaining access to the data, which should be followed for any researcher that wants to use it. Detailed data analysis results are also available at Supplementary Materials. Any material requests should be addressed to I.S.M.: irina.moreira@cnc.uc.pt.