A Protein Co-Conservation Network Model Characterizes Mutation Effects on SARS-CoV-2 Spike Protein

The emergence of numerous variants of SARS-CoV-2 has presented challenges to the global efforts to control the COVID-19 pandemic. The major mutation is in the SARS-CoV-2 viral envelope spike protein that is responsible for virus attachment to the host, and is the main target for host antibodies. It is critically important to study the biological effects of the mutations to understand the mechanisms of how mutations alter viral functions. Here, we propose a protein co-conservation weighted network (PCCN) model only based on the protein sequence to characterize the mutation sites by topological features and to investigate the mutation effects on the spike protein from a network view. Frist, we found that the mutation sites on the spike protein had significantly larger centrality than the non-mutation sites. Second, the stability changes and binding free energy changes in the mutation sites were positively significantly correlated with their neighbors’ degree and the shortest path length separately. The results indicate that our PCCN model provides new insights into mutations on spike proteins and reflects the mutation effects on protein function alternations.


Introduction
Coronavirus disease 2019 , caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has posed a serious global public-health emergency and was declared as a global pandemic by the World Health Organization (WHO) on 11 March 2020. The infection possesses a clustering onset and targets the respiratory framework of its host, causing influenza-like sickness with symptoms such as cough, fever, and then, in progressively serious cases, troubled breathing [1][2][3]. COVID-19 has spread rapidly over 200 countries and territories around the world, which has resulted in more than 661 million coronavirus cases and six million deaths (https://www.worldometers.info/coronavirus/, accessed on 24 December 2022).
SARS-CoV-2 has been confirmed to enter the host cells by contact between its receptorbinding domain (RBD) of the spike (S) protein and angiotensin converting enzyme 2 (ACE2) of the host cell [4,5]. Although substantial progress in clinical research has led to a better understanding of SARS-CoV-2 and the management of COVID-19, many countries have endured several waves of outbreaks due to the emergence of variants [6]. The S protein displays a high degree of genetic variability in the virus genome [7], and its variants make additional close contacts with ACE2, correlating with higher binding affinity and perhaps increased infectivity [8]. For example, the Omicron variant (B.1.1.529) can evade immunity from natural infection or vaccines and increase the risk of reinfections [9][10][11]. Acquiring a comprehensive knowledge of the S protein variants is critical for the determination of a defensive approach against COVID- 19.
The biological network approach, which is based on local and global topological features of the network, has been extensively used in various biological fields [12], and has also been applied in COIVD-19 related research. Deisy constructed a network medicine framework to identify potentially effective drug candidates [13]. Luisa et al. applied protein contact networks and found that the allosteric modulation region modulates the binding of the spike protein with ACE2 in different conformational states of the spike protein [7,14]. One of the main network-based strategies is to convert a protein into a network to study the structure, function, and dynamics of proteins [15][16][17]. However, most of the network models for protein function study require the 3D structures of proteins that can be experimentally determined, but the process is still laborious and time-consuming. In particular, the emergence of new variants of COVID-19 brings challenges for structurebased network models if the structures from the variants are not determined.
In this study, we propose a protein co-conservation weighted network (PCCN) model simply based on protein sequence to investigate the mutation effects on the spike protein ( Figure 1). We then compared the network topological features of the mutation and nonmutation sites at the residue and variant levels. Finally, we mined the correlation of topological features with the mutation effects on the spike protein stability changes and binding free energy changes.

Conservation and Co-Conservation of the Spike Protein
To calculate the conservation of the spike protein, the protein family was initially searched in InterPro [18] for the Spike glycoprotein, beta-coronavirus (IPR042578), and 11,372 related sequences were found. Among these sequences, those within the same species had high similarity, so only one of them was retained, leaving 263 sequences that were aligned to calculate the conservation (CS) and co-conservation (CCS) using ProCon [19,20].
The distribution of CS is shown in Figure 2. The left panel shows the conservation of each residue, which shows a trend of normal distribution. Notably, the scores in the range of 0.9 to 1.0 were particularly high. On the right, the CCS among residues is shown, which followed an overall trend of normal distribution, except that the CCS was small. The results demonstrated that the rate of residues with high conservation residues and low co-conservation was relatively high, which indicated that many of the residues in the spike protein were conserved, but their co-conservation was relatively weak. Next, an investigation into the conservation and co-conservation of mutation sites in variants that indicate a significant impact on transmissible, severity, and immunity was conducted. As shown in Figure 3, most variants had substitutions of D614G, followed by substitutions at sites 484E, 681P and 501N, which are relatively common and considered to increase infectivity with clear evidence [21][22][23][24][25][26][27][28][29][30]. Regarding the co-conservation of mutation sites in the variants, most of the variants had strong co-conservation site pairs, which may indicate that the strong co-conservation sites may play an important role in infectiousness. However, we also found that although some sites (such as 484E, 681P, and 417K) appeared more frequently, they did not have strong co-conservation with other sites.

Protein Co-Conservation Network Construction
A protein co-conservation weighted network of spike protein was constructed based on the CCS score that was calculated from the sequence of the protein. As shown in Figure 1, there were 1273 amino acids in the spike protein, which corresponded to 1273 nodes in the network. A total of 264,046 (16.3%) of all possible residue pairs had strong co-conservation as edges in the network, and the weights of the edges were CCS. Thus, the PCCN model could reveal the co-conservation information of residues from a network view.

Network Topological Features of PCCN of Spike Protein
Next, the topological features of PCCN were investigated and the correlation between these network features and the infectious or pathogenicity of SARS-CoV-2 was further explored. The network analysis was performed at two levels: residues and variants.

Mutation Sites Showed High Centrality in PCCN at the Residue Level
In the network, 89 nodes (residues) produced substitution mutations in the main variants. To compare the topological features of these 89 mutation nodes with the non-mutation nodes, we randomly selected 89 nodes from the non-mutation nodes and computed their average network parameters for 1000 times to present the network features of non-mutation nodes, which were then compared with the 89 mutation nodes. As shown in Figure 4, we calculated the average neighborhood weighted degree (Kw), page rank (P), degree centrality (D), betweenness (B), closeness (C), and shortest path length (L) and found that the mutation nodes had significantly larger Kw, P, D, B, but smaller L than the corresponding non-mutation nodes. The results indicate that the mutation nodes demonstrated more centrality than the non-mutation nodes and that they were topologically important in the PCCN. Moreover, the CS and CCS of the mutation nodes were significantly smaller than those of the non-mutation nodes, which showed that the conservation and co-conservation of the mutation nodes were relatively weak. Table S1 lists the detailed network feature information of the nodes. Finally, the top three mutation sites and co-mutation sites with the highest network parameters are listed in Table 1. Several substitutions had more than one high parameter such as 498Q, 440N, 516E, 1027T, and 614D. Then, we investigated these five substitution site relations with their allosteric modulation region (AMR) and RBD and found that 614D and 1027T had more interactions with both AMR and RBD ( Figure 5). The results indicate that there was more of a co-conservation tendency among 614D, AMR, and RBD, which may be the reason for the mutation of 614D and will allow for a modulated structural rearrangement for membrane fusion [31] and increase the infectivity and stability of virions [25].

Parameters
Top Three

. Mutation Sites Showed High Centrality in PCCN at the Variant Level
In this element of the study, 33 special variants were collected that have caused great harm to human society and are considered as highly infectious or virulent. To study the variants from a network view, we compared the network features of variants with those of the non-variants. For the variant group, the average parameter of mutation sites in each variant was used to present the feature for the variant; therefore, there were 33 values in the variant group for each parameter. For the non-variant group, random selection was applied to achieve the same number of non-mutation sites as each variant and computed the average parameter 1000 times. For example, there were five substitutions in the Beta variant, and five nodes were taken from the non-mutation sites 1000 times. The distribution of network characteristics for the same number of nodes is shown in Supplementary Figures S1-S8.
As a result, the variant group also showed significantly higher Kw, P, D, and B than the non-variant group ( Figure 6). The results indicated that the mutation sites that occurred on the same variants have high centrality and may play an important role in their conservation. In addition, the CS and CCS of mutation nodes in the variants were significantly smaller than the non-mutation nodes in the non-variant group, which showed that the conservation and co-conservation of mutation nodes were relatively weak at the variant level. Finally, Omicron BA.4 was taken, which contained the most mutation sites among the variants, as an example to perform the network analysis. As shown in Figure 7, similar to previous results, the CS and CCS of the mutation sites in BA.4 were significantly smaller than the non-variant group, which means that both the conservation and co-conservation of the mutation sites in the variant were lower. For the network parameters, only the shortest path length in Omicron and the non-variant group showed significant differences.

Stability Changes of Spike Protein upon Mutations Significantly Correlated with the Node Page Rank and Neighbors' Degree
To study the mutation effects for the spike protein function from a conservation network view, we performed the correlation analysis between the topological features of the mutation sites and mutation effects on the protein stability changes.
DeepDDG [32] was used to predict the changes in stability for each point mutation and each mutation pair based on the spike protein structure. As shown in Figure 8a, most of the mutations were destabilizing mutations, for example, G142D, E156G, and C136F were the top three mutations that obtained the smallest stability scores, but D138Y obtained the largest positive stability score, which means that the mutation increased the stability. Then, we conducted a correlation analysis between the topological features and the stability changes upon mutations and found that three topological parameters including page rank, average neighborhood weighted degree, and degree centrality were significantly correlated with stability changes (Table S2). In particular, there was a positive correlation between the page rank (P) and stability changes with r = 0.271 and p = 6.07 × 10 −3 (Figure 8b). A similar correlation pattern was also found between the average neighborhood weighted degree (Kw) and stability changes, as shown in Figure 8c (r = 0.253 and p = 1.10 × 10 −2 ). The results indicate that if the nodes themselves had a high degree of centrality, high page rank, or their neighbors had a high degree, their mutations tended to destabilize the spike protein.

Binding Free Energy Changes of Spike Protein upon Mutations Significantly Correlated with CCS and Average Shortest Path Length
The binding of the spike protein with the human receptor ACE2 plays a central role in the molecular machinery of the SARS-CoV-2 infection of human cells. Mutations in the spike protein affect the binding affinity with ACE2 and have been established to be proportional to the infectivity of different viral variants in the host cells [4,[33][34][35][36]. Therefore, we analyzed the relationship between the PCCN topological features and binding free energy (BFE) changes upon mutation to investigate the variant effects on protein binding from a network view based only on the sequence information.
TopNetTree [37] was employed to estimate the BFE changes upon the mutations listed in Table 2 of the SARS-CoV-2 spike protein with ACE2 (PDB: 7A98 [38]). As shown in Figure 9a, all of the mutations increased the BFE between the spike glycoprotein and ACE2. The changes in BFE were significantly negatively correlated with the closeness centrality with the p-value of 0.045 and the coefficient of −0.207. Next, we analyzed the co-mutation effects on the spike protein. The correlation of the shortest path length and CCS between two mutations with the BFE changes of the two mutations were calculated, respectively. As shown in Figure 9b,c, the BFE changes were significantly positively correlated with CCS (r = 0.233 and p = 1.00 × 10 −3 ) and shortest path length (r = 0.179 and p = 6.15 × 10 −8 ) from the two mutation sites. The results demonstrate that if the two mutation sites have the tendency of high co-conservation or larger distance, they might induce larger BFE changes. Moreover, the top three co-mutation pairs with the largest BFE were D215G-D614G, S371F-D614G, and D614G-T732A. Notably, D614G was involved in all of them, which has been reported to have an important impact on infectivity [21][22][23][24][25][26][27]; although it is not in the RBD, it has an important impact on BFE. The results indicate that our PCCN model based on sequence information could also reflect the mutation effects on the binding affinity changes between the spike protein and ACE2. The detailed correlation parameters are listed in Table S3. VOC: variants of concern; DEV: de-escalated variants.

Discussion
Over the past years, the constant emergence of new variants of SARS-CoV-2 has presented challenges in the prevention of the spread and treatment of COVID-19. Studying the mutation effects on the spike protein from SARS-CoV-2 remains important. In this study, we proposed a sequence-based network model called PCCN that contains conservation and co-conservation information to study the mutation effects on the spike protein from a network view. We found that the mutation sites in the PCCN not only demonstrated lower conservation and co-conservation tendencies than the non-mutation sites, but also showed significant topological differences such as the average neighborhood weighted degree and page rank, both from non-mutation sites at the residue level and variant level. The results mean that the PCCN model could capture the topological features of mutation sites and thus has the potential to predict the mutation sites from non-mutation sites.
The stability and binding free energy of the spike protein were correlated with SARS-CoV-2 infectivity [4,[33][34][35][36]. Thus, we analyzed the correlation between the topological features of PCCN and changes in the stability and binding free energy of spike protein upon mutations. The stability changes were significantly positively correlated with the node page rank and neighbors' degree, while BFE changes were positively correlated with the shortest path length, and correlated negatively with the closeness centrality. The results indicate that our PCCN model could reflect the changes in the physico-chemical characteristics of the spike protein upon mutation.
However, this study had two potential limitations. First, the limited number of variants may result in bias for the topological feature comparison analysis between the variants and non-variants. Second, the relationship between the topological features and spike protein function needs further validation in large datasets or experimental verification. In future study, we will continue to collect the variants to explore and enhance the application of PCCN for the SARS-CoV-2 study.
Taken together, the sequence-based PCCN presented here not only suggests topological differences in mutation sites from non-mutations, but also provides new insights into the mutation effects on the spike protein physico-chemical characteristics without any protein structure information. Moreover, the PCCN was only based on the sequence information, which means that it can be quickly migrated to study the effects of continuously emerging variants of SASR-CoV-2. This is an adaptive way to face the emergence of new strains or viruses in the future.

Dataset
Since the outbreak of SARS-CoV-2, many variants have emerged on the spike protein that are highly infectious and have caused great destruction to the world such as Omicron [39,40], Beta [41][42][43][44], Gamma [44][45][46], and Delta [47]. In this paper, we focused on the variants provided by the ECDC. A total of 48 variants were collected from the ECDC website (https://www.ecdc.europa.eu/en/covid-19/variants-concern, accessed on 9 July 2022). The variants were divided into four types: variant of concern (VOC), variant of interest (VOI), variant under monitoring (VUM), and de-escalated variants (DEV). Then, 33 variants with currently available evidence of increasing transmissibility, immunity, or infection severity were kept for further analysis ( Table 2). The substitutions of variants were selected from [48]. A total of 101 substitutions were observed that corresponded to 89 distinct mutation sites.

Conservation Calculation
The conservation score (CS) of each residue and the co-conservation score (CCS) between residues in the spike protein were calculated by ProCon [19,20], which is based on multiple sequence alignment (MSA) and information theory. The sequences for MSA were obtained from the InterPro database [18] by looking for its protein family to find similar sequences in different species. Then, sequences were aligned by ClustalX2 [49] and used as ProCon input to compute the spike protein conservation information.
The raw CS that was calculated by ProCon is between 0 and 3, and the CCS between residues is between 0 and 350. The scores were normalized by a min-max normalization method as shown in Equation (1), and they were mapped between 0 and 1: where x max and x min are the maximum and minimum scores, respectively. Co-conservation scores for all possible residue pairs can be calculated by ProCon, and the larger the score, the stronger the co-conservation tendency. However, most of the residues pairs had weak co-conservation. We tried to keep more residue pairs with relatively strong co-conservation to construct the network. As shown in Figure 2, when the co-conservation score was smaller than 0.268, the number of residue pairs increased sharply, which means that they had less of a tendency of co-conservation. Therefore, residue pairs with CCS larger than 0.268 were selected as strong co-conservation.
In order to simplify the representation of the score, the CS and CCS denote the normalized conservation score and co-conservation score, respectively, in the later section.

Protein Co-Conservation Network Construction
The spike protein was represented as a network consisting of a set of nodes and edges. Each amino acid residue is presented as a node. The node set V is defined as follows: where i is the residue id in the protein; v i is the corresponding residue (node) for I; and N is the number of residues in the protein.
An edge was placed between two residues if the CCS between them was larger than 0.268. Thus, the adjacent matrix W of the PCCN was defined as follows: When two nodes had strong co-conservation, that is, the co-conservation score was greater than or equal to the threshold of 0.268, there was an edge between the nodes and the weight of the edge was the CCS. The links between adjacent residues along the backbone were also kept and their weights were set at the threshold 0.268 if the CCS between them was less than the threshold. Thus, the PCCN is a network model that contains co-conservation information among residues.

Network Analysis
After constructing the PCCN, we analyzed the network topological characteristics of the residues in the network. Six network characteristics were used in our work: average neighborhood weighted degree (Kw), page rank (P) [50], closeness centrality (C), betweenness centrality (B), degree centrality (D), and shortest path length (L). For node v i , the parameters Kw, D, B, and C are defined as follows: where n i is the of adjacent nodes number of v i , and σ(s, t) is the shortest path length from node s to node t.
Page rank was originally designed to measure the importance of the website pages. Here, it was used to measure the importance of a node [50]. While it is an algorithm of directed graphs, each edge of the conservative interaction network will be converted to two directed edges. The shortest path length is the length of the path with the minimum sum of edge weights. Since all nodes are connected, there is an average shortest path between any pair. These network parameters were implemented by NetworkX [51].

Estimations of Binding Free Energy (BFE) Change
The change in binding free energy (∆∆G) upon a single mutation from variants was estimated using TopNetTree [37]. The effects of two substitutions on the BFE changes were evaluated as follows: where ∆∆G i and ∆∆G j are the binding free energy changes following mutation i and j, respectively; and ∆∆G ij is the combined changes of both mutations i and j.

Estimations of Stability Changes
The effect of the amino acid substitutions on stability was calculated using Deep-DDG [32]. Similar to the binding free energy changes of two substitutions, the effects of two mutations on the stability changes were defined as follows: where S i and S j are the stability changes following mutation i and j and S ij is the combined changes of stability following the mutation i and j. The structure of the spike protein was obtained from the website of Zhang's laboratory (https://zhanggroup.org/COVID-19 /index.html#table1, accessed on 30 October 2022) with the identification code QHD43416 that was generated by the D-I-TASSER/C-I-TASSER pipeline [52].

Statistics
The Spearman's correlation coefficient was used to analyze the correlation between the protein properties and network characteristics. The Wilcoxon rank-sum test was applied to analyze the statistically significant differences and was implemented by SciPy [53].

Conclusions
In conclusion, we proposed a network model known as PCCN to study the mutation effects on the spike protein only from the sequence information. The topological features of PCCN showed a significant difference at both the residue and variant levels. In particular, the page rank and average neighborhood weighted degree could reflect the mutation effects on the spike protein stability, while the co-conservation score and shortest path length could indicate the mutation effects on the binding free energy changes. The PCCN model offers a new approach for studying variants that may cause future pandemics based on the sequence and topology information.

Conflicts of Interest:
The authors declare no conflict of interest.