Next Article in Journal
Towards Quantum-Chemical Level Calculations of SARS-CoV-2 Spike Protein Variants of Concern by First Principles Density Functional Theory
Previous Article in Journal
Severe COVID-19 Illness and α1-Antitrypsin Deficiency: COVID-AATD Study
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes

by
Emmanuel Broni
1 and
Whelton A. Miller III
1,2,*
1
Department of Medicine, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
2
Department of Molecular Pharmacology & Neuroscience, Loyola University Medical Center, Loyola University Chicago, Maywood, IL 60153, USA
*
Author to whom correspondence should be addressed.
Biomedicines 2023, 11(2), 512; https://doi.org/10.3390/biomedicines11020512
Submission received: 16 January 2023 / Revised: 3 February 2023 / Accepted: 8 February 2023 / Published: 10 February 2023
(This article belongs to the Section Immunology and Immunotherapy)

Abstract

:
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a serious global challenge requiring urgent and permanent therapeutic solutions. These solutions can only be engineered if the patterns and rate of mutations of the virus can be elucidated. Predicting mutations and the structure of proteins based on these mutations have become necessary for early drug and vaccine design purposes in anticipation of future viral mutations. The amino acid composition (AAC) of proteomes and individual viral proteins provide avenues for exploitation since AACs have been previously used to predict structure, shape and evolutionary rates. Herein, the frequency of amino acid residues found in 1637 complete proteomes belonging to 11 SARS-CoV-2 variants/lineages were analyzed. Leucine is the most abundant amino acid residue in the SARS-CoV-2 with an average AAC of 9.658% while tryptophan had the least abundance of 1.11%. The AAC and ranking of lysine and glycine varied in the proteome. For some variants, glycine had higher frequency and AAC than lysine and vice versa in other variants. Tryptophan was also observed to be the most intolerant to mutation in the various proteomes for the variants used. A correlogram revealed a very strong correlation of 0.999992 between B.1.525 (Eta) and B.1.526 (Iota) variants. Furthermore, isoleucine and threonine were observed to have a very strong negative correlation of −0.912, while cysteine and isoleucine had a very strong positive correlation of 0.835 at p < 0.001. Shapiro-Wilk normality test revealed that AAC values for all the amino acid residues except methionine showed no evidence of non-normality at p < 0.05. Thus, AACs of SARS-CoV-2 variants can be predicted using probability and z-scores. AACs may be beneficial in classifying viral strains, predicting viral disease types, members of protein families, protein interactions and for diagnostic purposes. They may also be used as a feature along with other crucial factors in machine-learning based algorithms to predict viral mutations. These mutation-predicting algorithms may help in developing effective therapeutics and vaccines for SARS-CoV-2.

1. Introduction

Proteins play crucial roles in many biological processes in organisms and they remain a major source of druggable targets [1]. Protein homologs usually have similar structures and functions [2]. In addition, proteins belonging to the same family and having related shapes tend to interact with other molecules in similar ways. Proteins bind to other molecules in order to complete specific tasks, which is highly influenced by the way the protein’s exposed surfaces interact with the molecules. The proteome of an organism consists of the proteins that are or can be expressed by a cell, tissue or organism, including intrinsically disordered proteins (IDP) or regions (IDR) [3,4].
Proteins are formed by long chains of amino acids which are connected via covalent peptide bonds [5,6]. Recent studies have analyzed the amino acids that make up proteins and have linked the sequence and composition to the protein’s function and shape. A protein’s amino acid composition (AAC) can be defined as the percentages of each amino acid in the sequence of that protein [7,8]. The AAC and sequence of amino acid residues are responsible for the native structure and functionality of a protein in an environment. Various factors can influence the AACs of proteins in organisms. The guanine-cytosine (G-C) content is the most important determinant of AAC [9,10]. The habitat of an organism also plays a key role in its AAC [11,12,13]. This is evident in adaptation of organisms to different habitats and landscapes [13]. Previous studies have proposed the use of relative AAC as a signature of a habitat or environment although the AAC is not highly determined by the environment as compared to the G-C content of the organism [13].
Various studies have successfully developed models that predict the structural class of proteins by using the AAC [14,15] since there is a strong correlation between AAC and structural class of a protein [16,17]. Sufficient knowledge about the structural class of proteins is a major boost for protein structure prediction, which in turn has significant medical and pharmaceutical implications [18]. Other studies have also suggested exploiting AAC for predicting protein-protein interactions [19,20]. Protein secondary structure can also be predicted using the protein’s AAC along with evolutionary information [21].
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a global health crisis and seems to be unending. The rate of mutation of the virus is increasing the challenges of finding suitable therapeutic options [22,23,24,25]. Virus mutation is a natural process that occurs when the virus replicates. Errors can occur in different parts of the virus’s genome during replication, resulting in changes to the genetic code of the virus. Some mutations can result in changes to the virus’s antigenicity [26,27,28], making it more or less recognizable to the immune system, while others can affect its transmissibility [29,30,31,32], virulence, or ability to evade host immune response [26,27,28]. Mutations can arise spontaneously [33,34,35] or through selective pressure including exposure to antiviral drugs (antiviral resistance) [36,37] or even host immune responses (antigenic drift) [38,39,40]. Some mutations can have a significant impact on the virus’s biology and behavior, while others have little or no effect. The RNA virus, SARS-CoV-2, uses RNA as its genetic material and RNA replication has been shown to be less accurate than DNA replication [41]. The low fidelity of the RNA-dependent RNA polymerase used by SARS-CoV-2 to replicate its genetic material [42] leads to replication errors and higher mutational loads responsible for the emergence of variants [43]. Most RNA viruses lack a proofreading domain thus the increased frequency of mutations [44,45,46,47]. However, the SARS-CoV-2 has an exoribonuclease (ExoN) found on the non-structural protein 14 (nsp14) which it uses for proofreading although it still has its limitations [45,48,49,50], evident in the very high mutation rate of the virus [51,52].
RNA recombination is another molecular mechanism the SARS-CoV-2 virus could use for mutation, as a high frequency of recombination events occur in coronaviruses [53]. A case of intra-host SARS-CoV-2 recombination during a coinfection by the Delta (AY.33) and Gamma (P.1) has been reported [54]. A total of six recombinant regions across the genome within the same sample were identified: four in the spike gene and two in the nucleocapsid gene [54]. Computational analysis also predicted the possibility of future recombination events between SARS-CoV-2 and the Middle East respiratory syndrome coronavirus (MERS-CoV) RNA [55]. The accumulation of multiple mutations over time can result in the emergence of new viral variants. Understanding these molecular mechanisms, pattern and rate at which the virus mutates will be a boost to producing therapeutic molecules that will target future variants.
The living world is full of patterns which may not be easily identified [56,57]. These patterns may follow mathematical principles such as the Hardy-Weinberg Law [58], Benford’s Law [59,60], among others. Proteins also have patterns in various regards, and these patterns may be guided by mathematical or statistical functions and principles [61]. These patterns could be greatly influenced by the relationships shared among the amino acids that make up the protein. Herein, we sought to investigate the relationship that exists among amino acid residues in the various SARS-CoV-2 proteomes. The correlations may provide insights into the mutual constrained conditions of amino acids in the various SARS-CoV-2 variants during evolution.

2. Materials and Methods

2.1. Retrieving Amino Acid Sequences

FASTA files containing the amino acid sequences of the various SARS-CoV-2 proteins were retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov accessed on 10 September 2022) using the keywords SARS-CoV-2, complete genome and the “Pango lineage” [62]. The sequences were retrieved using the Pango lineage nomenclature classification of the various SARS-CoV-2 variants [63] which were obtained from CoVariants (https://www.covariants.org/variants accessed on 10 September 2022) and the World Health Organization (WHO) Tracking SARS-CoV-2 variants (https://www.who.int/activities/tracking-SARS-CoV-2-variants accessed on 10 September 2022).

2.2. Data Pre-Processing

Based on Pango lineage, the amino acid sequences were split into separate files. All lines in each text or FASTA file that do not contain amino acid residues were removed using the “sed” command on linux terminal. Each file was then stripped of newlines and whitespace characters in order to combine all the sequences as one continuous string. The files were then read and saved into “DataFrames” based on their lineages using Pandas, an open source data analysis and manipulation tool for Python [64,65]. All strains that had non-standard amino acid residues in their proteome sequences were removed from the data.

2.3. Analysis

2.3.1. Determining Amino Acid Frequencies

For each strain (or sample), the frequencies and AACs of the standard amino acids were determined using the “counter” function in Python and saved into “Dataframes”. The average AACs per variant were also determined by calculating the percentages of each residue per variant and for the virus as a whole. Furthermore, the standard deviations were computed and stored. The “Dataframes” were then exported as Excel workbooks.

2.3.2. Correlation Analyses

Correlation analyses were performed on the SARS-CoV-2 variants and the amino acid residues to investigate their relationships based on their AACs. R studio was employed to generate the correlation plots for the amino acid residues after loading the Excel workbooks. The “Hmisc” package in R was used to generate the correlation coefficients with their significance levels (p-values) by using the “rcorr” function. The “corrplot” package was then used to plot the correlograms.

2.3.3. Test for Normality

The study further tested the normality of each amino acid composition’s distribution using the Shapiro-Wilk test, skewness, and kurtosis. Using the “shapiro” function in the “scipy” library, the test statistic and corresponding p-values were determined. The “skewtest” and “kurtosis” functions were employed to compute the skewness and kurtosis of the distribution, respectively. P-P plots were also generated to visualize the normality of the amino acid AAC distributions.

2.4. Validating Correlation Analysis

A multiple template alignment tool, WebPRANK was employed to align a maximum of 3 proteome sequences from each variant/lineage [66]. For variants that had fewer than 3 proteome files, all the available sequences were used. In all, a total of 28 sequences were aligned: three each from B.1.1.7, B.1.351, P.1, B.1.617.2, B.1.525, B.1.526, C.37, and BA.1, two from BA.4, and one each from B.1.177 and B.1.160. The samples are labelled as “A”, “B” and “C” to enable readability of the analysis. The output from the alignment was then compared to the AAC analyses.

3. Results and Discussion

Based on the search via the National Center for Biotechnology Information (NCBI) database, the data obtained were grouped into 12 variants (Table 1) comprising of:
  • B.1.1.7 (Alpha)
  • B.1.351 (Beta)
  • P.1 (Gamma)
  • B.1.617.2 (Delta) and B.1.617.1 (Kappa)
  • B.1.525 (Eta)
  • B.1.526 (Iota)
  • C.37 (Lambda)
  • BA.1 and BA.2 (both Omicron)
  • BA.4 (Omicron)
  • BA.5 (Omicron)
  • B.1.177
  • B.1.160.
All proteome sequences that had non-standard or ambiguous amino acid residues such as Xaa (unspecified or unknown amino acid, X), Glx (glutamine or glutamic acid, Z), Asx (asparagine or aspartic acid, B), and Xle (leucine or isoleucine, J) were removed from the dataset. All 23 proteome sequences of BA.5 had one or more of the non-standard amino acid residues thus, were not included in this study (Table 1). For B.1.1.7 (Alpha), 706 out of the 830 were used in this study. For the B.1.526 (Iota) variant, 773 out of the 3085 passed this criterion and were used in the study. Variants B.1.525, B.1.160, and B.1.177 did not have any of the ambiguous amino acids in any of the datasets. In all, a total of 1637 complete SARS-CoV-2 proteome sequences belonging to 11 variants/lineages were used for the analysis (Table 1).

3.1. The Range of Amino Acid Residues in the SARS-CoV-2 Samples

The number of amino acid residues in the SARS-CoV-2 proteomes used herein ranged from 13,925 to 14,153. Most B.1.1.7 samples were observed to either have a total amino acid residue count of 14,045 or 14,149 while all 4 samples of B.1.351 had a total residue count of 14,140. All the 13 B.1.525 samples had a total residue count of 14,138. For B.1.526, most of the samples had either 14,143 or 14,142 residues. Other B.1.526 samples had total amino acid residue counts of 14,045, 14,149, 14,138, 14,147, 14,147, 14,149 or 14,140. For B.1.617.2, two samples had a total residue count of 13,926, while most of the samples had either 14,149 or 14,145. Other B.1.617.2 samples had 14,026 or 14,139 and one sample had 14,140 residues. The BA.1 samples used herein had total residue counts ranging from 14,129 to 14,145, while the two samples of BA.4 had 14,129 and 14,135 residues. All eight C.37 samples had a total residue count of 14,136. Computing the complete amino acid count considering all 17 proteins in the SARS-CoV-2 proteome revealed a total of ~14,439 residues. The inconsistency in the total residue counts obtained herein may be due to deletions and insertions in various SARS-CoV-2 genome nucleotides, resulting in fewer or more amino acid residues in the proteins, respectively [67,68]. For instance, an 81 base-pair deletion was observed in a SARS-CoV-2 case in Arizona, resulting in a 27 amino acid deletion in the Open reading frame 7a (ORF7a) protein [67]. Several other deletions have been reported in the SARS-CoV-2 spike protein [38,69,70] and non-structural protein 1 (nsp1) [71,72,73]. Amino acid deletions in the other SARS-CoV-2 proteins have also been highlighted in literature [68,74]. In addition, some SARS-CoV-2 ORF proteins such as ORF3c (~41 amino acids) are translated during infection [75,76]. The sample data obtained from NCBI did not contain sequences of ORF9b (9b) [~97 amino acids], ORF9c (9c) [~73 amino acids], ORF3b (3b) [~22 amino acids], ORF3c (3c) [~41 amino acids], and ORF3d (3d) [~57 amino acids] thus, the lower number of amino acid residues. Without these 5 proteins, the amino acid count is ~14,149, consistent with those obtained herein.

3.2. Amino Acid Frequencies per Variant

The frequencies of the 20 standard amino acids in each of the 1637 complete SARS-CoV-2 proteomes were determined. For each of the 11 variants, the average frequencies of the amino acid residues were also determined (Table 2). The average amino acid frequencies of all SARS-CoV-2 variants along with their standard deviations were also computed using the mean frequencies of each variant (Figure 1A and Table S1). The variant averages were used in order to avoid biases owing to the uneven data sets of the variants (the uneven number of samples for each variant). It was observed that leucine had the highest mean frequency of 1365.131 (Table S1) in the SARS-CoV-2 proteomes, ranging from 1362 (BA.4) to 1370 (B.1.177) [Table 2]. Tryptophan was observed to have the least mean frequency of 156.895 (Table S1). Variants B.1.525, C.37, BA.1, BA.4, B.1.177, and B.1.160 were observed to have 157 tryptophans in their proteomes while B.1.1.7, B.1.351, P.1, B.1.617.2, and B.1.526 had, on average, 156.469, 156.75, 156.959, 156.721, and 156.951 tryptophans, respectively. On average, the B.1.617.2 variant was observed to have the most Met count of 312.512, while B.1.160 had the least (309) [Table 2].
The standard deviation provides a measure to determine the dispersion of the data from the mean [77]. The standard deviation of the amino acid frequency provides information on the extent of mutation by each residue in the various variants. On average, threonine was observed to have the highest standard deviation of 3.743 followed by isoleucine, glycine, proline, phenylalanine, lysine, and serine with 3.11, 2.467, 2.421, 2.357, 2.337, and 2.296, respectively (Table S1), signifying the relatively higher involvement of these residues in SARS-CoV-2 mutations. Tryptophan, cysteine and methionine on the other hand, were observed to have the least standard deviations of 0.175, 0.864, and 0.95, respectively (Table S1), implying that they are less involved in SARS-CoV-2 mutations.

3.3. Highest and Least Represented Amino Acids in the SARS-CoV-2 Proteomes

Due to the inconsistency in amino acid frequencies, the AAC for the various SARS-CoV-2 variants were determined (Figure 2 and Table 3). The AAC provides a more accurate metric to compare the amino acids in the various SARS-CoV-2 proteomes since it deals with the percentage of each residue in the proteome. The AAC caters for the inconsistencies in residue frequencies caused by insertion or deletion of genomic nucleotides. Proteome-wide analysis of the various SARS-CoV-2 variants showed that Leu was the most abundant amino acid with an average AAC of 9.658% while Trp was the least with an average AAC of 1.110% (Figure 1B and Table 3). Leucine is the most abundant amino acid in proteomes [78] and has been shown to be the most abundant in cyanobacterial proteomes [79] and even plant proteomes from 144 plant species [80]. Herein, valine, threonine, alanine and serine are the other highly abundant amino acids with average AACs of 8.149, 7.493, 6.831, and 6.739%, respectively (Figure 1B, Figure 2 and Table 3). In plant proteomes, Trp was also observed to be the least encoded [80]. Other low abundant SARS-CoV-2 amino acids in addition to Trp include His, Met, and Cys with average AAC values of 1.873, 2.205, and 3.070%, respectively (Table 3). Trp and Cys have previously been reported to be the least abundant in proteins found in Swissprot and TrEMBL databases [78]. A previous study showed that cysteine abundance positively correlates with the complexity of the organism although cysteine is underrepresented in all organisms [81]. Cysteine occurrence in Thermus aquatcus, Haloarcula maresmortui, Escherichia coli, Saccharomyes cerevisiae, Drosophila, and human proteins were 0.4, 0.5, 1.1, 1.3, 1.9, and 2.3%, respectively [81]. Although coronaviruses are structurally complex [82,83,84], the classification of viruses as organisms is highly debated. Investigating the cysteine occurrence in different viruses is necessary to determine if viral complexity (among viruses) also correlates with the cysteine content. On average, almost 55.61% of the SARS-CoV-2 proteome were observed to be non-polar amino acids while 44.39% were polar (Table 3).
The average AACs per variant were also determined in order to analyze the highest and lowest abundant amino acids for each variant. The AACs can help determine the relationship between variants since AACs are signatures for organisms and environments [13]. Variant B.1.617.2 had the most abundant Leu, Ala, Gly, and Met with average AACs of 9.688, 6.846, 5.951, and 2.215%, respectively. Variants B.1.617.2, C.37, and BA.4 had the highest Trp content in their proteomes with an AAC of 1.111%. SARS-CoV-2 variant C.37 was observed to have the most abundant Glu, Ser, Val, and Asn with AACs of 4.817, 6.768, 8.168, and 5.429%, respectively. Variant BA.4 was observed to have the highest abundant Lys (5.958%), His (1.889%), Cys (3.078%), Tyr (4.568%), Ile (5.215%), and Arg (3.418%). For residues Pro, Phe, and Gln, variants B.1.526, BA.1, and P.1 had the highest abundance with corresponding AACs of 3.94, 5.025, and 3.66% (Table 3). For Thr, variants B.1.177 and B.1.160 had the highest abundance with an AAC of 7.52%. Both B.1.177 and B.1.160 also had similar Ser, Pro, Arg, Asp, Cys, and Trp compositions of 6.75, 3.937, 3.4, 5.103, 3.067, and 1.11%, respectively. They both had the most abundance for Asp (5.103%). These could imply that both B.1.177 (EU1) and B.1.160 (EU2) share the same ancestry as confirmed by the phylogenetic relationships of Nextstrain SARS-CoV-2 clades, reporting that B.1.177 (EU1) is an offspring of EU2. The signature A222V mutation of the spike protein of B.1.177 (20E EU1) variant was identified in two genomes of B.1.160 (20A EU2) [85]. Moreover, both variants were first identified in Europe and there is no evidence of increased viral transmissibility between them [86].

3.4. Tryptophan Less Likely to Mutate (Intolerant to Mutation)

Comparing the averages of the various variants, it was observed that tryptophan maintained an average frequency of 156.895 (157) and an average AAC of 1.11% in the SARS-CoV-2 proteome. Variants B.1.1.7, B.1.351, P.1, B.1.617.2, and B.1.526 had tryptophan average frequencies of 156.469, 156.75, 156.959, 156.721, and 156.951, respectively, while the rest maintained the 157 average count in their proteomes (Table 2). Variants P.1, B.1.525, B.1.526, BA.1, B.1.177, and B.1.160 had ~1.110% of their proteomes to be tryptophan, B.1.617.2, C.37, and BA.4 had 1.111%, while B.1.1.7 and B.1.351 had 1.109% as tryptophan (Table 3). Tryptophan was observed to be relatively equal in all the SARS-CoV-2 variants analyzed herein. Previous studies have shown that tryptophan in different SARS-CoV-2 proteins are conserved and replacing them might make the proteins unstable [87,88]. This may be due to the fact that tryptophan is encoded by only one codon, UGG. The paucity of the UGG codon in the genome makes it difficult to allow for any mutations, which might result in highly unstable proteins in the proteome as a whole. A recent study has shown that mutating tryptophan in the neucleocapsid (N) protein makes the virus very unstable [87]. Tryptophan has been reported to play an important role in the structural stability of proteins [89,90].

3.5. Lysine and Glycine Counts and Ranking

Sorting the amino acid frequencies and AACs in descending order revealed that the positions of lysine and glycine varied in the proteomes of the SARS-CoV-2 variants. Glycine and lysine had mean values of 838.220 and 838.047 with standard deviations of 2.467 and 2.337, respectively, for all SARS-CoV-2 variants. For variants B.1.1.7, B.1.160, B.1.177, B.1.351, B.1.525, B.1.526, B.1.617.2, and P.1, glycine had higher counts (ranked 6th most abundant) than lysine (7th most abundant) and vice versa for variants BA.1, BA.4, and C.37. For variants BA.1, BA.4, and C.37, lysine was the 6th most abundant while glycine was the 7th. Variants BA.1 and BA.4 both belong to the omicron variant of the SARS-CoV-2 while C.37 is a lambda variant. Lambda has been reported to share a common ancestry with omicron in contrast with delta [91]. Omicron and lambda also share common mutations in the genomic regions of ORF1b, spike protein and ORF8 [92]. The difference in lysine and glycine counts in the various SARS-CoV-2 variants necessitates the investigation of their effects in viral transmission, replication, survival, and other mechanisms. Lysine mutation (K417N and E484K) in the spike protein has been shown to significantly influence Angiotensin-converting enzyme 2 (ACE2) binding [93,94], suggesting lysine’s critical role in viral entry. An in silico structure-based energy calculation suggests that mutations at Gly431, Gly648, and Gly35 may cause massive destabilizing effects on the full-length spike protein [95]. In addition, variants with the D614G mutation were observed to be more dominant and had higher transmissibility than those with Asp614 by enhancing ACE2 binding affinity [31,96].

3.6. Correlation Analysis

A correlation matrix was generated for the SARS-CoV-2 variants using their AACs (Figure 3 and Table S2). The correlation coefficients among the variants ranged between 0.999895 and 1, supporting the fact that all the SARS-CoV-2 variants share a high degree of closeness (Table S2). The pair with the least correlation involved BA.4 and B.1.177 with a coefficient of 0.999896 while variants B.1.525 and B.1.526 had the highest correlation (0.999992) [Table S2]. B.1.525 (Eta) and B.1.526 (Iota) variants were both first discovered in New York in November 2020 and have been previously shown to have similar spike mutations including E484K and D614G [97]. B.1.351 (Beta, 20H) and B.1.526 (Iota, 21F) variants also demonstrated a very strong correlation of 0.99999, supporting existing studies that show that they both share a common ancestry (20C). Herein, the phylogenetic relationship of SARS-CoV-2 clades generated by Nextstrain [98] was used (available at https://github.com/nextstrain/ncov-clades-schema accessed on 12 December 2022) [Figure S2]. BA.4 [Omicron (22A)] and B.1.177 [20E (EU1)] were observed to have the least correlation of 0.999896. BA.4 was originally detected in South Africa [99,100] while B.1.177 was identified in Spain [86]. Furthermore, the phylogenetic tree by Nextstrain shows that BA.4 and B.1.177 are highly unrelated among the SARS-CoV-2 variants [Figure S2].
B.1.177 had relatively lower correlations with C.37 and BA.1 with coefficients of approximately 0.999915 and 0.999944, respectively (Table S2). Variants C.37, BA.1, and BA.4 were also observed to be the least correlated to the other variants (Figure 3). Variants B.1.1.7 and P.1 were observed to be highly correlated with a correlation coefficient of 0.999991. B.1.1.7 is the alpha variant [V1 (20I)], which shares a common ancestry [B.1.1 (20B)] with P.1 [gamma variant, V3 (20J)].
The B.1.526 variant was closest to the mean variant with the highest correlation coefficient of 0.999997 (Table S2). This result is consistent with the determined amino acid frequencies and compositions (Table 2 and Table 3). The AACs of B.1.526 and the mean were very similar while all other variants seem to deviate from the mean. For residues serine, proline and tryptophan, B.1.526 had the same AACs of 6.739, 3.934, and 1.110%, respectively, with the mean. Other residues including Met, Glu, Leu, Val, Phe, Asn, Lys, Arg, Asp, Ala, Cys and Tyr had very similar AACs for both B.1.526 and the mean. However, for proline and threonine the difference in AAC between B.1.526 and the mean were equal to or greater than 0.01%. For proline, B.1.526 and the mean had AACs of 3.934 and 3.924%, respectively, while 7.489 and 7.493%, respectively, were observed for Thr. B.1526 can be suitable as a reference for mutation studies, since it is closest to the mean and all the other variants deviate from the mean. Variants B.1.525, B.1.1.7, and B.1.351 also had very high correlation with the mean with coefficients of 0.999994, 0.999993, and 0.999991, respectively (Table S2). BA.4 was observed to be the least correlated to the mean with a coefficient of 0.999958 (Table S2).
C.37 had correlation coefficients of 0.999941 and 0.999942 with B.1.1.7 and P.1, respectively (Table S2). This is not surprising as C.37 has been shown to be an offspring of B.1.1.1, which in turn is an offspring of B.1.1. Alpha variant (B.1.1.7) and gamma variant (P.1) are also direct offspring of B.1.1. Surprisingly, the AAC of C.37 (lambda) had high correlation with those of B.1.526, B.1.351 and BA.1 with coefficients of 0.999964, 0.999956, and 0.99995, respectively (Table S2).
A correlation plot of the 20 standard amino acids was also generated to investigate the relationship among the residues in all the SARS-CoV-2 proteomes (Figure 4 and Table S3). Correlation coefficients closer to ±1 signify strong correlation [101]. Thus, values greater than 0.7 or less than −0.7 were considered in this study. A previous study that investigated the correlation among amino acids in 204 proteins argued that a correlation coefficient of 0.5 could be considered as strongly correlated since data on structural classes of proteins are a fuzzy set [102]. Herein, the p-values were also determined to ascertain whether the correlations are statistically significant. At p < 0.001, only four correlation relationships [Cys-Gly, Ile-Thr, Asn-Phe (negative correlations) and Cys-Ile (positive correlation)] were observed (Figure 4B). At p < 0.01, a total of 13 relationships existed among the amino acid residues (Figure 4C). Residue pairs Gly-Thr, Ser-Asn, Lys-Trp, Glu-Val, and Cys-Ile were strongly correlated at p < 0.01 while Cys-Leu, Cys-Gly, Cys-Thr, Ile-Leu, Ile-Gly, Ile-Thr, Asn-Phe, and His-Gln were negatively correlated (Figure 4C). At p < 0.05, a total of 32 correlation relationships were observed among the amino acid residues (Figure 4D).
Threonine had the strongest negative correlation with isoleucine (−0.912) [Figure 4 and Table S3] implying that any mutation that produces a threonine will most likely result in an isoleucine mutating to another amino acid across the proteome of SARS-CoV-2. Although this does not imply that threonine necessarily mutates to isoleucine (or vice versa), several threonine to isoleucine substitutions have been reported in various SARS-CoV-2 proteins [103]. The threonine to isoleucine mutations have been shown in silico to modify the hydrophobicity and structure [103].
In addition, isoleucine had a very strong negative correlation with leucine (−0.818) and glycine (−0.720) while sharing a very strong positive correlation with cysteine (0.835) [Table S3]. Similar trends were observed for cysteine, sharing very strong negative correlations with Leu (−0.779), Thr (−0.784), and Gly (−0.897) and a strong positive correlation with Val (0.704), all at p < 0.05 (Table S3). Glycine and threonine were also observed to be strongly correlated with a coefficient of 0.726 (Table S3). Other residue pairs worth mentioning include Val-Glu (0.741), Asn-Ser (0.714), and Thr-Gly (0.726) which had strong correlations while Phe-Asn (−0.853) and His-Gln (−0.78) had strong negative correlations (Table S3).

3.7. Test for Normality Using Shapiro and Predicting Probability Using Z-Score

Due to the small variant size (11), the Shapiro-Wilk test was used to determine the normality of the distributions of the various AACs. Normality of the distributions will enable the use of z-scores in predicting the AACs (and possibly frequencies) in future variants or lineages. The skewness and kurtosis of the amino acid residue distributions were also determined (Table 4). All the amino acid residues except methionine had p-values greater than 0.05 (Table 4). Met was observed to depart significantly from normality (W = 0.728, p-value < 0.05) [Table 4]. However, at p < 0.001, Met does not show evidence of non-normality. Met also had the greatest skewness and kurtosis with values of 2.860 and 2.968, respectively. Serine had the most normal distribution (W = 0.973, p-value = 0.922) followed by histidine (W = 0.968, p-value = 0.865) and glutamine (W = 0.947, p-value = 0.610) [Table 4]. Valine (W = 0.922, p-value = 0.336) and lysine (W = 0.938, p-value = 0.496) had the least skewness of 0.040 and 0.050, respectively (Table 4).
Normal probability plots were also generated to analyze how the AAC data deviate from a theoretical distribution (Figure 5 and Figure S3). The probability plot, similar to the quantile-quantile (Q-Q) plot, shows the distribution of the data in comparison to the expected normal distribution. The data is expected to lie along a straight line provided the data is normally distributed. However, if the data is non-normal, a curve is observed, deviating from the straight line. Outliers are the data points that can be seen distant from majority of the points or further away from the straight line. The P-P plot confirmed serine’s normality (Figure 5A) as well as the other amino acids (Figure S3) except methionine. Methionine was observed not to be normally distributed corroborating the results from the Shapiro-Wilk test (Figure 5B). However, with a larger data size, methionine may obey the normality tests.
Generally, the normality of the amino acid distribution allows the prediction of AACs of SARS-CoV-2 variants using probability and z-scores. The difference in amino acid composition (variant AAC − mean AAC) (Figure 6A) and standard scores (z-scores) [Figure 6B] were determined with regards to the amino acid residues for the various SARS-CoV-2 variants. The difference between the variant AACs and the average AAC ranged between −0.04 and 0.042 (Figure 6A and Table S4). The highest difference was demonstrated by the isoleucine of variant BA.4, having 0.042% more than the average isoleucine composition. The z-score can be calculated by using Equation (1).
z-score = (variant AAC − mean AAC)/(std of the AAC),
where “std of the AAC” is the standard deviation of a specific amino acid composition (Table S5).
The z-scores of the 20 AACs for all the 11 variants ranged between −2.7 and 2.1 (Figure 6B) representing 0.3 to 98 % chance. The z-score can help predict the most likely combinations of amino acids in emerging variants. For example, from the above equation, there is a 99.29% chance of having a variant with 1.112% Trp since z-score will be 2.448 (~2.45). In addition, there will be a 0.59% chance of having a SARS-CoV-2 variant with 1.108% tryptophan composition since the z-score will be equal to −2.523 (~−2.52). This is not surprising as previous studies have shown that mutating Trp in the SARS-CoV-2 makes the virus highly unstable [87,88]. Moreover, 1.12% is closer to the average AAC (1.11) as compared to 1.08%. The AACs can be beneficial in predicting future viral proteomes which can help in drug discovery, vaccine and protein-protein interaction studies.

3.8. Validation of Correlation Analysis

A multiple template alignment of 28 proteome sequences from the 11 variants/lineages used herein were performed and analyzed using webPRANK [66]. The variants which had more than two or three samples in the alignment are labelled as “A”, “B” or “C” in addition to the variant/lineage name to differentiate among them. The lengths of the 28 sequences from the 11 variants/lineages ranged from 14,026 to 14,149 residues while the sequence matrix length was 14,155 columns. The ORF1ab spanned from position 1 to 7096, while ORF1a was from 7097 to 11,501. The spike protein followed to position 12,780 and the ORF3a was from 12,781 to 13,055. The envelope protein’s alignment ended at 13,130, followed by the membrane protein which ended at 13,352, and then ORF6 from 13,353 to 13,413. The alignments of ORF7a and ORF7b ended at positions 13,534 and 13,577, respectively. The ORF8 spanned from 13,578 to 13,698 while the nucleoprotein ended at position 14,117. The last protein on the alignment was ORF10 which ended at position 14,155.
The alignment revealed various deletions across the SARS-CoV-2 proteomes. For the ORF1ab, 82GHV84 (BA.1A), 141KSF143 (BA.4B and B.1.617.2B), 2083S (BA.1B and BA.1C) deletions were observed. The 3675SGF3677 was also deleted in all the samples except for B.1.160, B.1.617.2A, B.1.617.2B, B.1.617.2C, BA.1A, B.1.526A, B.1.177, and B.1.1.7B. Similar deletions were observed in the ORF1a region of the same samples. For the spike region, 24LPP (BA.4A and BA.4B), 69HV (B.1.1.7A, B.1.1.7C, B.1.525A, B.1.525B, B.1.525C, BA.1B, BA.1C, BA.4A, BA.4B, and P.1A), 143VYY (BA.1B and BA.1C), 144Y (P.1A, B.1.1.7A and B.1.1.7C), 145Y (B.1.525A, B and C), 156EF (BA.1A, B.1.617.2A, B.1.617.2B and B.1.617.2C), 217NLVR (BA.1B and BA.1C), 247LLA (B.1.525A, B and C), 252RSYLTPG (C.37A, B and C), and 681QTQTN (B.1.617.2A) deletions were observed. The 211IVREPE was only present in samples BA.1B and BA.1C instead of 217NLVR. 211IVREPE insertion has been previously identified in the BA.1 variant [104]. There was also a 2F deletion in all three samples of the B.1.525 variants in the ORF6 region.
The B.1.617.2C is an ORF8-deleted sample. ORF8-deleted SARS-CoV-2 variants have been shown clinically to have higher transmissibility with milder disease outcomes [105,106,107,108,109]. The ORF8 has been suggested to be involved in pathogenesis and not viral genome replication [110]. Samples B.1.1.7A, B.1.1.7C, and P.1A were also observed to have partial deletions of the ORF8, from 27Q till the end of the ORF8. Deletions affecting the length of the ORF8 is not strange as they have been previously identified [109]. The 119DF120 deletion which has been mentioned in an earlier study [111], was also observed for B.1.617.2A, B.1.617.2B, and BA.1A samples. For the N protein, 3D (B.1.525A, B and C) and 31ERS33 (BA.1B, BA.1C, BA.4A and BA.4B) deletions were observed.
From the alignment of the 28 samples, Trp was observed to be involved in only one mutation (W131L of ORF3a protein) for B.1.351A corroborating the AAC results which suggest that Trp is less likely to mutate (Table 3). Aside W131L, Trp131 has also been reported to mutate to other forms including W131S, W131R, and W131V [112,113]. Herein, all the other 27 samples used for the alignment demonstrated no Trp mutations.
A total of 74 mutations involving Thr or Ile were recorded, with 49 of these being direct Thr-to-Ile or Ile-to-Thr mutations, making up 66.22% of the total instances. Interestingly, there were other indirect mutations involving Thr and Ile. Aside Thr, Ile was observed to only mutate to/from three residues comprising of Met, Val, and Leu. Across the complete proteome, isoleucine was observed to share 5, 4, and 2 mutations with Val, Met, and Leu, respectively. Threonine was also observed to share 6, 4, 3, and 1 mutation(s) with Ala, Asn, Lys, and Arg, respectively. From the earlier correlogram, Thr was predicted to have a negative correlation with Ala and Lys at p < 0.05 (Figure 4D), consistent with the alignment results. For the ORF1ab protein, 20 out of 27 (74.07%) mutation instances were Thr-to-Ile or Ile-to-Thr. For the envelope, ORF7a, ORF7b and ORF8 proteins, only one mutation each was observed involving Thr or Ile and they were directly linked. For the N protein, 3 out of 5 mutations (60%) involving Thr or Ile were directly from Thr to Ile.
Cys was observed to mutate to and from Gly in all samples except for BA.1C, BA.4A and BA.4B. In the N protein, G214C (all three C.37 samples) and G215C (BA.1A and B.1.617.2A) mutations were present while C316G (BA.1B) mutation was observed in the ORF1a. R5716C in the ORF1ab (same as R392C in nsp13) [114,115] was observed for BA.4A and BA.4B. In addition, C655Y is observed in both ORF1ab and ORF1a for BA.1C. The results from the alignment show that there is a 66% chance that a Cys will be formed by a mutation from a Gly, as only one out of the three instances Cys was formed by another residue (Arg). A larger data set is required to confirm these findings. However, the same cannot be said for Gly since Gly was observed to also mutate to/from Asp, Ser, Arg, Asn, Val, and Ala. The correlogram also predicted Gln and His to be negatively correlated. From the alignment analysis, Gln and His shared 7 mutations: two in ORF1ab (Q676H and Q5412H), one in ORF1a (Q676H), two in spike (Q683H and Q960H), one in ORF3a (Q57H) and one in the N protein (Q9H). Gln shared 5, 3, 1, and 1 mutation(s) with Arg, Leu, Glu, and Lys, respectively. His was also observed to share 6, 3, and 2 mutations with Tyr, Pro, and Asp, respectively. Although the correlation analysis predicted a strong negative correlation between Asn and Phe (Figure 4), there were no direct mutations between the two residues. They could probably be indirectly related.

3.9. Limitations of the Study

Similar to all other research, this study also has its shortcomings and limitations. The relatively small sample size used herein is a major shortcoming of this study since they may not accurately reflect the overall frequency of each amino acid in the proteomes although the use of AACs could help mitigate this challenge. Furthermore, the limited number of variants used (11) does not provide complete analysis of all SARS-CoV-2 variants. The uneven sample sizes of the variants could also influence the overall accuracy of the results reported herein. Nonetheless, this study reports correlations among the amino acid residues that can be further investigated on a larger sample size to corroborate these findings. Robust and more accurate strategies/methods for validating the correlations must be developed in order to identify or measure indirectly linked residues.
Furthermore, the study did not investigate the correlation existing among the amino acids taking epistasis into consideration. A mutation in one protein can influence another mutation in a different protein of the same organism since proteins often interact with each other to perform specific functions, and changes in one protein can impact the behavior or function of others. Efforts are ongoing to elucidate epistasis in the SARS-CoV-2 genome as this phenomenon is still in its nascent stages for this virus [116,117]. A recent study showed that several pairwise epistases exist among eight viral genes [116]. The impact of one mutation on another can be complex and may depend on a variety of factors, including the specific proteins involved, the context of the mutations, and the environment. Understanding these interactions is important for understanding the evolution of viruses and for developing effective strategies to control the spread of infectious diseases. This molecular mechanism also contributes to the evolution of SARS-CoV-2 and the emergence of new viral variants [116]. Understanding these SARS-CoV-2 mutation mechanisms is important for developing effective strategies to control the spread of COVID-19. Future studies could evaluate the correlations existing among the amino acid residues taking into consideration epistatically linked pairs.

3.10. Potential Implications of the Study and Future Perspective

This study showed that relationships exist among the amino acid residues in the SARS-CoV-2 proteome. These relationships in addition to knowledge from molecular mechanisms on viral mutations and environmental factors, can help in designing lab-made mutants for mutational studies. Previous studies have successfully predicted structural classes of proteins using AAC-based models [14,15]. Although predicting viral mutations are complex due to the many factors involved, optimizing machine-learning based models which use up-to-date resources can improve the success rate, which will in turn help in drug and vaccine research and design. The AACs along with evolutionary information can be exploited to predict protein secondary structures [21], and later for protein synthesis, study protein-protein and protein-drug interactions [19,20]. For diagnostics purposes, the AACs can help classify strains or variants since each variant has specific AACs for the amino acid residues.

4. Conclusions

This study showed that correlation relationships exist among the 20 standard amino acids in the SARS-CoV-2 proteome. Herein, the frequencies of the amino acid residues in 1673 complete samples belonging to 11 SARS-CoV-2 variants were carefully computed and analyzed. The AACs revealed that leucine and tryptophan were the most and least abundant, respectively. Correlation analysis also showed that cysteine and isoleucine have very strong negative correlations with leucine, glycine, and threonine at p < 0.01. Threonine and isoleucine were observed to have the strongest negative correlation of −0.912 at p < 0.001. At p < 0.001, isoleucine and cysteine had a very strong positive correlation of 0.835. Special attention must be given to these amino acids when predicting mutations in the SARS-CoV-2. In addition, we show that the AAC is able to predict the ancestry and relationships among SARS-CoV-2 variants. The AAC analysis and correlations reported herein will aid in predicting and classifying viral strains. The AACs of the variants were also observed to be normally distributed which can be leveraged to predict future proteome contents using probability and z-scores.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedicines11020512/s1, Figure S1: Bar plots showing the average frequencies of the amino acid residues belonging to SARS-CoV-2 variants (A) B.1.1.7, (B) B.1.351, (C) P.1, (D) B.1.617.2, (E) B.1.525, (F) B.1.526, (G) C.37, (H) BA.1, (I) BA.4, (J) B.1.177, and (K) B.1.160; Figure S2: Phylogenetic relationships among SARS-CoV-2 clades as determined by Nextstrain; Figure S3: Probability plots to determine the normality for amino acid residues (A) Alanine, (B) Cysteine, (C) Aspartic acid, (D) Glutamic acid, (E) Phenylalanine, (F) Glycine, (G) Histidine, (H) Isoleucine, (I) Lysine, (J) Leucine, (K) Asparagine, (L) Proline, (M) Glutamine, (N) Arginine, (O) Threonine, (P) Valine, (Q) Tryptophan, and (R) Tyrosine; Table S1: Average and standard deviation of each amino acid for all the SARS-CoV-2 variants used in this study; Table S2: Correlation among the SARS-CoV-2 variants using their amino acid compositions; Table S3: Correlation matrix among the 20 standard amino acids using their compositions; Table S4: Calculated z-scores of the amino acid residues belonging to the 11 SARS-CoV-2 variants using their compositions. Z-score values are presented in 3 decimal places; Table S5: Average and standard deviation of the amino acid compositions.

Author Contributions

W.A.M.III and E.B. conceptualized and carried out the project. The manuscript was written through contributions of both authors. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to the editor and the anonymous reviewers for their comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chai, T.; Tian, M.; Yang, X.; Qiu, Z.; Lin, X.; Chen, L. Association of Circulating Cathepsin B Levels With Blood Pressure and Aortic Dilation. Front. Cardiovasc. Med. 2022, 9, 762468. [Google Scholar] [CrossRef] [PubMed]
  2. Sternke, M.; Tripp, K.W.; Barrick, D. Consensus sequence design as a general strategy to create hyperstable, biologically active proteins. Proc. Natl. Acad. Sci. USA 2019, 116, 11275–11284. [Google Scholar] [CrossRef] [PubMed]
  3. Lermyte, F. Roles, Characteristics, and Analysis of Intrinsically Disordered Proteins: A Minireview. Life 2020, 10, 320. [Google Scholar] [CrossRef]
  4. Owen, I.; Shewmaker, F. The Role of Post-Translational Modifications in the Phase Transitions of Intrinsically Disordered Proteins. Int. J. Mol. Sci. 2019, 20, 5501. [Google Scholar] [CrossRef] [PubMed]
  5. Kumar, R.; Dhanda, S.K. Bird Eye View of Protein Subcellular Localization Prediction. Life 2020, 10, 347. [Google Scholar] [CrossRef]
  6. Arapoglou, D.; Labropoulos, A.; Varzakas, T. Enzymes Applied in Food Technology. In Advances in Food Biochemistry; CRC Press: Boca Raton, FL, USA, 2009; pp. 101–129. ISBN 9781420007695. [Google Scholar]
  7. Saidijam, M.; Azizpour, S.; Patching, S.G. Comprehensive analysis of the numbers, lengths and amino acid compositions of transmembrane helices in prokaryotic, eukaryotic and viral integral membrane proteins of high-resolution structure. J. Biomol. Struct. Dyn. 2018, 36, 443–464. [Google Scholar] [CrossRef]
  8. Bogatyreva, N.S.; Finkelstein, A.V.; Galzitskaya, O.V. Trend of Ino Acid Composition of Proteins of Different Taxa. J. Bioinform. Comput. Biol. 2006, 4, 597–608. [Google Scholar] [CrossRef]
  9. Lobry, J.R. Influence of genomic G + C content on average amino-acid composition of proteins from 59 bacterial species. Gene 1997, 205, 309–316. [Google Scholar] [CrossRef]
  10. Bharanidharan, D.; Ramya Bhargavi, G.; Uthanumallian, K.; Gautham, N. Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun. 2004, 315, 1097–1103. [Google Scholar] [CrossRef]
  11. Wang, G.-Z.; Lercher, M. Biased amino acid composition in warm-blooded animals. Nat. Preced. 2009. [Google Scholar] [CrossRef]
  12. Cabello-Yeves, P.J.; Rodriguez-Valera, F. Marine-freshwater prokaryotic transitions require extensive changes in the predicted proteome. Microbiome 2019, 7, 117. [Google Scholar] [CrossRef] [PubMed]
  13. Moura, A.; Savageau, M.A.; Alves, R. Relative Amino Acid Composition Signatures of Organisms and Environments. PLoS ONE 2013, 8, e77319. [Google Scholar] [CrossRef] [PubMed]
  14. Tang, L.; Wang, J. Predicting enzyme class with Rough Sets. In Proceedings of the 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, China, 15–17 October 2016; pp. 1832–1836. [Google Scholar]
  15. Bao, W.; Chen, Y.; Wang, D. Prediction of protein structure classes with flexible neural tree. Biomed. Mater. Eng. 2014, 24, 3797–3806. [Google Scholar] [CrossRef] [PubMed]
  16. Taguchi, Y.; Gromiha, M.M. Application of amino acid occurrence for discriminating different folding types of globular proteins. BMC Bioinform. 2007, 8, 404. [Google Scholar] [CrossRef] [PubMed]
  17. Ofran, Y.; Margalit, H. Proteins of the same fold and unrelated sequences have similar amino acid composition. Proteins Struct. Funct. Bioinform. 2006, 64, 275–279. [Google Scholar] [CrossRef] [PubMed]
  18. Zhu, L.; Davari, M.D.; Li, W. Recent Advances in the Prediction of Protein Structural Classes: Feature Descriptors and Machine Learning Algorithms. Crystals 2021, 11, 324. [Google Scholar] [CrossRef]
  19. Roy, S.; Martinez, D.; Platero, H.; Lane, T.; Werner-Washburne, M. Exploiting Amino Acid Composition for Predicting Protein-Protein Interactions. PLoS ONE 2009, 4, e7813. [Google Scholar] [CrossRef]
  20. Higa, R.H.; Tozzi, C.L. A simple and efficient method for predicting protein-protein interaction sites. Genet. Mol. Res. 2008, 7, 898–909. [Google Scholar] [CrossRef]
  21. Lee, S.; Lee, B.; Kim, D. Prediction of protein secondary structure content using amino acid composition and evolutionary information. Proteins Struct. Funct. Bioinform. 2005, 62, 1107–1114. [Google Scholar] [CrossRef]
  22. Majumdar, P.; Niyogi, S. SARS-CoV-2 mutations: The biological trackway towards viral fitness. Epidemiol. Infect. 2021, 149, e110. [Google Scholar] [CrossRef]
  23. Malik, J.A.; Ahmed, S.; Mir, A.; Shinde, M.; Bender, O.; Alshammari, F.; Ansari, M.; Anwar, S. The SARS-CoV-2 mutations versus vaccine effectiveness: New opportunities to new challenges. J. Infect. Public Health 2022, 15, 228–240. [Google Scholar] [CrossRef] [PubMed]
  24. Cosar, B.; Karagulleoglu, Z.Y.; Unal, S.; Ince, A.T.; Uncuoglu, D.B.; Tuncer, G.; Kilinc, B.R.; Ozkan, Y.E.; Ozkoc, H.C.; Demir, I.N.; et al. SARS-CoV-2 Mutations and their Viral Variants. Cytokine Growth Factor Rev. 2022, 63, 10–22. [Google Scholar] [CrossRef] [PubMed]
  25. Mercatelli, D.; Giorgi, F.M. Geographic and Genomic Distribution of SARS-CoV-2 Mutations. Front. Microbiol. 2020, 11, 1800. [Google Scholar] [CrossRef] [PubMed]
  26. Hossain, M.G.; Mahmud, M.M.; Nazir, K.H.M.N.H.; Ueda, K. PreS1 Mutations Alter the Large HBsAg Antigenicity of a Hepatitis B Virus Strain Isolated in Bangladesh. Int. J. Mol. Sci. 2020, 21, 546. [Google Scholar] [CrossRef]
  27. Zhou, W.; Xu, C.; Wang, P.; Anashkina, A.A.; Jiang, Q. Impact of mutations in SARS-COV-2 spike on viral infectivity and antigenicity. Brief. Bioinform. 2022, 23, bbab375. [Google Scholar] [CrossRef]
  28. Li, Q.; Wu, J.; Nie, J.; Zhang, L.; Hao, H.; Liu, S.; Zhao, C.; Zhang, Q.; Liu, H.; Nie, L.; et al. The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity. Cell 2020, 182, 1284–1294.e9. [Google Scholar] [CrossRef]
  29. Zhang, X.; Li, Y.; Jin, S.; Zhang, Y.; Sun, L.; Hu, X.; Zhao, M.; Li, F.; Wang, T.; Sun, W.; et al. PB1 S524G mutation of wild bird-origin H3N8 influenza A virus enhances virulence and fitness for transmission in mammals. Emerg. Microbes Infect. 2021, 10, 1038–1051. [Google Scholar] [CrossRef]
  30. Mair, C.M.; Ludwig, K.; Herrmann, A.; Sieben, C. Receptor binding and pH stability—How influenza A virus hemagglutinin affects host-specific virus infection. Biochim. Biophys. Acta Biomembr. 2014, 1838, 1153–1168. [Google Scholar] [CrossRef]
  31. Ozono, S.; Zhang, Y.; Ode, H.; Sano, K.; Tan, T.S.; Imai, K.; Miyoshi, K.; Kishigami, S.; Ueno, T.; Iwatani, Y.; et al. SARS-CoV-2 D614G spike mutation increases entry efficiency with enhanced ACE2-binding affinity. Nat. Commun. 2021, 12, 848. [Google Scholar] [CrossRef]
  32. Yan, D.; Wang, B.; Shi, Y.; Ni, X.; Wu, X.; Li, X.; Liu, X.; Wang, H.; Su, X.; Teng, Q.; et al. A Single Mutation at Position 120 in the Envelope Protein Attenuates Tembusu Virus in Ducks. Viruses 2022, 14, 447. [Google Scholar] [CrossRef]
  33. Li, J.; Du, P.; Yang, L.; Zhang, J.; Song, C.; Chen, D.; Song, Y.; Ding, N.; Hua, M.; Han, K.; et al. Two-step fitness selection for intra-host variations in SARS-CoV-2. Cell Rep. 2022, 38, 110205. [Google Scholar] [CrossRef] [PubMed]
  34. Dodds, W.J. Coronavirus SARS-CoV-2 (COVID-19) and Companion Animal Pets. J. Immunol. Allergy 2020, 1, 1–3. [Google Scholar] [CrossRef] [PubMed]
  35. Laine, P.; Nihtilä, H.; Mustanoja, E.; Lyyski, A.; Ylinen, A.; Hurme, J.; Paulin, L.; Jokiranta, S.; Auvinen, P.; Meri, T. SARS-CoV-2 variant with mutations in N gene affecting detection by widely used PCR primers. J. Med. Virol. 2022, 94, 1227–1231. [Google Scholar] [CrossRef] [PubMed]
  36. Correia, V.; Abecasis, A.B.; Rebelo-de-Andrade, H. Molecular footprints of selective pressure in the neuraminidase gene of currently circulating human influenza subtypes and lineages. Virology 2018, 522, 122–130. [Google Scholar] [CrossRef] [PubMed]
  37. Prachanronarong, K.L.; Canale, A.S.; Liu, P.; Somasundaran, M.; Hou, S.; Poh, Y.-P.; Han, T.; Zhu, Q.; Renzette, N.; Zeldovich, K.B.; et al. Mutations in Influenza a Virus Neuraminidase and Hemagglutinin Confer Resistance against a Broadly Neutralizing Hemagglutinin Stem Antibody. J. Virol. 2019, 93, e01639-18. [Google Scholar] [CrossRef]
  38. Lau, S.-Y.; Wang, P.; Mok, B.W.-Y.; Zhang, A.J.; Chu, H.; Lee, A.C.-Y.; Deng, S.; Chen, P.; Chan, K.-H.; Song, W.; et al. Attenuated SARS-CoV-2 variants with deletions at the S1/S2 junction. Emerg. Microbes Infect. 2020, 9, 837–842. [Google Scholar] [CrossRef]
  39. Ning, T.; Nie, J.; Huang, W.; Li, C.; Li, X.; Liu, Q.; Zhao, H.; Wang, Y. Antigenic drift of influenza A(H7N9) virus hemagglutinin. J. Infect. Dis. 2019, 219, 19–25. [Google Scholar] [CrossRef]
  40. Yewdell, J.W. Antigenic drift: Understanding COVID-19. Immunity 2021, 54, 2681–2687. [Google Scholar] [CrossRef]
  41. Carey, L.B. RNA polymerase errors cause splicing defects and can be regulated by differential expression of RNA polymerase subunits. eLife 2015, 4, e09945. [Google Scholar] [CrossRef]
  42. Fisun, A.Y.; Cherkashin, D.V.; Tyrenko, V.V.; Zhdanov, C.V.; Kozlov, C.V. Role of renin-angiotensin-aldosterone system in the interaction with coronavirus SARS-CoV-2 and in the development of strategies for prevention and treatment of new coronavirus infection (COVID-19). Arterial’naya Gipertenz. Arter. Hypertens. 2020, 26, 248–262. [Google Scholar] [CrossRef]
  43. Yin, X.; Popa, H.; Stapon, A.; Bouda, E.; Garcia-Diaz, M. Fidelity of Ribonucleotide Incorporation by the SARS-CoV-2 Replication Complex. J. Mol. Biol. 2023, 435, 167973. [Google Scholar] [CrossRef] [PubMed]
  44. Collins, N.D.; Beck, A.S.; Widen, S.G.; Wood, T.G.; Higgs, S.; Barrett, A.D.T. Structural and nonstructural genes contribute to the genetic diversity of RNA viruses. MBio 2018, 9, e01871-18. [Google Scholar] [CrossRef] [PubMed]
  45. Cruz-González, A.; Muñoz-Velasco, I.; Cottom-Salas, W.; Becerra, A.; Campillo-Balderas, J.A.; Hernández-Morales, R.; Vázquez-Salazar, A.; Jácome, R.; Lazcano, A. Structural analysis of viral ExoN domains reveals polyphyletic hijacking events. PLoS ONE 2021, 16, e0246981. [Google Scholar] [CrossRef] [PubMed]
  46. Villa, T.G.; Abril, A.G.; Sánchez, S.; de Miguel, T.; Sánchez-Pérez, A. Animal and human RNA viruses: Genetic variability and ability to overcome vaccines. Arch. Microbiol. 2021, 203, 443–464. [Google Scholar] [CrossRef]
  47. Mattenberger, F.; Vila-Nistal, M.; Geller, R. Increased RNA virus population diversity improves adaptability. Sci. Rep. 2021, 11, 6824. [Google Scholar] [CrossRef]
  48. Bouvet, M.; Imbert, I.; Subissi, L.; Gluais, L.; Canard, B.; Decroly, E. RNA 3′-end mismatch excision by the severe acute respiratory syndrome coronavirus nonstructural protein nsp10/nsp14 exoribonuclease complex. Proc. Natl. Acad. Sci. USA 2012, 109, 9372–9377. [Google Scholar] [CrossRef] [PubMed]
  49. Robson, F.; Khan, K.S.; Le, T.K.; Paris, C.; Demirbag, S.; Barfuss, P.; Rocchi, P.; Ng, W.-L. Coronavirus RNA Proofreading: Molecular Basis and Therapeutic Targeting. Mol. Cell 2020, 79, 710–727. [Google Scholar] [CrossRef]
  50. Moeller, N.H.; Shi, K.; Demir, Ö.; Belica, C.; Banerjee, S.; Yin, L.; Durfee, C.; Amaro, R.E.; Aihara, H. Structure and dynamics of SARS-CoV-2 proofreading exoribonuclease ExoN. Proc. Natl. Acad. Sci. USA 2022, 119, e2106379119. [Google Scholar] [CrossRef]
  51. Amicone, M.; Borges, V.; Alves, M.J.; Isidro, J.; Zé-Zé, L.; Duarte, S.; Vieira, L.; Guiomar, R.; Gomes, J.P.; Gordo, I. Mutation rate of SARS-CoV-2 and emergence of mutators during experimental evolution. Evol. Med. Public Health 2022, 10, 142–155. [Google Scholar] [CrossRef]
  52. Koyama, T.; Platt, D.; Parida, L. Variant analysis of SARS-CoV-2 genomes. Bull. World Health Organ. 2020, 98, 495–504. [Google Scholar] [CrossRef]
  53. Yeh, T.Y.; Contreras, G.P. Emerging Viral Mutants in Australia Suggest RNA Recombination Event in the SARS-CoV-2 Genome. Med. J. Aust. 2020, 1. Available online: https://www.mja.com.au/journal/2020/213/1/emerging-viral-mutants-australia-suggest-rna-recombination-event-sars-cov-2 (accessed on 7 February 2023). [CrossRef]
  54. da Silva Francisco Junior, R.; de Almeida, L.G.P.; Lamarca, A.P.; Cavalcante, L.; Martins, Y.; Gerber, A.L.; de C. Guimarães, A.P.; Salviano, R.B.; dos Santos, F.L.; de Oliveira, T.H.; et al. Emergence of within-Host SARS-CoV-2 Recombinant Genome after Coinfection by Gamma and Delta Variants: A Case Report. Front. Public Health 2022, 10, 231. [Google Scholar] [CrossRef] [PubMed]
  55. Banerjee, A.; Doxey, A.C.; Benjamin, J.M.T.; Mansfield, M.J.; Subudhi, S.; Hirota, J.A.; Miller, M.S.; Andrew, G.M.; Mubareka, S.; Mossman, K. Predicting the recombination potential of severe acute respiratory syndrome coronavirus 2 and Middle East respiratory syndrome coronavirus. J. Gen. Virol. 2021, 101, 1251–1260. [Google Scholar] [CrossRef] [PubMed]
  56. Ball, P. Pattern Formation in Nature: Physical Constraints and Self-Organising Characteristics. Archit. Des. 2012, 82, 22–27. [Google Scholar] [CrossRef]
  57. Ball, P. Pattern of life. Nature 2000. [Google Scholar] [CrossRef]
  58. Rodriguez, S. Hardy–Weinberg Law. In Brenner’s Encyclopedia of Genetics; Elsevier: Amsterdam, The Netherlands, 2013; pp. 396–398. ISBN 9780080961569. [Google Scholar]
  59. Bhole, G.; Shukla, A.; Mahesh, T.S. Benford Analysis: A useful paradigm for spectroscopic analysis. Chem. Phys. Lett. 2014, 639, 36–40. [Google Scholar] [CrossRef]
  60. Berger, A.; Hill, T.P. The mathematics of Benford’s law: A primer. Stat. Methods Appt. 2021, 30, 779–795. [Google Scholar] [CrossRef]
  61. Das, J.K.; Das, P.; Ray, K.K.; Choudhury, P.P.; Jana, S.S. Mathematical Characterization of Protein Sequences Using Patterns as Chemical Group Combinations of Amino Acids. PLoS ONE 2016, 11, e0167651. [Google Scholar] [CrossRef]
  62. Sayers, E.W.; Barrett, T.; Benson, D.A.; Bolton, E.; Bryant, S.H.; Canese, K.; Chetvernin, V.; Church, D.M.; DiCuccio, M.; Federhen, S.; et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012, 35, D5–D12. [Google Scholar] [CrossRef]
  63. O’Toole, Á.; Pybus, O.G.; Abram, M.E.; Kelly, E.J.; Rambaut, A. Pango lineage designation and assignment using SARS-CoV-2 spike gene nucleotide sequences. BMC Genom. 2022, 23, 121. [Google Scholar] [CrossRef]
  64. Embarak, D.O. Data Analysis and Visualization Using Python; Apress: Berkeley, CA, USA, 2018; ISBN 978-1-4842-4108-0. [Google Scholar]
  65. Ghosh, S.; Neha, K.; Praveen Kumar, Y. Data Wrangling Using Python. Int. J. Recent Technol. Eng. 2019, 8, 3491–3495. [Google Scholar] [CrossRef]
  66. Löytynoja, A.; Goldman, N. webPRANK: A phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinform. 2010, 11, 579. [Google Scholar] [CrossRef] [PubMed]
  67. Holland, L.A.; Kaelin, E.A.; Maqsood, R.; Estifanos, B.; Wu, L.I.; Varsani, A.; Halden, R.U.; Hogue, B.G.; Scotch, M.; Lim, E.S. An 81 base-pair deletion in SARS-CoV-2 ORF7a identified from sentinel surveillance in Arizona (Jan-Mar 2020). medRxiv Prepr. Serv. Health Sci. 2020. [Google Scholar] [CrossRef]
  68. Yuan, F.; Wang, L.; Fang, Y.; Wang, L. Global SNP analysis of 11,183 SARS-CoV-2 strains reveals high genetic diversity. Transbound. Emerg. Dis. 2021, 68, 3288–3304. [Google Scholar] [CrossRef] [PubMed]
  69. Colson, P.; Delerce, J.; Burel, E.; Beye, M.; Fournier, P.-E.; Levasseur, A.; Lagier, J.-C.; Raoult, D. Occurrence of a substitution or deletion of SARS-CoV-2 spike amino acid 677 in various lineages in Marseille, France. Virus Genes 2022, 58, 53–58. [Google Scholar] [CrossRef]
  70. Kemp, S.A.; Meng, B.; Ferriera, I.A.T.M.; Datir, R.; Harvey, W.T.; Collier, D.A.; Lytras, S.; Papa, G.; Carabelli, A.; Kenyon, J.; et al. Recurrent Emergence and Transmission of a SARS-CoV-2 Spike Deletion H69/V70. SSRN Electron. J. 2021. [Google Scholar] [CrossRef]
  71. Benedetti, F.; Snyder, G.A.; Giovanetti, M.; Angeletti, S.; Gallo, R.C.; Ciccozzi, M.; Zella, D. Emerging of a SARS-CoV-2 viral strain with a deletion in nsp1. J. Transl. Med. 2020, 18, 329. [Google Scholar] [CrossRef]
  72. Loureiro, C.L.; Jaspe, R.C.; D’Angelo, P.; Zambrano, J.L.; Rodriguez, L.; Alarcon, V.; Delgado, M.; Aguilar, M.; Garzaro, D.; Rangel, H.R.; et al. SARS-CoV-2 genetic diversity in Venezuela: Predominance of D614G variants and analysis of one outbreak. PLoS ONE 2021, 16, e0247196. [Google Scholar] [CrossRef]
  73. Zanchi, F.B.; Mariúba, L.A.; Nascimento, V.; Souza, V.; Corado, A.; Nascimento, F.; Costa, Á.K.; Duarte, D.; Silva, G.; Mejía, M.; et al. Structural analysis of SARS-Cov-2 nonstructural protein 1 polymorphisms found in the Brazilian Amazon. Exp. Biol. Med. 2021, 246, 2332–2337. [Google Scholar] [CrossRef]
  74. Lesbon, J.C.C.; Poleti, M.D.; de Mattos Oliveira, E.C.; Patané, J.S.L.; Clemente, L.G.; Viala, V.L.; Ribeiro, G.; Giovanetti, M.; de Alcantara, L.C.J.; de Lima, L.P.O.; et al. Nucleocapsid (N) gene mutations of sars-cov-2 can affect real-time rt-pcr diagnostic and impact false-negative results. Viruses 2021, 13, 2474. [Google Scholar] [CrossRef]
  75. Finkel, Y.; Mizrahi, O.; Nachshon, A.; Weingarten-Gabbay, S.; Morgenstern, D.; Yahalom-Ronen, Y.; Tamir, H.; Achdout, H.; Stein, D.; Israeli, O.; et al. The coding capacity of SARS-CoV-2. Nature 2021, 589, 125–130. [Google Scholar] [CrossRef] [PubMed]
  76. Firth, A.E. A putative new SARS-CoV protein, 3c, encoded in an ORF overlapping ORF3a. J. Gen. Virol. 2020, 101, 1085–1089. [Google Scholar] [CrossRef] [PubMed]
  77. Patten, M.L. The Mean and Standard Deviation. In Understanding Research Methods; Routledge: New York, NY, USA, 2017; pp. 137–138. [Google Scholar] [CrossRef]
  78. Krick, T.; Verstraete, N.; Alonso, L.G.; Shub, D.A.; Ferreiro, D.U.; Shub, M.; Sánchez, I.E. Amino Acid Metabolism Conflicts with Protein Diversity. Mol. Biol. Evol. 2014, 31, 2905–2912. [Google Scholar] [CrossRef] [PubMed]
  79. Mohanta, T.K.; Mohanta, Y.K.; Avula, S.K.; Nongbet, A.; Al-Harrasi, A. Virtual 2D map of cyanobacterial proteomes. PLoS ONE 2022, 17, e0275148. [Google Scholar] [CrossRef] [PubMed]
  80. Mohanta, T.K.; Khan, A.; Hashem, A.; Abd_Allah, E.F.; Al-Harrasi, A. The molecular mass and isoelectric point of plant proteomes. BMC Genom. 2019, 20, 631. [Google Scholar] [CrossRef]
  81. Miseta, A.; Csutora, P. Relationship Between the Occurrence of Cysteine in Proteins and the Complexity of Organisms. Mol. Biol. Evol. 2000, 17, 1232–1239. [Google Scholar] [CrossRef]
  82. Haque, S.M.; Ashwaq, O.; Sarief, A.; Azad John Mohamed, A.K. A comprehensive review about SARS-CoV-2. Future Virol. 2020, 15, 625–648. [Google Scholar] [CrossRef]
  83. Masters, P.S. The Molecular Biology of Coronaviruses. Adv. Virus Res. 2006, 66, 193–292. [Google Scholar] [CrossRef]
  84. San Martín, C. Structure and Assembly of Complex Viruses. Subcell. Biochem. 2013, 68, 329–360. [Google Scholar] [CrossRef]
  85. Serwin, K.; Ossowski, A.; Szargut, M.; Cytacka, S.; Urbańska, A.; Majchrzak, A.; Niedźwiedź, A.; Czerska, E.; Pawińska-Matecka, A.; Gołąb, J.; et al. Molecular Evolution and Epidemiological Characteristics of SARS COV-2 in (Northwestern) Poland. Viruses 2021, 13, 1295. [Google Scholar] [CrossRef]
  86. Hodcroft, E.B.; Zuber, M.; Nadeau, S.; Vaughan, T.G.; Crawford, K.H.D.; Althaus, C.L.; Reichmuth, M.L.; Bowen, J.E.; Walls, A.C.; Corti, D.; et al. Spread of a SARS-CoV-2 variant through Europe in the summer of 2020. Nature 2021, 595, 707–712. [Google Scholar] [CrossRef] [PubMed]
  87. Rafael Ciges-Tomas, J.; Franco, M.L.; Vilar, M. Identification of a guanine-specific pocket in the protein N of SARS-CoV-2. Commun. Biol. 2022, 5, 711. [Google Scholar] [CrossRef] [PubMed]
  88. Dong, H.; Wang, S.; Zhang, J.; Zhang, K.; Zhang, F.; Wang, H.; Xie, S.; Hu, W.; Gu, L. Structure-Based Primer Design Minimizes the Risk of PCR Failure Caused by SARS-CoV-2 Mutations. Front. Cell. Infect. Microbiol. 2021, 11, 741147. [Google Scholar] [CrossRef] [PubMed]
  89. Santiveri, C.M.; Jiménez, M.A. Tryptophan residues: Scarce in proteins but strong stabilizers of β-hairpin peptides. Biopolymers 2010, 94, 779–790. [Google Scholar] [CrossRef] [PubMed]
  90. Bielecki, M.; Wójtowicz, H.; Olczak, T. Differential roles of tryptophan residues in conformational stability of Porphyromonas gingivalis HmuY hemophore. BMC Biochem. 2014, 15, 2. [Google Scholar] [CrossRef]
  91. Bansal, K.; Kumar, S. Mutational cascade of SARS-CoV-2 leading to evolution and emergence of omicron variant. Virus Res. 2022, 315, 198765. [Google Scholar] [CrossRef]
  92. Rajpal, V.R.; Sharma, S.; Kumar, A.; Chand, S.; Joshi, L.; Chandra, A.; Babbar, S.; Goel, S.; Raina, S.N.; Shiran, B. “Is Omicron mild”? Testing this narrative with the mutational landscape of its three lineages and response to existing vaccines and therapeutic antibodies. J. Med. Virol. 2022, 94, 3521–3539. [Google Scholar] [CrossRef]
  93. Barton, M.I.; MacGowan, S.A.; Kutuzov, M.A.; Dushek, O.; Barton, G.J.; van der Merwe, P.A. Effects of common mutations in the SARS-CoV-2 Spike RBD and its ligand, the human ACE2 receptor on binding affinity and kinetics. eLife 2021, 10, e70658. [Google Scholar] [CrossRef]
  94. Liu, H.; Wei, P.; Zhang, Q.; Chen, Z.; Aviszus, K.; Downing, W.; Peterson, S.; Reynoso, L.; Downey, G.P.; Frankel, S.K.; et al. 501Y.V2 and 501Y.V3 variants of SARS-CoV-2 lose binding to bamlanivimab in vitro. MAbs 2021, 13, 1919285. [Google Scholar] [CrossRef]
  95. Teng, S.; Sobitan, A.; Rhoades, R.; Liu, D.; Tang, Q. Systemic effects of missense mutations on SARS-CoV-2 spike glycoprotein stability and receptor-binding affinity. Brief. Bioinform. 2021, 22, 1239–1253. [Google Scholar] [CrossRef]
  96. Arora, P.; Pöhlmann, S.; Hoffmann, M. Mutation D614G increases SARS-CoV-2 transmission. Signal Transduct. Target. Ther. 2021, 6, 101. [Google Scholar] [CrossRef] [PubMed]
  97. Aleem, A.; Akbar Samad, A.B.; Slenker, A.K. Emerging Variants of SARS-CoV-2 and Novel Therapeutics against Coronavirus (COVID-19); StatPearls: Treasure Island, FL, USA, 2021. [Google Scholar]
  98. Hadfield, J.; Megill, C.; Bell, S.M.; Huddleston, J.; Potter, B.; Callender, C.; Sagulenko, P.; Bedford, T.; Neher, R.A. Nextstrain: Real-time tracking of pathogen evolution. Bioinformatics 2018, 34, 4121–4123. [Google Scholar] [CrossRef] [PubMed]
  99. Wolter, N.; Jassat, W.; Walaza, S.; Welch, R.; Moultrie, H.; Groome, M.J.; Amoako, D.G.; Everatt, J.; Bhiman, J.N.; Scheepers, C.; et al. Clinical severity of SARS-CoV-2 Omicron BA.4 and BA.5 lineages compared to BA.1 and Delta in South Africa. Nat. Commun. 2022, 13, 19. [Google Scholar] [CrossRef] [PubMed]
  100. Tegally, H.; Moir, M.; Everatt, J.; Giovanetti, M.; Scheepers, C.; Wilkinson, E.; Subramoney, K.; Makatini, Z.; Moyo, S.; Amoako, D.G.; et al. Emergence of SARS-CoV-2 Omicron lineages BA.4 and BA.5 in South Africa. Nat. Med. 2022, 28, 1785–1790. [Google Scholar] [CrossRef]
  101. Mukaka, M.M. Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Med. J. 2012, 24, 69–71. [Google Scholar]
  102. Du, Q.; Wei, D.; Chou, K.-C. Correlations of amino acids in proteins. Peptides 2003, 24, 1863–1869. [Google Scholar] [CrossRef]
  103. Rehman, S.; Mahmood, T.; Aziz, E.; Batool, R. Identification of novel mutations in SARS-COV-2 isolates from Turkey. Arch. Virol. 2020, 165, 2937–2944. [Google Scholar] [CrossRef]
  104. Nasser, H.; Shimizu, R.; Ito, J.; Matsuno, K.; Nao, N.; Sawa, H.; Kishimoto, M.; Tanaka, S.; Tsuda, M.; Wang, L.; et al. Monitoring fusion kinetics of viral and target cell membranes in living cells using a SARS-CoV-2 spike-protein-mediated membrane fusion assay. STAR Protoc. 2022, 3, 101773. [Google Scholar] [CrossRef]
  105. Gong, Y.-N.; Tsao, K.-C.; Hsiao, M.-J.; Huang, C.-G.; Huang, P.-N.; Huang, P.-W.; Lee, K.-M.; Liu, Y.-C.; Yang, S.-L.; Kuo, R.-L.; et al. SARS-CoV-2 genomic surveillance in Taiwan revealed novel ORF8-deletion mutant and clade possibly associated with infections in Middle East. Emerg. Microbes Infect. 2020, 9, 1457–1466. [Google Scholar] [CrossRef]
  106. Su, Y.C.F.; Anderson, D.E.; Young, B.E.; Linster, M.; Zhu, F.; Jayakumar, J.; Zhuang, Y.; Kalimuddin, S.; Low, J.G.H.; Tan, C.W.; et al. Discovery and Genomic Characterization of a 382-Nucleotide Deletion in ORF7b and ORF8 during the Early Evolution of SARS-CoV-2. MBio 2020, 11, 4. [Google Scholar] [CrossRef]
  107. Young, B.E.; Fong, S.-W.; Chan, Y.-H.; Mak, T.-M.; Ang, L.W.; Anderson, D.E.; Lee, C.Y.-P.; Amrun, S.N.; Lee, B.; Goh, Y.S.; et al. Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: An observational cohort study. Lancet 2020, 396, 603–611. [Google Scholar] [CrossRef]
  108. Zhou, W.; Wang, W. Fast-spreading SARS-CoV-2 variants: Challenges to and new design strategies of COVID-19 vaccines. Signal Transduct. Target. Ther. 2021, 6, 226. [Google Scholar] [CrossRef] [PubMed]
  109. Zinzula, L. Lost in deletion: The enigmatic ORF8 protein of SARS-CoV-2. Biochem. Biophys. Res. Commun. 2021, 538, 116–124. [Google Scholar] [CrossRef] [PubMed]
  110. Chou, J.-M.; Tsai, J.-L.; Hung, J.-N.; Chen, I.-H.; Chen, S.-T.; Tsai, M.-H. The ORF8 Protein of SARS-CoV-2 Modulates the Spike Protein and Its Implications in Viral Transmission. Front. Microbiol. 2022, 13, 883597. [Google Scholar] [CrossRef] [PubMed]
  111. Erster, O.; Mendelson, E.; Kabat, A.; Levy, V.; Mannasse, B.; Assraf, H.; Azar, R.; Ali, Y.; Bucris, E.; Bar-Ilan, D.; et al. Specific Detection of SARS-CoV-2 Variants B.1.1.7 (Alpha) and B.1.617.2 (Delta) Using a One-Step Quantitative PCR Assay. Microbiol. Spectr. 2022, 10, 2. [Google Scholar] [CrossRef]
  112. Azad, G.K.; Khan, P.K. Variations in Orf3a protein of SARS-CoV-2 alter its structure and function. Biochem. Biophys. Rep. 2021, 26, 100933. [Google Scholar] [CrossRef]
  113. Hassan, S.S.; Attrish, D.; Ghosh, S.; Choudhury, P.P.; Roy, B. Pathogenic perspective of missense mutations of ORF3a protein of SARS-CoV-2. Virus Res. 2021, 300, 198441. [Google Scholar] [CrossRef]
  114. Sengupta, A.; Hassan, S.S.; Choudhury, P.P. Clade GR and clade GH isolates of SARS-CoV-2 in Asia show highest amount of SNPs. Infect. Genet. Evol. 2021, 89, 104724. [Google Scholar] [CrossRef]
  115. Abbas, Q.; Kusakin, A.; Sharrouf, K.; Jyakhwo, S.; Komissarov, A.S. Follow-up investigation and detailed mutational characterization of the SARS-CoV-2 Omicron variant lineages (BA.1, BA.2, BA.3 and BA.1.1). bioRxiv 2022. [Google Scholar] [CrossRef]
  116. Zeng, H.-L.; Dichio, V.; Rodríguez Horta, E.; Thorell, K.; Aurell, E. Global analysis of more than 50,000 SARS-CoV-2 genomes reveals epistasis between eight viral genes. Proc. Natl. Acad. Sci. USA 2020, 117, 31519–31526. [Google Scholar] [CrossRef]
  117. Velazquez-Salinas, L.; Zarate, S.; Eberl, S.; Gladue, D.P.; Novella, I.; Borca, M.V. Positive Selection of ORF1ab, ORF3a, and ORF8 Genes Drives the Early Evolutionary Trends of SARS-CoV-2 during the 2020 COVID-19 Pandemic. Front. Microbiol. 2020, 11, 550674. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Average amino acid frequency and composition plots for SARS-CoV-2: (A) Average amino acid frequency and (B) average amino acid composition of all the SARS-CoV-2 proteomes.
Figure 1. Average amino acid frequency and composition plots for SARS-CoV-2: (A) Average amino acid frequency and (B) average amino acid composition of all the SARS-CoV-2 proteomes.
Biomedicines 11 00512 g001
Figure 2. Amino acid composition of the 11 SARS-CoV-2 variants.
Figure 2. Amino acid composition of the 11 SARS-CoV-2 variants.
Biomedicines 11 00512 g002
Figure 3. Correlation matrix of the various SARS-CoV-2 variants using their AACs.
Figure 3. Correlation matrix of the various SARS-CoV-2 variants using their AACs.
Biomedicines 11 00512 g003
Figure 4. Correlation plots showing the relationship among the 20 standard amino acids of SARS-CoV-2 at (A) no p-value, (B) p < 0.001, (C) p < 0.01, and (D) p < 0.05.
Figure 4. Correlation plots showing the relationship among the 20 standard amino acids of SARS-CoV-2 at (A) no p-value, (B) p < 0.001, (C) p < 0.01, and (D) p < 0.05.
Biomedicines 11 00512 g004
Figure 5. Probability plots to determine the normality for amino acid residues (A) Serine and (B) Methionine.
Figure 5. Probability plots to determine the normality for amino acid residues (A) Serine and (B) Methionine.
Biomedicines 11 00512 g005
Figure 6. Bar plots showing the (A) difference between amino acid composition per variant and the mean amino acid composition and (B) z-scores of the amino acid compositions belonging to each of the 11 SARS-CoV-2 variants.
Figure 6. Bar plots showing the (A) difference between amino acid composition per variant and the mean amino acid composition and (B) z-scores of the amino acid compositions belonging to each of the 11 SARS-CoV-2 variants.
Biomedicines 11 00512 g006
Table 1. Details of proteome samples of SARS-CoV-2 variants obtained from NCBI.
Table 1. Details of proteome samples of SARS-CoV-2 variants obtained from NCBI.
Nextstrain CladePango LineageWHO LabelNumber of FilesRemarks
Before CleaningAfter Cleaning
20I (Alpha, V1)B.1.1.7Alpha830706-
20H (Beta, V2)B.1.351Beta164-
20J (Gamma, V3)P.1Gamma7973-
21A (Delta)B.1.617.2Delta10043-
21B (Kappa)B.1.617.1Kappa1-Part of B.1.617.2
21C (Epsilon)B.1.427, B.1.429Epsilon0--
21D (Eta)B.1.525Eta1313-
21F (Iota)B.1.526Iota3085773-
21G (Lambda)C.37Lambda198-
21H (Mu)B.1.621Mu0--
21K (Omicron)BA.1Omicron2613-
21L (Omicron)BA.2Omicron23-Part of BA.1
22A (Omicron)BA.4Omicron32-
22B (Omicron)BA.5Omicron230-
22C (Omicron)BA.2.12.1Omicron0--
22D (Omicron)BA.2.75Omicron0--
20E (EU1)B.1.177 11-
20B/S:732AB.1.1.519 0--
20A/S:126AB.1.620 0--
20A.EU2B.1.160 11-
20A/S:439KB.1.258 0--
20A/S:98FB.1.221 0--
20C/S:80YB.1.367 0--
20B/S:626SB.1.1.277 0--
20B/S:1122LB.1.1.302 0--
Table 2. Average amino acid count for each of the 11 lineages used in this study. The values presented here are rounded up to 3 decimal places.
Table 2. Average amino acid count for each of the 11 lineages used in this study. The values presented here are rounded up to 3 decimal places.
ResiduesSARS-CoV-2 Variants
B.1.1.7B.1.351P.1B.1.617.2B.1.525B.1.526C.37BA.1BA.4B.1.177B.1.160
M311.508312311.973312.512312310.979312311.923312312309
E679.016680678.123677.721680680.049681680.385680681680
S950.745952.75951.849952.535950.615952.779956.75950.846949955955
L1363.7213651366.751366.581365.691364.881363.121363.12136213701365
V1148.6411531150.741147.141152.621152.571154.621152.77115211541151
P553.214556555.438549.953557556.211555552.923551.5557557
G838.057840837.781839.465840839.507833836.615835840841
F707.078707708.301703.884707707.248703710.385709.5709710
N763.834767.5764.932765.395765.154765.133767.5763763765766
K836.659834837.63836.535838.846838.477838.75841.615842836838
T1060.6510591061.751056.9110611058.8510551056.46105210641064
H263.683265263.192262.047264265.706264265.846267267265
Q514.037515517.534514.14514513.877517513.538513511515
R479.565483480.959480.837482481.034480482.923483481481
D718.683718.5720.493716.093720719.699720719.769719.5722722
A964.416965.75966.329965.698964.923965.577967.375965.462967963965
C432.669434433.904432.047434433.875435434.154435434434
Y640.779644645.589642.93643642.849642644.538645.5642643
I727.377731.75728.986727.116729.154732.816733.875734737729731
W156.469156.75156.959156.721157156.951157157157157157
Table 3. Average amino acid composition for each SARS-CoV-2 variant. Values are presented up to 3 decimal places.
Table 3. Average amino acid composition for each SARS-CoV-2 variant. Values are presented up to 3 decimal places.
ResiduesAverage Amino Acid Composition (AAC) of SARS-CoV-2 VariantsAvg AAC
B.1.1.7B.1.351P.1B.1.617.2B.1.525B.1.526C.37BA.1BA.4B.1.177B.1.160
M2.2082.2072.2062.2152.2072.1992.2072.2062.2082.2052.1842.205
E4.8124.8104.7964.8044.8104.8104.8174.8134.8124.8134.8064.809
S6.7386.7386.7326.7536.7246.7396.7686.7266.7156.7506.7506.739
L9.6649.6539.6669.6889.6609.6539.6439.6479.6389.6839.6479.658
V8.1408.1548.1398.1328.1538.1528.1688.1548.1528.1568.1358.149
P3.9213.9323.9283.8993.9403.9343.9263.9113.9023.9373.9373.924
G5.9395.9415.9255.9515.9415.9385.8935.9185.9095.9375.9445.930
F5.01155.0094.9905.0015.0024.9735.0255.0215.0115.0185.005
N5.4135.4285.4105.4265.4125.4115.4295.3975.3105.4075.4145.413
K5.9295.8985.9245.9305.9335.9305.9335.9535.9585.9095.9235.929
T7.5177.4897.5097.4927.5057.4897.4637.4737.4447.5207.5207.493
H1.8691.8741.8611.8581.8671.8791.8681.8801.8891.8871.8731.873
Q3.6433.6423.6603.6453.6363.6343.6573.6323.6303.6123.6403.639
R3.3993.4163.4023.4093.4093.4023.3963.4163.4183.4003.4003.406
D5.0935.0815.0965.0765.0935.0905.0935.0915.0915.1035.1035.092
A6.8346.8306.8346.8466.8256.8296.8436.8296.8436.8066.8206.831
C3.0663.0693.0693.0633.0703.0693.0773.0713.0783.0673.0673.070
Y4.5414.5544.5664.5584.5484.5474.5424.5594.5684.5374.5444.551
I5.1555.1755.1565.1555.1575.1835.1925.1925.2155.1525.1665.173
W1.1091.1091.1101.1111.1101.1101.1111.1101.1111.1101.1101.110
Table 4. Test for normality for the amino acid compositions using Shapiro-Wilk test, skewness and kurtosis. The values presented are rounded to 3 decimal places.
Table 4. Test for normality for the amino acid compositions using Shapiro-Wilk test, skewness and kurtosis. The values presented are rounded to 3 decimal places.
ResiduesSkewnessKurtosisShapiro Wilk Test
Statisticp-ValueTest Statistic (W)p-Value
M2.8600.0042.9680.7280.001
E−1.8340.0670.9780.9060.217
S0.4950.621−0.5100.9730.922
L1.1530.249−0.6610.9310.425
V0.0400.968−0.7720.9220.336
P−1.3010.193−0.8960.8840.115
G−1.7470.081−0.0950.8790.102
F−1.4630.1430.1670.9360.473
N0.2510.802−0.9360.9230.348
K0.0500.960−0.2030.9380.496
T−1.1280.259−0.6430.9230.341
H0.2780.781−0.9610.9680.865
Q−0.5620.5740.2100.9470.610
R0.6310.528−1.4010.8920.147
D−0.8770.380−0.1160.9110.250
A−1.2260.2200.1020.9380.493
C1.3380.181−0.0870.8910.143
Y0.5680.570−1.2200.9370.481
I1.3830.167−0.5200.8730.085
W−1.0540.292−0.7590.9270.384
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Broni, E.; Miller, W.A., III. Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes. Biomedicines 2023, 11, 512. https://doi.org/10.3390/biomedicines11020512

AMA Style

Broni E, Miller WA III. Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes. Biomedicines. 2023; 11(2):512. https://doi.org/10.3390/biomedicines11020512

Chicago/Turabian Style

Broni, Emmanuel, and Whelton A. Miller, III. 2023. "Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes" Biomedicines 11, no. 2: 512. https://doi.org/10.3390/biomedicines11020512

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop