Computational Analysis Predicts Correlations among Amino Acids in SARS-CoV-2 Proteomes

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a serious global challenge requiring urgent and permanent therapeutic solutions. These solutions can only be engineered if the patterns and rate of mutations of the virus can be elucidated. Predicting mutations and the structure of proteins based on these mutations have become necessary for early drug and vaccine design purposes in anticipation of future viral mutations. The amino acid composition (AAC) of proteomes and individual viral proteins provide avenues for exploitation since AACs have been previously used to predict structure, shape and evolutionary rates. Herein, the frequency of amino acid residues found in 1637 complete proteomes belonging to 11 SARS-CoV-2 variants/lineages were analyzed. Leucine is the most abundant amino acid residue in the SARS-CoV-2 with an average AAC of 9.658% while tryptophan had the least abundance of 1.11%. The AAC and ranking of lysine and glycine varied in the proteome. For some variants, glycine had higher frequency and AAC than lysine and vice versa in other variants. Tryptophan was also observed to be the most intolerant to mutation in the various proteomes for the variants used. A correlogram revealed a very strong correlation of 0.999992 between B.1.525 (Eta) and B.1.526 (Iota) variants. Furthermore, isoleucine and threonine were observed to have a very strong negative correlation of −0.912, while cysteine and isoleucine had a very strong positive correlation of 0.835 at p < 0.001. Shapiro-Wilk normality test revealed that AAC values for all the amino acid residues except methionine showed no evidence of non-normality at p < 0.05. Thus, AACs of SARS-CoV-2 variants can be predicted using probability and z-scores. AACs may be beneficial in classifying viral strains, predicting viral disease types, members of protein families, protein interactions and for diagnostic purposes. They may also be used as a feature along with other crucial factors in machine-learning based algorithms to predict viral mutations. These mutation-predicting algorithms may help in developing effective therapeutics and vaccines for SARS-CoV-2.


Introduction
Proteins play crucial roles in many biological processes in organisms and they remain a major source of druggable targets [1]. Protein homologs usually have similar structures and functions [2]. In addition, proteins belonging to the same family and having related shapes tend to interact with other molecules in similar ways. Proteins bind to other molecules in order to complete specific tasks, which is highly influenced by the way the protein's exposed surfaces interact with the molecules. The proteome of an organism consists of the proteins that are or can be expressed by a cell, tissue or organism, including intrinsically disordered proteins (IDP) or regions (IDR) [3,4].
Proteins are formed by long chains of amino acids which are connected via covalent peptide bonds [5,6]. Recent studies have analyzed the amino acids that make up proteins

Retrieving Amino Acid Sequences
FASTA files containing the amino acid sequences of the various SARS-CoV-2 proteins were retrieved from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov accessed on 10 September 2022) using the keywords SARS-CoV-2, complete genome and the "Pango lineage" [62]. The sequences were retrieved using the Pango lineage nomenclature classification of the various SARS-CoV-2 variants [63] which were obtained from CoVariants (https://www.covariants.org/variants accessed on 10 September 2022) and the World Health Organization (WHO) Tracking SARS-CoV-2 variants (https://www.who.int/activities/tracking-SARS-CoV-2-variants accessed on 10 September 2022).

Data Pre-Processing
Based on Pango lineage, the amino acid sequences were split into separate files. All lines in each text or FASTA file that do not contain amino acid residues were removed using the "sed" command on linux terminal. Each file was then stripped of newlines and whitespace characters in order to combine all the sequences as one continuous string. The files were then read and saved into "DataFrames" based on their lineages using Pandas, an open source data analysis and manipulation tool for Python [64,65]. All strains that had non-standard amino acid residues in their proteome sequences were removed from the data.

Determining Amino Acid Frequencies
For each strain (or sample), the frequencies and AACs of the standard amino acids were determined using the "counter" function in Python and saved into "Dataframes". The average AACs per variant were also determined by calculating the percentages of each residue per variant and for the virus as a whole. Furthermore, the standard deviations were computed and stored. The "Dataframes" were then exported as Excel workbooks.

Correlation Analyses
Correlation analyses were performed on the SARS-CoV-2 variants and the amino acid residues to investigate their relationships based on their AACs. R studio was employed to generate the correlation plots for the amino acid residues after loading the Excel workbooks. The "Hmisc" package in R was used to generate the correlation coefficients with their significance levels (p-values) by using the "rcorr" function. The "corrplot" package was then used to plot the correlograms.

Test for Normality
The study further tested the normality of each amino acid composition's distribution using the Shapiro-Wilk test, skewness, and kurtosis. Using the "shapiro" function in the "scipy" library, the test statistic and corresponding p-values were determined. The "skewtest" and "kurtosis" functions were employed to compute the skewness and kurtosis of the distribution, respectively. P-P plots were also generated to visualize the normality of the amino acid AAC distributions.

Validating Correlation Analysis
A multiple template alignment tool, WebPRANK was employed to align a maximum of 3 proteome sequences from each variant/lineage [66]. For variants that had fewer than 3 proteome files, all the available sequences were used. In all, a total of 28 sequences  4, and one each from B.1.177 and B.1.160. The samples are labelled as "A", "B" and "C" to enable readability of the analysis. The output from the alignment was then compared to the AAC analyses.

Results and Discussion
Based on the search via the National Center for Biotechnology Information (NCBI) database, the data obtained were grouped into 12 variants (Table 1)   All proteome sequences that had non-standard or ambiguous amino acid residues such as Xaa (unspecified or unknown amino acid, X), Glx (glutamine or glutamic acid, Z), Asx (asparagine or aspartic acid, B), and Xle (leucine or isoleucine, J) were removed from the dataset. All 23 proteome sequences of BA.5 had one or more of the non-standard amino acid residues thus, were not included in this study (Table 1) 14,136. Computing the complete amino acid count considering all 17 proteins in the SARS-CoV-2 proteome revealed a total of~14,439 residues. The inconsistency in the total residue counts obtained herein may be due to deletions and insertions in various SARS-CoV-2 genome nucleotides, resulting in fewer or more amino acid residues in the proteins, respectively [67,68]. For instance, an 81 base-pair deletion was observed in a SARS-CoV-2 case in Arizona, resulting in a 27 amino acid deletion in the Open reading frame 7a (ORF7a) protein [67]. Several other deletions have been reported in the SARS-CoV-2 spike protein [38,69,70] and non-structural protein 1 (nsp1) [71][72][73]. Amino acid deletions in the other SARS-CoV-2 proteins have also been highlighted in literature [68,74]. In addition, some SARS-CoV-2 ORF proteins such as ORF3c (~41 amino acids) are translated during infection [75,76]. The sample data obtained from NCBI did not contain sequences of ORF9b (9b) [~97 amino acids], ORF9c (9c) [~73 amino acids], ORF3b (3b) [~22 amino acids], ORF3c (3c) [~41 amino acids], and ORF3d (3d) [~57 amino acids] thus, the lower number of amino acid residues. Without these 5 proteins, the amino acid count is~14,149, consistent with those obtained herein.

Amino Acid Frequencies per Variant
The frequencies of the 20 standard amino acids in each of the 1637 complete SARS-CoV-2 proteomes were determined. For each of the 11 variants, the average frequencies of the amino acid residues were also determined ( Table 2). The average amino acid frequencies of all SARS-CoV-2 variants along with their standard deviations were also computed using the mean frequencies of each variant ( Figure 1A and Table S1). The variant averages were used in order to avoid biases owing to the uneven data sets of the variants (the uneven number of samples for each variant). It was observed that leucine had the highest mean frequency of 1365.131 (Table S1) Table 2].
The standard deviation provides a measure to determine the dispersion of the data from the mean [77]. The standard deviation of the amino acid frequency provides information on the extent of mutation by each residue in the various variants. On average, threonine was observed to have the highest standard deviation of 3.743 followed by isoleucine, glycine, proline, phenylalanine, lysine, and serine with 3.11, 2.467, 2.421, 2.357, 2.337, and 2.296, respectively (Table S1), signifying the relatively higher involvement of these residues in SARS-CoV-2 mutations. Tryptophan, cysteine and methionine on the other hand, were observed to have the least standard deviations of 0.175, 0.864, and 0.95, respectively (Table S1), implying that they are less involved in SARS-CoV-2 mutations.  Table 2].
The standard deviation provides a measure to determine the dispersion of the data from the mean [77]. The standard deviation of the amino acid frequency provides information on the extent of mutation by each residue in the various variants. On average, threonine was observed to have the highest standard deviation of 3.743 followed by isoleucine, glycine, proline, phenylalanine, lysine, and serine with 3.11, 2.467, 2.421, 2.357, 2.337, and 2.296, respectively (Table S1), signifying the relatively higher involvement of these residues in SARS-CoV-2 mutations. Tryptophan, cysteine and methionine on the other hand, were observed to have the least standard deviations of 0.175, 0.864, and 0.95, respectively (Table S1), implying that they are less involved in SARS-CoV-2 mutations.

Highest and Least Represented Amino Acids in the SARS-CoV-2 Proteomes
Due to the inconsistency in amino acid frequencies, the AAC for the various SARS-CoV-2 variants were determined ( Figure 2 and Table 3). The AAC provides a more accurate metric to compare the amino acids in the various SARS-CoV-2 proteomes since it deals with the percentage of each residue in the proteome. The AAC caters for the inconsistencies in residue frequencies caused by insertion or deletion of genomic nucleotides. Proteome-wide analysis of the various SARS-CoV-2 variants showed that Leu was the most abundant amino acid with an average AAC of 9.658% while Trp was the least with an average AAC of 1.110% ( Figure 1B and Table 3). Leucine is the most abundant amino acid in proteomes [78] and has been shown to be the most abundant in cyanobacterial proteomes [79] and even plant proteomes from 144 plant species [80]. Herein, valine, threonine, alanine and serine are the other highly abundant amino acids with average AACs of 8.149, 7.493, 6.831, and 6.739%, respectively ( Figures 1B and 2 and Table 3). In plant proteomes, Trp was also observed to be the least encoded [80]. Other low abundant SARS-CoV-2 amino acids in addition to Trp include His, Met, and Cys with average AAC values of 1.873, 2.205, and 3.070%, respectively ( Table 3). Trp and Cys have previously been reported to be the least abundant in proteins found in Swissprot and TrEMBL databases [78]. A previous study showed that cysteine abundance positively correlates with the complexity of the organism although cysteine is underrepresented in all organisms [81]. Cysteine occurrence in Thermus aquatcus, Haloarcula maresmortui, Escherichia coli, Saccharomyes cerevisiae, Drosophila, and human proteins were 0.4, 0.5, 1.1, 1.3, 1.9, and 2.3%, respectively [81]. Although coronaviruses are structurally complex [82][83][84], the classification of viruses as organisms is highly debated. Investigating the cysteine occurrence in different viruses is necessary to determine if viral complexity (among viruses) also correlates with the cysteine content. On average, almost 55.61% of the SARS-CoV-2 proteome were observed to be non-polar amino acids while 44.39% were polar (Table 3). The average AACs per variant were also determined in order to analyze the highest and lowest abundant amino acids for each variant. The AACs can help determine the relationship between variants since AACs are signatures for organisms and environments [13]  The average AACs per variant were also determined in order to analyze the highest and lowest abundant amino acids for each variant. The AACs can help determine the relationship between variants since AACs are signatures for organisms and environments [13].  [85]. Moreover, both variants were first identified in Europe and there is no evidence of increased viral transmissibility between them [86].  (Table 3). Tryptophan was observed to be relatively equal in all the SARS-CoV-2 variants analyzed herein. Previous studies have shown that tryptophan in different SARS-CoV-2 proteins are conserved and replacing them might make the proteins unstable [87,88]. This may be due to the fact that tryptophan is encoded by only one codon, UGG. The paucity of the UGG codon in the genome makes it difficult to allow for any mutations, which might result in highly unstable proteins in the proteome as a whole. A recent study has shown that mutating tryptophan in the neucleocapsid (N) protein makes the virus very unstable [87]. Tryptophan has been reported to play an important role in the structural stability of proteins [89,90].

Lysine and Glycine Counts and Ranking
Sorting the amino acid frequencies and AACs in descending order revealed that the positions of lysine and glycine varied in the proteomes of the SARS-CoV-2 variants. Glycine and lysine had mean values of 838.220 and 838.047 with standard deviations of 2.467 and 2.337, respectively, for all SARS-CoV-2 variants. For variants B. 1.1.7, B.1.160, B.1.177,  B.1.351, B.1.525, B.1.526, B.1.617.2, and P.1, glycine had higher counts (ranked 6th most abundant) than lysine (7th most abundant) and vice versa for variants BA.1, BA.4, and C.37. For variants BA.1, BA.4, and C.37, lysine was the 6th most abundant while glycine was the 7th. Variants BA.1 and BA.4 both belong to the omicron variant of the SARS-CoV-2 while C.37 is a lambda variant. Lambda has been reported to share a common ancestry with omicron in contrast with delta [91]. Omicron and lambda also share common mutations in the genomic regions of ORF1b, spike protein and ORF8 [92]. The difference in lysine and glycine counts in the various SARS-CoV-2 variants necessitates the investigation of their effects in viral transmission, replication, survival, and other mechanisms. Lysine mutation (K417N and E484K) in the spike protein has been shown to significantly influence Angiotensin-converting enzyme 2 (ACE2) binding [93,94], suggesting lysine's critical role in viral entry. An in silico structure-based energy calculation suggests that mutations at Gly431, Gly648, and Gly35 may cause massive destabilizing effects on the full-length spike protein [95]. In addition, variants with the D614G mutation were observed to be more dominant and had higher transmissibility than those with Asp614 by enhancing ACE2 binding affinity [31,96].

Correlation Analysis
A correlation matrix was generated for the SARS-CoV-2 variants using their AACs ( Figure 3 and Table S2). The correlation coefficients among the variants ranged between 0.999895 and 1, supporting the fact that all the SARS-CoV-2 variants share a high degree of closeness (Table S2) (Table S2). This result is consistent with the determined amino acid frequencies and compositions (Tables 2 and 3 (Table S2). BA.4 was observed to be the least correlated to the mean with a coefficient of 0.999958 (Table S2). A correlation plot of the 20 standard amino acids was also generated to investigate the relationship among the residues in all the SARS-CoV-2 proteomes (Figure 4 and Table  S3). Correlation coefficients closer to ±1 signify strong correlation [101]. Thus, values greater than 0.7 or less than −0.7 were considered in this study. A previous study that investigated the correlation among amino acids in 204 proteins argued that a correlation coefficient of 0.5 could be considered as strongly correlated since data on structural classes of proteins are a fuzzy set [102]. Herein, the p-values were also determined to ascertain whether the correlations are statistically significant. At p < 0.001, only four correlation relationships [Cys-Gly, Ile-Thr, Asn-Phe (negative correlations) and Cys-Ile (positive correlation)] were observed ( Figure 4B). At p < 0.01, a total of 13 relationships existed among the amino acid residues ( Figure 4C). Residue pairs Gly-Thr, Ser-Asn, Lys-Trp, Glu-Val, and Cys-Ile were strongly correlated at p < 0.01 while Cys-Leu, Cys-Gly, Cys-Thr, Ile-Leu, Ile-Gly, Ile-Thr, Asn-Phe, and His-Gln were negatively correlated ( Figure 4C). At p < 0.05, a total of 32 correlation relationships were observed among the amino acid residues (Figure 4D).
Threonine had the strongest negative correlation with isoleucine (−0.912) [ Figure 4 and Table S3] implying that any mutation that produces a threonine will most likely result in an isoleucine mutating to another amino acid across the proteome of SARS-CoV-2. Although this does not imply that threonine necessarily mutates to isoleucine (or vice versa), several threonine to isoleucine substitutions have been reported in various SARS-CoV-2 proteins [103]. The threonine to isoleucine mutations have been shown in silico to modify the hydrophobicity and structure [103].
In addition, isoleucine had a very strong negative correlation with leucine (−0.818) and glycine (−0.720) while sharing a very strong positive correlation with cysteine (0.835) [ Table S3]. Similar trends were observed for cysteine, sharing very strong negative corre-  (Table S2). This is not surprising as C.37 has been shown to be an offspring of B.  (Table S2).
A correlation plot of the 20 standard amino acids was also generated to investigate the relationship among the residues in all the SARS-CoV-2 proteomes (Figure 4 and Table S3). Correlation coefficients closer to ±1 signify strong correlation [101]. Thus, values greater than 0.7 or less than −0.7 were considered in this study. A previous study that investigated the correlation among amino acids in 204 proteins argued that a correlation coefficient of 0.5 could be considered as strongly correlated since data on structural classes of proteins are a fuzzy set [102]. Herein, the p-values were also determined to ascertain whether the correlations are statistically significant. At p < 0.001, only four correlation relationships [Cys-Gly, Ile-Thr, Asn-Phe (negative correlations) and Cys-Ile (positive correlation)] were observed ( Figure 4B). At p < 0.01, a total of 13 relationships existed among the amino acid residues ( Figure 4C). Residue pairs Gly-Thr, Ser-Asn, Lys-Trp, Glu-Val, and Cys-Ile were strongly correlated at p < 0.01 while Cys-Leu, Cys-Gly, Cys-Thr, Ile-Leu, Ile-Gly, Ile-Thr, Asn-Phe, and His-Gln were negatively correlated ( Figure 4C). At p < 0.05, a total of 32 correlation relationships were observed among the amino acid residues ( Figure 4D).

Test for Normality Using Shapiro and Predicting Probability Using Z-score
Due to the small variant size (11), the Shapiro-Wilk test was used to determine the normality of the distributions of the various AACs. Normality of the distributions will enable the use of z-scores in predicting the AACs (and possibly frequencies) in future variants or lineages. The skewness and kurtosis of the amino acid residue distributions were also determined ( Table 4). All the amino acid residues except methionine had p-values greater than 0.05 (Table 4). Met was observed to depart significantly from normality (W = 0.728, p-value < 0.05) [ Table 4]. However, at p < 0.001, Met does not show evidence of non-normality. Met also had the greatest skewness and kurtosis with values of 2.860 and 2.968, respectively. Serine had the most normal distribution (W = 0.973, p-value = 0.922) followed by histidine (W = 0.968, p-value = 0.865) and glutamine (W = 0.947, p-value = 0.610) [ Table 4]. Valine (W = 0.922, p-value = 0.336) and lysine (W = 0.938, p-value = 0.496) had the least skewness of 0.040 and 0.050, respectively (Table 4). Threonine had the strongest negative correlation with isoleucine (−0.912) [ Figure 4 and Table S3] implying that any mutation that produces a threonine will most likely result in an isoleucine mutating to another amino acid across the proteome of SARS-CoV-2. Although this does not imply that threonine necessarily mutates to isoleucine (or vice versa), several threonine to isoleucine substitutions have been reported in various SARS-CoV-2 proteins [103]. The threonine to isoleucine mutations have been shown in silico to modify the hydrophobicity and structure [103].

Test for Normality Using Shapiro and Predicting Probability Using Z-Score
Due to the small variant size (11), the Shapiro-Wilk test was used to determine the normality of the distributions of the various AACs. Normality of the distributions will enable the use of z-scores in predicting the AACs (and possibly frequencies) in future variants or lineages. The skewness and kurtosis of the amino acid residue distributions were also determined ( Table 4). All the amino acid residues except methionine had p-values greater than 0.05 (Table 4). Met was observed to depart significantly from normality (W = 0.728, p-value < 0.05) [ Table 4]. However, at p < 0.001, Met does not show evidence of non-normality. Met also had the greatest skewness and kurtosis with values of 2.860 and 2.968, respectively. Serine had the most normal distribution (W = 0.973, p-value = 0.922) followed by histidine (W = 0.968, p-value = 0.865) and glutamine (W = 0.947, p-value = 0.610) [ Table 4]. Valine (W = 0.922, p-value = 0.336) and lysine (W = 0.938, p-value = 0.496) had the least skewness of 0.040 and 0.050, respectively (Table 4). Normal probability plots were also generated to analyze how the AAC data deviate from a theoretical distribution (Figures 5 and S3). The probability plot, similar to the quantile-quantile (Q-Q) plot, shows the distribution of the data in comparison to the expected normal distribution. The data is expected to lie along a straight line provided the data is normally distributed. However, if the data is non-normal, a curve is observed, deviating from the straight line. Outliers are the data points that can be seen distant from majority of the points or further away from the straight line. The P-P plot confirmed serine's normality ( Figure 5A) as well as the other amino acids ( Figure S3) except methionine. Methionine was observed not to be normally distributed corroborating the results from the Shapiro-Wilk test ( Figure 5B). However, with a larger data size, methionine may obey the normality tests.
Generally, the normality of the amino acid distribution allows the prediction of AACs of SARS-CoV-2 variants using probability and z-scores. The difference in amino acid composition (variant AAC − mean AAC) ( Figure 6A) and standard scores (z-scores) [ Figure 6B] were determined with regards to the amino acid residues for the various SARS-CoV-2 variants. The difference between the variant AACs and the average AAC ranged between −0.04 and 0.042 ( Figure 6A and Table S4). The highest difference was demonstrated by the isoleucine of variant BA.4, having 0.042% more than the average isoleucine composition. The z-score can be calculated by using Equation (1).
normally distributed. However, if the data is non-normal, a curve is observed, deviating from the straight line. Outliers are the data points that can be seen distant from majority of the points or further away from the straight line. The P-P plot confirmed serine's normality ( Figure 5A) as well as the other amino acids ( Figure S3) except methionine. Methionine was observed not to be normally distributed corroborating the results from the Shapiro-Wilk test ( Figure 5B). However, with a larger data size, methionine may obey the normality tests.
where "std of the AAC" is the standard deviation of a specific amino acid composition (Table S5). Generally, the normality of the amino acid distribution allows the prediction of AACs of SARS-CoV-2 variants using probability and z-scores. The difference in amino acid composition (variant AAC − mean AAC) ( Figure 6A) and standard scores (z-scores) [ Figure  6B] were determined with regards to the amino acid residues for the various SARS-CoV-2 variants. The difference between the variant AACs and the average AAC ranged between −0.04 and 0.042 ( Figure 6A and Table S4). The highest difference was demonstrated by the isoleucine of variant BA.4, having 0.042% more than the average isoleucine composition. The z-score can be calculated by using Equation (1).
where "std of the AAC" is the standard deviation of a specific amino acid composition (Table S5). The z-scores of the 20 AACs for all the 11 variants ranged between −2.7 and 2.1 (Figure 6B) representing 0.3 to 98 % chance. The z-score can help predict the most likely combinations of amino acids in emerging variants. For example, from the above equation, there is a 99.29% chance of having a variant with 1.112% Trp since z-score will be 2.448 (~2.45). In addition, there will be a 0.59% chance of having a SARS-CoV-2 variant with 1.108% tryptophan composition since the z-score will be equal to −2.523 (~−2.52). This is not surprising as previous studies have shown that mutating Trp in the SARS-CoV-2 makes the virus highly unstable [87,88]. Moreover, 1.12% is closer to the average AAC (1.11) as compared to 1.08%. The AACs can be beneficial in predicting future viral proteomes which can help in drug discovery, vaccine and protein-protein interaction studies.

Validation of Correlation Analysis
A multiple template alignment of 28 proteome sequences from the 11 variants/lineages used herein were performed and analyzed using webPRANK [66]. The variants which had more than two or three samples in the alignment are labelled as "A", "B" or "C" in addition to the variant/lineage name to differentiate among them. The lengths of the 28 sequences from the 11 variants/lineages ranged from 14,026 to 14,149 residues while the sequence matrix length was 14,155 columns. The ORF1ab spanned from position 1 to 7096, while ORF1a was from 7097 to 11,501. The spike protein followed to position 12,780 and the ORF3a was from 12,781 to 13,055. The envelope protein's alignment ended at 13,130, followed by the membrane protein which ended at 13,352, and then ORF6 from 13,353 to 13,413. The alignments of ORF7a and ORF7b ended at positions 13,534 and The z-scores of the 20 AACs for all the 11 variants ranged between −2.7 and 2.1 ( Figure 6B) representing 0.3 to 98 % chance. The z-score can help predict the most likely combinations of amino acids in emerging variants. For example, from the above equation, there is a 99.29% chance of having a variant with 1.112% Trp since z-score will be 2.448 (~2.45). In addition, there will be a 0.59% chance of having a SARS-CoV-2 variant with 1.108% tryptophan composition since the z-score will be equal to −2.523 (~−2.52). This is not surprising as previous studies have shown that mutating Trp in the SARS-CoV-2 makes the virus highly unstable [87,88]. Moreover, 1.12% is closer to the average AAC (1.11) as compared to 1.08%. The AACs can be beneficial in predicting future viral proteomes which can help in drug discovery, vaccine and protein-protein interaction studies.

Validation of Correlation Analysis
A multiple template alignment of 28 proteome sequences from the 11 variants/lineages used herein were performed and analyzed using webPRANK [66]. The variants which had more than two or three samples in the alignment are labelled as "A", "B" or "C" in addition to the variant/lineage name to differentiate among them. The lengths of the 28 sequences from the 11 variants/lineages ranged from 14,026 to 14,149 residues while the sequence matrix length was 14,155 columns. The ORF1ab spanned from position 1 to 7096, while ORF1a was from 7097 to 11,501. The spike protein followed to position 12,780 and the ORF3a was from 12,781 to 13,055. The envelope protein's alignment ended at 13,130, followed by the membrane protein which ended at 13,352, and then ORF6 from 13,353 to 13,413 The B.1.617.2C is an ORF8-deleted sample. ORF8-deleted SARS-CoV-2 variants have been shown clinically to have higher transmissibility with milder disease outcomes [105][106][107][108][109]. The ORF8 has been suggested to be involved in pathogenesis and not viral genome replication [110]. Samples B.1.1.7A, B.1.1.7C, and P.1A were also observed to have partial deletions of the ORF8, from 27Q till the end of the ORF8. Deletions affecting the length of the ORF8 is not strange as they have been previously identified [109]. The 119DF120 deletion which has been mentioned in an earlier study [111], was also observed for B. From the alignment of the 28 samples, Trp was observed to be involved in only one mutation (W131L of ORF3a protein) for B.1.351A corroborating the AAC results which suggest that Trp is less likely to mutate (Table 3). Aside W131L, Trp131 has also been reported to mutate to other forms including W131S, W131R, and W131V [112,113]. Herein, all the other 27 samples used for the alignment demonstrated no Trp mutations.
A total of 74 mutations involving Thr or Ile were recorded, with 49 of these being direct Thr-to-Ile or Ile-to-Thr mutations, making up 66.22% of the total instances. Interestingly, there were other indirect mutations involving Thr and Ile. Aside Thr, Ile was observed to only mutate to/from three residues comprising of Met, Val, and Leu. Across the complete proteome, isoleucine was observed to share 5, 4, and 2 mutations with Val, Met, and Leu, respectively. Threonine was also observed to share 6, 4, 3, and 1 mutation(s) with Ala, Asn, Lys, and Arg, respectively. From the earlier correlogram, Thr was predicted to have a negative correlation with Ala and Lys at p < 0.05 ( Figure 4D), consistent with the alignment results. For the ORF1ab protein, 20 out of 27 (74.07%) mutation instances were Thr-to-Ile or Ile-to-Thr. For the envelope, ORF7a, ORF7b and ORF8 proteins, only one mutation each was observed involving Thr or Ile and they were directly linked. For the N protein, 3 out of 5 mutations (60%) involving Thr or Ile were directly from Thr to Ile.
Cys was observed to mutate to and from Gly in all samples except for BA.1C, BA.4A and BA.4B. In the N protein, G214C (all three C.37 samples) and G215C (BA.1A and B.1.617.2A) mutations were present while C316G (BA.1B) mutation was observed in the ORF1a. R5716C in the ORF1ab (same as R392C in nsp13) [114,115] was observed for BA.4A and BA.4B. In addition, C655Y is observed in both ORF1ab and ORF1a for BA.1C. The results from the alignment show that there is a 66% chance that a Cys will be formed by a mutation from a Gly, as only one out of the three instances Cys was formed by another residue (Arg). A larger data set is required to confirm these findings. However, the same cannot be said for Gly since Gly was observed to also mutate to/from Asp, Ser, Arg, Asn, Val, and Ala. The correlogram also predicted Gln and His to be negatively correlated. From the alignment analysis, Gln and His shared 7 mutations: two in ORF1ab (Q676H and Q5412H), one in ORF1a (Q676H), two in spike (Q683H and Q960H), one in ORF3a (Q57H) and one in the N protein (Q9H). Gln shared 5, 3, 1, and 1 mutation(s) with Arg, Leu, Glu, and Lys, respectively. His was also observed to share 6, 3, and 2 mutations with Tyr, Pro, and Asp, respectively. Although the correlation analysis predicted a strong negative correlation between Asn and Phe (Figure 4), there were no direct mutations between the two residues. They could probably be indirectly related.

Limitations of the Study
Similar to all other research, this study also has its shortcomings and limitations. The relatively small sample size used herein is a major shortcoming of this study since they may not accurately reflect the overall frequency of each amino acid in the proteomes although the use of AACs could help mitigate this challenge. Furthermore, the limited number of variants used (11) does not provide complete analysis of all SARS-CoV-2 variants. The uneven sample sizes of the variants could also influence the overall accuracy of the results reported herein. Nonetheless, this study reports correlations among the amino acid residues that can be further investigated on a larger sample size to corroborate these findings. Robust and more accurate strategies/methods for validating the correlations must be developed in order to identify or measure indirectly linked residues.
Furthermore, the study did not investigate the correlation existing among the amino acids taking epistasis into consideration. A mutation in one protein can influence another mutation in a different protein of the same organism since proteins often interact with each other to perform specific functions, and changes in one protein can impact the behavior or function of others. Efforts are ongoing to elucidate epistasis in the SARS-CoV-2 genome as this phenomenon is still in its nascent stages for this virus [116,117]. A recent study showed that several pairwise epistases exist among eight viral genes [116]. The impact of one mutation on another can be complex and may depend on a variety of factors, including the specific proteins involved, the context of the mutations, and the environment. Understanding these interactions is important for understanding the evolution of viruses and for developing effective strategies to control the spread of infectious diseases. This molecular mechanism also contributes to the evolution of SARS-CoV-2 and the emergence of new viral variants [116]. Understanding these SARS-CoV-2 mutation mechanisms is important for developing effective strategies to control the spread of COVID-19. Future studies could evaluate the correlations existing among the amino acid residues taking into consideration epistatically linked pairs.

Potential Implications of the Study and Future Perspective
This study showed that relationships exist among the amino acid residues in the SARS-CoV-2 proteome. These relationships in addition to knowledge from molecular mechanisms on viral mutations and environmental factors, can help in designing lab-made mutants for mutational studies. Previous studies have successfully predicted structural classes of proteins using AAC-based models [14,15]. Although predicting viral mutations are complex due to the many factors involved, optimizing machine-learning based models which use up-to-date resources can improve the success rate, which will in turn help in drug and vaccine research and design. The AACs along with evolutionary information can be exploited to predict protein secondary structures [21], and later for protein synthesis, study protein-protein and protein-drug interactions [19,20]. For diagnostics purposes, the AACs can help classify strains or variants since each variant has specific AACs for the amino acid residues.

Conclusions
This study showed that correlation relationships exist among the 20 standard amino acids in the SARS-CoV-2 proteome. Herein, the frequencies of the amino acid residues in 1673 complete samples belonging to 11 SARS-CoV-2 variants were carefully computed and analyzed. The AACs revealed that leucine and tryptophan were the most and least abundant, respectively. Correlation analysis also showed that cysteine and isoleucine have very strong negative correlations with leucine, glycine, and threonine at p < 0.01. Threonine and isoleucine were observed to have the strongest negative correlation of −0.912 at p < 0.001. At p < 0.001, isoleucine and cysteine had a very strong positive correlation of 0.835. Special attention must be given to these amino acids when predicting mutations in the SARS-CoV-2. In addition, we show that the AAC is able to predict the ancestry and relationships among SARS-CoV-2 variants. The AAC analysis and correlations reported herein will aid in predicting and classifying viral strains. The AACs of the variants were also observed to be normally distributed which can be leveraged to predict future proteome contents using probability and z-scores.  Table S1: Average and standard deviation of each amino acid for all the SARS-CoV-2 variants used in this study; Table S2: Correlation among the SARS-CoV-2 variants using their amino acid compositions; Table S3: Correlation matrix among the 20 standard amino acids using their compositions; Table S4: Calculated z-scores of the amino acid residues belonging to the 11 SARS-CoV-2 variants using their compositions. Z-score values are presented in 3 decimal places; Table S5: Average and standard deviation of the amino acid compositions.