Semicovariance Coefficient Analysis of Spike Proteins from SARS-CoV-2 and Other Coronaviruses for Viral Evolution and Characteristics Associated with Fatality

Complex modeling has received significant attention in recent years and is increasingly used to explain statistical phenomena with increasing and decreasing fluctuations, such as the similarity or difference of spike protein charge patterns of coronaviruses. Different from the existing covariance or correlation coefficient methods in traditional integer dimension construction, this study proposes a simplified novel fractional dimension derivation with the exact Excel tool algorithm. It involves the fractional center moment extension to covariance, which results in a complex covariance coefficient that is better than the Pearson correlation coefficient, in the sense that the nonlinearity relationship can be further depicted. The spike protein sequences of coronaviruses were obtained from the GenBank and GISAID databases, including the coronaviruses from pangolin, bat, canine, swine (three variants), feline, tiger, SARS-CoV-1, MERS, and SARS-CoV-2 (including the strains from Wuhan, Beijing, New York, German, and the UK variant B.1.1.7) which were used as the representative examples in this study. By examining the values above and below the average/mean based on the positive and negative charge patterns of the amino acid residues of the spike proteins from coronaviruses, the proposed algorithm provides deep insights into the nonlinear evolving trends of spike proteins for understanding the viral evolution and identifying the protein characteristics associated with viral fatality. The calculation results demonstrate that the complex covariance coefficient analyzed by this algorithm is capable of distinguishing the subtle nonlinear differences in the spike protein charge patterns with reference to Wuhan strain SARS-CoV-2, which the Pearson correlation coefficient may overlook. Our analysis reveals the unique convergent (positive correlative) to divergent (negative correlative) domain center positions of each virus. The convergent or conserved region may be critical to the viral stability or viability; while the divergent region is highly variable between coronaviruses, suggesting high frequency of mutations in this region. The analyses show that the conserved center region of SARS-CoV-1 spike protein is located at amino acid residues 900, but shifted to the amino acid residues 700 in MERS spike protein, and then to amino acid residues 600 in SARS-COV-2 spike protein, indicating the evolution of the coronaviruses. Interestingly, the conserved center region of the spike protein in SARS-COV-2 variant B.1.1.7 shifted back to amino acid residues 700, suggesting this variant is more virulent than the original SARS-COV-2 strain. Another important characteristic our study reveals is that the distance between the divergent mean and the maximal divergent point in each of the viruses (MERS > SARS-CoV-1 > SARS-CoV-2) is proportional to viral fatality rate. This algorithm may help to understand and analyze the evolving trends and critical characteristics of SARS-COV-2 variants, other coronaviral proteins and viruses.


Introduction
Complex algorithms are used to analyze real-world implementations, i.e., it comes as the trusted analytic solution, but typically tends to have challenges in software implementation complexity requiring simpler software solution by using complex theory. Complex algorithms have received significant attention in recent years and are increasingly used to solve real-world problems, among which are the combination of two or more algorithms involving numerical algorithms, analytic calculation [1], and other computational techniques such as artificial intelligence [2][3][4], gene analysis systems [3] or gene simulation [4].
The Fractal DNA hypothesis (FDH) was first introduced by at least three groups independently in 1992 [5]. The arithmetic data from the DNA sequences by counting the number of intervening bases from a specific base (A) to the next one, etc. (inter-event data), are characterized by a dynamical process whereby long-range (fractal) correlations are observed. Being different from the traditional DNA hypothesis, RNA and protein analysis is based on fragment length between the domains with electrical charges. Especially, it emphasizes the influences on the behaviors of charges caused by the difference of information reception and lengths of expression or neighbor status observing the existence of fractal structure in stable DNA [6]. Some work around FDH on RNA has been reported recently [7] where the genetic sequences were converted to binary numbers, purines converted to −1 and pyrimidines converted +1. The dimension order was found to be SARS-CoV-1 > SARS-CoV-2 > MERS, which differs from the time evolution order. Thus, we wish to further examine the similar electrical charge specific (+1 for positive amino acid, −1 for negative counterpart, 0 for neutral one, Inter-event data) relationship among the coronaviruses. The SARS-CoV-2 virus is among the longest positive single-stranded RNA virus, and its protein folding/tertiary structure and functions are closely related to the charges of the amino acid residues. It is therefore important to examine the charging structure/patterns or nonlinear correlation patterns of the spike protein of SARS-CoV-2 as compared to the spike proteins from other coronaviruses to understand viral evolution and the characteristics of the spike proteins associated with viral virulence and fatality.
The FDH does not distinguish the values above or below the mean (average) of the DNA fragment length between the gene signatures. Our algorithm used in this study focuses on distinguishing the values above and below the mean (average) to calculate the semicovariance coefficient of the spike protein sequences from coronaviruses, including SARS-CoV2. The higher value above the mean indicates higher similarity and increased evolutionary conservation, while the lower value below the mean indicates more dissimilarity and increased variations/mutations. Analysis with our algorithm can be carried out rapidly by running the Microsoft Excel sheet tool. In our study, the traditional Pearson correlation coefficient for the spike protein sequences of coronaviruses [8] is also calculated for comparison [9]. By imaging the charge similarity covariance as a weight (gravity or Coulomb force) on the rod of the axis, the weight center of semicovariance coefficient is calculated to examine the evolving weight center (both convergent (positive correlative) center/region and divergent (negative correlative) center/region) shifting pattern of the spike protein sequences of coronaviruses [10] and identify spike proteins' characteristics associated with viral virulence and fatality.

Hypo, Hyper or Gauss Variances and Covariance
The complex parameter is a measure of fluctuation-term memory time series. It relates to the autocorrelation time series, and the derivatives of Laplace transformation (frequency spectrum) time series or the momentum generation function at the origin [11]. Studies involving the complex parameter were originally developed by Jerome Cardan (1501-1576) for solving algebra equations.
In order to calculate the fractional version of the center momentum, we need to first generalize the binomial formula from integer domain to real domain, as described previously [12]. Here, we only show the formula: when k = 2, it is Gauss variance; k < 2 is hypo version; k > 2 is hyper version. Factorial of fractional k is calculated by Gamma function. From which we can also have the covariance counterpart: where ν(real) is the positive covariance, ν(img) is the negative covariance. Define the basic rectified linear unit as ReLU(X) = max(0,X), we can have the simplified semicovariance related back to the traditional Pearson correlation as below: In sum, the Pearson coefficient is the difference between the positive part (the first and the third quadrants) and the negative part (the second and the fourth quadrants) which are the proposed semicovariance coefficients. The product of two differences (the two values above the mean and the two values below the mean) on the same side of the mean value will be the real part, or the convergent part so that we can call it positive correlation covariance coefficient. The product that is on the opposite side of the mean will be the imaginary part, or the divergent part so that we can call it negative correlation covariance coefficient [13].

Excel Calculations of Semicovariance for Spike Proteins from SARS-CoV-2 and Other Coronaviruses
To compare and prove the usefulness of the simplified complex variances, we compare the correlation of SARS-CoV-2 viral spike protein sequence with other coronavirus spike protein sequences [14]. Since Excel is not capable of handling the imaginary number, we simplify the calculation with integer power, but separate the positive and negative covariance signs [15]. As coronaviral spike proteins have different electrical charge levels [16], we normalize the covariance by the variance, respectively just as the Pearson calculation does [17]. We calculated the sequences of the spike proteins [18] and plotted the curve starting from the N-terminus to the C-terminus. By using the moving window (a typical peptide) of 16 neighborhood amino acid residues [19], we calculated the covariance and average/mean over the same period of sequences to make the curve visually smooth for easier comparisons [20]. We selected the window size 16 to maximize the information we can extract from the charge even data, as it was the boundary between oligopeptide and polypeptide for the number of the amino acids.
Here are the steps for Excel sheet calculations: (1) Collect data from GenBank and GISAID, align up the spike position, and note the similarity and difference. (2) Convert the aligned sequence into charge data, i.e., positive is +1; negative is −1; and the rest is 0. We defined the conserved (and diverged) centers or regions of the spike proteins for different coronaviruses. For example, the conserved centers of SARS-CoV-1, pangolin and bat coronaviruses are located at the amino acid residues 905, 906 and 904, respectively. These viral spike proteins are set as the 900 series ( Table 1). The conserved center is defined as the weight center of the spike protein sequence at which the charge pattern before and after those points is the same. The conserved center for SARS-CoV-2 is at the amino acid residue 658, and the conserved center for feline coronavirus is at the amino acid residue 683 (Table 1). Thus, the SARS-CoV-2 and feline coronaviral spike proteins are defined as the 600 series. The conserved centers for MERS and three swine coronaviruses range from amino acid residues 727 to 784, hence they are defined as the 700 series. Therefore, as the latest SARS-CoV-2 variant, B.1.1.7 from the UK, has its conserved center at 702, this suggests that the charge pattern of the variant B.1.1.7 spike protein has evolved to 700 series from 600 series and that the variant B.1.1.7 may be more virulent or deadly than the original SARS-COV-2 strain.
Figures 1-6 are the calculation results from our algorithm of semicovariance coefficient for spike protein of the Wuhan SARS-CoV-2 strain in comparison with spike proteins of other coronaviruses listed in Table 1. Figures 7-10 are the corresponding scatter graphs. Figure 11 illustrates the center and maximum positions listed in Table 2.
The spike protein sequences were analyzed with index order from animal coronaviruses (pangolin, bat, canine, swine, feline, and tiger) and human coronaviruses (SARS-CoV-1, MERS, and SARS-CoV-2) [21]. Figure 1 presents the calculation results of the spike proteins for human coronaviruses including 600 (SARS-CoV-2)//700 (MERS)//900 (SARS-CoV-1) series of spike proteins semicovariance selected based on Table 1. Figure 2 presents the analysis for coronaviruses whose conserved center is located from the spike protein amino acid residues 900 to 999 (900 series) as described above. Figure 3 is for 800 and 700 series. Figure 4 is for 700 and 600 series. Figure 5 is for 600 series. Figure 6 is for 600 and 700 series. The spike protein sequence of Wuhan SARS-CoV-2 strain is used as a reference for comparison with the spike proteins of SARS-CoV-2 viral strain isolated from Beijing Xinfadi wholesale market (carrying D614G mutation), SARS-CoV-1, and MERS. Mathematically speaking, the diagram/curve above the X-axis is the positive correlation (convergent or conv in the figure). The higher the value, the greater the similarity of charge patterns between the compared viral spike proteins. SARS-CoV2 strain isolated in Beijing Xinfadi wholesale market is the same as the SARS-CoV-2 strain isolated in Wuhan throughout the entire sequence except D614G mutation. SARS-CoV-1 shows a similar pattern with SARS-CoV-2 after amino acid residue 700; while MERS shows a similar pattern with SARS-CoV-2 only around the amino acid sequence 1000. The second similar region of the MERS spike protein sequence with SARS-CoV-2 lies around amino acid residue 200. The diagram/curve below the X-axis is the negative correlation (divergent or dive in the figure). The lower the value, the more oppositely charged, thus the greater the disdissimilarity between the compared viruses. MERS has more opposite charges around amino acid position 200, 800 and 1200 as compared to SARS-CoV-2; while SARS-CoV-1 has a few opposite charges around amino acid position 200 and 450 as compared to SARS-CoV-2, indicating more similarity between the spike proteins from SARS-CoV-1 and SARS-CoV-2 but not between SARS-CoV-2 and MERS.

Figure 2.
Semicovariance coefficient of SARS-CoV-2 spike protein with the spike proteins from SARS-CoV-1, pangolin and bat coronaviruses (900 series). The diagram/curve above the X-axis is positive correlation (convergent or conv in the figure). The higher the value, the greater the similarity of the charge patterns between the compared viruses. The spike protein of SARS-CoV-1 is similar to the spike protein of Wuhan SARS-CoV-2 from amino acid residues 700 onwards, as well as the spike proteins from pangolin and bat coronavirus. The spike proteins of SARS-CoV-1, bat and pangolin coronaviruses overlap each other more after amino acid residues 700. The diagram/curve below the X-axis is negative correlation (divergent or dive in the figure). The lower the value, the more oppositely charged, thus the more dissimilarity between the compared viruses. The spike proteins of the pangolin and the bat peak around position 200, suggesting that the charge pattern is not similar at this region or the variations/mutations occurred more at this region between SARS-CoV-2 and the pangolin/bat coronaviruses. Similarly, SARS-CoV-1 peaks around 200 and 450 amino acid residue positions, suggesting that the charge patterns are different between SARS-CoV-1 and SARS-CoV-2 at this region or the mutations have made this region different between the two viruses.    figure). The higher the value, the greater the similarity of charge patterns between the compared viruses. SARS-CoV-2 isolated in Beijing Xinfadi wholesale market, New York, Germany and the New York zoo tiger overlap and are almost identical to Wuhan SARS-CoV-2. The diagram/curve below the X-axis is the negative correlation (divergent or dive in the figure). The lower the value, the more oppositely charged and the greater the dissimilarity between the compared viruses are. SARS-CoV-2 spike proteins from Beijing, New York, Germany and the New York zoo tiger carry D614G mutation and have the only opposite charge at amino acid residue 614 position (D614G mutation as reported in the literature). and Germany are almost identical to Wuhan SARS-CoV-2, except the beginning part of B.1.1.7 is mutated back to SARS-COV-1. The diagram/curve below the X-axis is the negative correlation (divergent or dive in the figure). The lower the value, the more oppositely charged and the greater the dissimilarity between the compared viruses are. The UK variant B.1.1.7 has not only the opposite charge around 600 amino acid residue position (D614G mutation as reported in the literature) but also at the old one around 100 (SARS-CoV-1) and at a new position (1000). The mutation sites occur towards both sides of the 600 series. It flips back more like bat coronavirus as well. The gaps between the mutation sites are as follows: 69 = 3 × 23; 73; 355 = 3 × 71; 69 = 3 × 23; 44 = 2 × 2 × 11; 67; 35 = 5 × 7; 266 = 2 × 7 × 19; 136 = 2 × 2 × 2 × 17; 155 = 5 × 31. There are three groups of prime numbers involved. The first group is 2,3,5, which belongs to cusps modular (Langlands) prime number. It might be related to the fractal shell-like growing structure. The second group is 7, 11, 19, 23, which belongs to 4k + 3 prime number, also called the Gaussian prime number. The latter is a closed field number on a complex plane, meaning that the numbers form a total ordered chain. It might be attribute to the 3D chain structure of the spike protein. The third group is 31, 67, 71, 73, which belongs to the prime numbers of binary digits. It might be attributed to the long folding structure of the protein.       Figures 1-6 for easier comparison) that obtains the peak value. The divergent center is defined as the position where the center of the charge (related to Coulomb force) of the entire sequence landed on (as if all the charges are originated from that one point). The distance (338−309 = 29) between these two positions (after beingbeen normalized) is proportional to the shift distance (18) between SARS-CoV-1 and the SARS-CoV-2 (Wuhan baseline). It is also proportional to the fatality (9.56%) of the virus (SARS-CoV-1) as seen in Table 2.  Table 1 compares the Pearson correlation coefficient analysis with semicovariance coefficient analysis for coronaviral spike proteins. From Table 1, it shows that the Pearson correlation coefficient only reflects the variation after the cancellation of up and down correlation [22]; however, our semicovariance coefficient reflects the direction of the variations before the cancellation of correlation [23].

Pearson and Semicovariance Coefficient Analysis of Spike Proteins from Coronaviruses
To visually examine the nonlinear correlation relationship between Wuhan strain SARS-CoV-2 and the rest of the coronaviruses, we further plotted scatter graphs where the X-axis is the charge variation over the 16 neighborhood amino average of the Wuhan strain of SARS-CoV-2, and the Y-axis is the charge variation over the 16 neighborhood amino average of the respective virus (Figures 7-10). It can be seen that some of them have nonlinear relationships. The Pearson correlation may not be good enough to depict all of them. The second quadrant and fourth quadrant represent the strong mutation part where the charge is reversed, the first quadrant and the third quadrant are the weak mutation part where the charge is not reversed (Figures 7-10).
It can be seen that there are a combination of linear and nonlinear relationships in Figure 7A,B; while Figure 7C shows a linear relationship. Figure 8A-C show a linear relationship for the scatter patterns of the spike proteins from the viral strains isolated in German, New York and Beijing relative to that of the Wuhan strain, indicating a high similarity. Figure 7C is identical to those of Figure 8A-C, suggesting that the same strain of the SARS-CoV-2 was transmitted from human to the New York Zoo tiger. All of these viral strains carry the D614G mutation in the spike protein. However, Figure 8D shows the UK variant B.1.1.7 vs. the Wuhan strain and there is a combination of linear and nonlinear relationships between them, indicating that the mutations in B.1.1.7 result in amino acid changes with opposite charges as compared to the Wuhan strain. Figure 9A shows a combination of linear and nonlinear relationships between SARS-CoV-1 and Wuhan strain SARS-CoV-2 spike proteins; while Figure 9B shows a nonlinear relationship between MERS and Wuhan strain SARS-CoV-2 spike proteins, indicating a strong dissimilarity between them. Figure 10 shows nonlinear relationships between those animal viral spike proteins and the Wuhan strain SARS-CoV-2 spike protein, indicating a strong dissimilarity between them. However, the local piecewise similarity island pattern is still clearly seen. That means they are still related somehow. Table 2 uses the spike protein sequence from Wuhan SARS-CoV-2 strain as a reference to compare with SARS-CoV-2 strain isolated in Beijing (carrying D614G mutation), SARS-CoV-1, and MERS. It incorporates fatality rates to identify critical amino acid regions associated with mortality. As compared to SARS-CoV-2, there are 74 amino acid residues in MERS spike protein sequence that are critical to MERS-associated fatality, and there are 18 amino acid residues that are associated with SARS-CoV-1 fatality. There are only nine amino acid residues in the Beijing strain viral spike protein that are different from Wuhan strain SARS-CoV-2 spike protein. The correlation coefficient of the analysis for these critical amino acids in the spike proteins associated with fatality is R = 0.9981 among the three coronaviruses infecting humans. The similar calculation for the ratio of Mutation Coulomb force center to maximum Coulomb force point leads to R = 0.9958. Divergent Coulomb intensity dictates the fatality. The diverged center (imaginary part) is illustrated in Figure 11; the same definition applies to the conserved (real) part. The definition of the maximum and the center of the divergence is illustrated in the same figure.

Discussion
This study presents the construction of a complex covariance for the fractional analysis of coronavirus spike proteins by using a fractional moment based simple algorithm coded in an Excel Sheet. The analysis with our novel complex model reveals an additional performance index over the traditional real model, such as the Pearson correlation coefficient. Our model compares the traditional Pearson calculation of the integer dimension against the fractional dimension. The complex calculation shows the differences among viral spike proteins, which the traditional covariance definition and calculation may overlook. Our study reveals the unique convergent (positive correlative) to divergent (negative correlative) centers of each virus and the distance/length between the positive-and negative-correlative centers/regions (Table 1). Interestingly, we found that the distance between divergent center (mean) and the maximal divergent point is associated with viral fatality. As compared to the SARS-CoV-2 strain isolated in Wuhan, the distance between the divergent center (mean) and the maximal divergent point is located at the amino acid residue 614 in the SARS-CoV-2 viral strains isolated in Beijing, Germany, New York and New York Zoo tiger. This suggests those viruses are essentially the same except at amino acid 614 (D614G mutation, aspartate (D) to glycine (G)) also reported in the literature [24,25]. While the distance between the divergent center (mean) and the maximal divergent point in the spike protein of SARS-CoV-1 is from the amino acid residues 309 to 338 (Table 1), the distance between the divergent center (mean) and the maximal divergent point in the spike protein of MERS is from the amino acid residue 214 to 698 as compared to the spike protein of Wuhan strain SARS-CoV-2 (Table 1). It is evident that the fatality rate caused by the virus is highly related to the distance between the divergent center (mean Coulomb force) and the maximal divergent Coulomb (force) point ( Table 2). The longer the distance, the more mutations (Coulomb force) and the more deadly and virulent the virus is. This region of the MERS spike protein occurs with a high frequency of variations as compared with the spike proteins from other coronaviruses and may be responsible for its high fatality.
From Table 1, it is shown that our complex coefficient reveals more dependency and trends of each protein sequence's evolution [26]. In the past, the viral spike protein's conserved center evolved from the amino acid residue 900 in SARS-CoV-1 down to 600 in SARS-CoV-2. The conserved region or convergent center may be critical to the viral stability or viability. This conserved center/region of the viral spike protein has been shifted from SARS-CoV-1 at the amino acid residue 900 to amino acid residue 700 in MERS spike protein, and then shifted to amino acid residue 600 in SARS-CoV-2. The charge pattern of the SARS-CoV-1 spike protein sequence around 900 is similar to that of the MERS spike protein around 700, and similar to that of the SARS-CoV-2 around amino acid residue 600. Interestingly, the convergent center of the UK variant B.1.1.7 is shifted from 600 in SARS-CoV-2 strains (Wuhan, German, New York, and Beijing strains) to 700 ( Table 1). The convergent center of the UK variant B.1.1.7 spike protein in 700 is similar to those of MERS and swine coronaviral spike proteins (Table 1), which may indicate greater lethality as compared with the SARS-CoV2 strains isolated in Wuhan, German, New York, and Beijing. Our analysis suggests that the conserved center/region may be essential for the biology, viability, and evolution of the coronaviruses. This conserved center/region may shift to a new location in new SARS-COV-2 variants or other novel coronaviruses.

Conclusions
In this study, we have analyzed spike protein charge patterns of coronaviruses by using our algorithm of semicovariance (nonlinear) coefficient as compared to the Pearson (linear) correlation coefficient, (based on the original semivariance principle [27] for risk analysis initiated by 1990 Nobel Prize winner Harry M. Markowitz). The analysis reveals an additional performance index over the Pearson analysis, such as both positive-and negative-correlative centers/regions in the spike proteins. The study reveals that the distance between the divergent center (mean) and the maximal divergent point is associated with viral fatality. The longer the distance is, the more variations/mutations are, and the more deadly the virus is. The correlation coefficient analysis also identifies the critical amino acids in the spike proteins associated with fatality among the three coronaviruses infecting humans. Our study suggests that the conserved center/region of spike proteins identified by the analysis is essential for the biology and viability of the viruses and that the shifting of this region involves the evolution of the viruses. The conserved center/region of the UK variant B.1.1.7 has shifted to the MERS category of viruses, indicating more virulence of this variant. In addition, the analysis provides an in-depth understanding for the nonlinear viral evolution pattern and identifies the protein Coulomb force characteristics which may be associated with viral fatality.
It is envisioned that this complex number model is a good alternative for the covariance analysis of coronaviral spike proteins. This type of analysis may go beyond asymmetrical fluctuations to help in developing high dimensional Fractal theory. However, the simplified calculation is easier for practical analysis and applications. The simplified Excel sheet calculation is very easy to use, accurate and forward compatible with traditional Pearson model and calculations. The example code is available from the Excel file on the GitHub server (https://github.com/steedhuang/COVID-19-gene-convertor) accessed on 21 April 2021. Our future work will look into other viral proteins with the same methodology for viral evolution and the Coulomb characteristics that are associated with viral fatality. More attention will be paid to the relationship between positive charges and infectivity. Additionally, we will use an unsupervised machine learning algorithm such as k-means classification to find the optimum moving window size (for peptide) and to predict the next mutation spot (region), and use a supervised machine learning algorithm such as Recurrent Neural Network (RNN) to find the potential upcoming virus variant(s) emerging time [28]. These analyses may allow the vaccine and antibody therapies to be prepared ahead of time before the new variants appear.