A total of 105,276 worldwide SARS-CoV-2 partial and complete sequences from 117 countries, corresponding to epiweek 52 of 2019 and epiweeks 1 to 37 of 2020, were downloaded from GISAID. After bioinformatics processing, we recovered and studied the genetic variability of 101,100 spike, 101,376 envelope, 103,419 membrane, and 99,675 nucleocapsid complete sequences. The epiweeks with available sequences varied across the geographic regions established according to GISAID classification (Table S1). Only Asia presented sequences in all 38 epiweeks under study, followed by Europe and North America (34 epiweeks each), Oceania (33 epiweeks), Africa (28 epiweeks), and South America (26 epiweeks). The number of analyzed sequences per protein and geographic region, the total number of mutated sequences, and the number, frequency, and nature of mutated residues in each protein are described in Table S2. The number of sequences available per country in each geographic region for each protein is listed in Table S3. The number of countries providing structural protein sequences in GISAID was 42 in Europe, 32 in Asia, 20 in Africa, 11 in North America (including Central America and the Caribbean), 9 in South America, and 3 in Oceania. Figure S1 shows the number of global and regional sequences with aa changes across the complete S, E, M, and N structural proteins of SARS-CoV-2. Figure 1 shows the percentage of global sequences with aa changes for each residue across the four structural SARS-CoV-2 proteins and their location within the protein domains.
3.1. Spike Protein (S)
Global S aa conservation was 99.97%. Among the 101,100 analyzed sequences, 2671 (2.6%) aa changes were found in 1132 (88.9%) of 1273 spike residues (Figure S1A). The most frequent aa change was D614G (81.5%, 82,183 sequences), located in S1 (Figure 1A), followed by S477N (4.1%), located in the receptor binding motif of the receptor binding domain (Figure 1A). All other substitutions had a frequency below 1%. D614G was also the most frequent aa substitution in all regions, as follows: Africa (94.7%), Asia (57.7%), Europe (82.9%), North America (84.3%), South America (93%), and Oceania (82%). D614G was found for the first time in epiweek 4 in Asia in two (1%) Chinese sequences, and in Oceania in one (11%) Australian sequence. In Europe, D614G appeared in epiweek 5 for the first time, mainly in Germany (41%), and in North America (31%), in three Canadian sequences. The last regions where this mutation appeared, in epiweek 9, were Africa (in three sequences from Nigeria, Senegal, and Morocco) and South America (in three sequences from Brazil). In epiweek 10, more than half (54.7%) of the total sequences showed this change, increasing to 97.9% in epiweek 37 (Figure 2).
S477N in the S protein was present in all the geographic regions, but mainly in Oceania (56.8%, 3851 Australian sequences), where its frequency rose from 6% (epiweek 20) to 100% (epiweek 31). In the regional analysis, the V1176F aa change stood out in South America (18.2%, 286 sequences), with all sequences but one belonging to Brazil.
3.2. Envelope Protein (E)
The global E aa conservation was 99.98%. Among the 101,376 E sequences analyzed, we found 142 aa changes in 65 (86.6%) positions of the 75 E residues (Figure S1B). All mutations were extremely infrequent, present in less than 0.3% of the total sequences (Figure 1B). The most prevalent aa change found in E was S68F, present in 221 (0.2%) global sequences, followed by L73F (122 sequences), R69I (92), P71L (68), T9I (56), and V62F (52), all with a frequency of 0.1%.
S68F was present in all regions except South America, and mainly in Europe (86%, 177 sequences), specifically England (68.9%), where its frequency rose from epiweek 12 (0.6%) to epiweek 19 (3%), decreasing to 0.2% in the last epiweek available. This aa change first appeared in Oceania (epiweek 10), and later in Asia and Europe (epiweek 12), North America (epiweek 13), and Africa (epiweek 21). Most sequences with T9I, R69I, P71L, and L73F belonged to Europe. V62F was present in 70% of sequences from the USA in North America, where its frequency increased in epiweeks 22 and 23 but dropped later.
In the analysis by geographic region, the mutation rate was less than 1% in all regions except in Africa, where V5F (1.5%) was present in 36 sequences (35 from Egypt), 92% of them belonging to the last epiweeks with African sequences available (33 and 34). In the analysis by epiweek, no steady increase over time globally or regionally was observed.
3.3. Membrane (M) Protein
The 103,419 M sequences analyzed presented 99.99% conservation, with 291 aa changes found in 165 (74.3%) positions of the 222 M aa (Figure S1C). Most changes had a very low frequency (≤0.2%), except for D3G (0.7%, 724 sequences) and T175M (1%, 1026 sequences) (Figure 1C). D3G was the most frequent change in Africa (3.4%) and South America (2.8%), and T175M in Europe (1.6%). However, both aa changes were present in the six geographic regions. Neither aa change showed a steady increase over time globally or regionally. D3G first appeared in European sequences (Lithuania) in epiweek 5 and was not detected in other regions until epiweek 10 (North and South America) and 11 (Asia, Africa, and Oceania). T175M was detected for the first time in epiweek 9 in Europe (England and Netherlands), and later in Asia and South America (epiweek 10), North America and Oceania (epiweek 11) and, lastly, in Africa (epiweek 12).
Globally, we observed a significant change over time in the following six aa substitutions in the M protein: A2S, L17I, D209Y, H125Y, V23L, and V60L. A2S increased from 0.2% in epiweek 24 to 1.4% in epiweek 30, mainly due to Australian sequences, but no further increase was observed after this epiweek. L17I increased from 0.5% in epiweek 30 to 2.8% in epiweek 32, and its frequency dropped in the last epiweeks; the increase was due to English sequences. D209Y showed a localized increase in frequency (from 0.2% to 1.1%) in epiweek 26. This increase was mainly due to sequences from the USA, but no further increase in global or American sequences was observed after epiweek 27. H125Y increased during the last epiweeks available, from 0.4% in epiweek 31 to 1.3% in epiweek 34, mainly due to UK sequences, specifically English and Scottish.
V23L increased from 0.4% (epiweek 19) to 1.4% (epiweek 22). This aa change was mainly present in the UK and the increase was due to sequences belonging to Wales. Lastly, V60L frequency increased around epiweeks 27 and 28 (1.6 and 1.2%) due to European sequences, specifically from England and Switzerland, decreasing later and rising again in epiweek 34 (1.4%), mainly due to sequences from Scotland and Switzerland.
3.4. Nucleocapsid (N) Protein
A total of 99,657 N worldwide sequences were analyzed, finding 890 aa changes in 359 (85.7%) of the 419 aa residues in the N protein (Figure S1D). The global aa conservation was 99.77%, slightly lower than that of the other structural proteins. Although most mutations had a low frequency, some positions showed mutations present in more than 1% of the total global sequences. This was the case for S197L (1.7%, 1686 sequences, 56% from Spain), P13L (1.8%, 1782 sequences, 62% from India and Singapore and 21% from Australia), D103Y (1.9%, 1863 sequences, 89% from England), S194L (3.2%, 3194 sequences, 39% from England and Scotland, 29% from the USA, and 11% from India), and, with the highest global frequency, G204R (37%, 36,598 sequences) and R203K (37.3%, 36,876 sequences).
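The per-substitution counts and the percentage of mutated positions reported for each protein can be obtained by comparing every sequence with the reference, residue by residue. The following is a minimal sketch of that tally, not the pipeline used in the study: it assumes sequences already aligned to the reference with no insertions or deletions, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def count_substitutions(reference: str, sequences: list[str]) -> Counter:
    """Count aa substitutions (named like 'D3G') across aligned sequences.

    Assumption: every sequence has the same length as the reference
    (no indels). Ambiguous residues written as 'X' are skipped.
    """
    counts: Counter = Counter()
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequence length differs from reference")
        for pos, (ref_aa, aa) in enumerate(zip(reference, seq), start=1):
            if aa != ref_aa and aa != "X":
                counts[f"{ref_aa}{pos}{aa}"] += 1
    return counts


def percent(part: int, whole: int) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Toy data for illustration only (five-residue fragment, four sequences).
toy_reference = "MADSN"
toy_sequences = ["MADSN", "MAGSN", "MAGSN", "MSDSN"]

changes = count_substitutions(toy_reference, toy_sequences)
for name, n in changes.most_common():
    # Frequency of each substitution among the analyzed sequences.
    print(name, n, f"{percent(n, len(toy_sequences)):.1f}%")

# Positions with at least one substitution, as a share of protein length.
mutated_positions = {int(name[1:-1]) for name in changes}
print(f"{percent(len(mutated_positions), len(toy_reference)):.1f}% of positions mutated")
```

The same `percent` helper reproduces the share of mutated positions from the reported counts, for example 359 of 419 residues for N.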
G204R and R203K, both located in the SR-linker (Figure 1D), tended to appear simultaneously in the N protein and were the most frequent aa changes in the six geographic regions: Africa (55.7%), Asia (26.8%), Europe (44.1%), North America (12%), South America (60.4%), and Oceania (65.9%). The G204R and R203K combination was first detected in epiweek 5 in three German sequences, then in epiweek 8 in Nigeria, epiweek 9 in Mexico and the USA, and epiweek 10 in Asia, Oceania, and South America. The ANOVA test showed significant differences between the aa combinations in positions 203 and 204 (p < 0.05). When comparing pairs of possible aa combinations in these positions (Sidak test), only R203 + G204 (aa in the Wuhan reference sequence NC_045512.2) and K203 + R204 (the most frequent combination) showed significant differences (p < 0.05) with all the other combinations present (MG, KG, KL, SG, RR, IG, RV, KQ, GG, GR, KT, and NR).
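For readers unfamiliar with the Sidak procedure, the sketch below shows how the per-comparison significance level shrinks as the number of pairwise comparisons grows. It uses the standard Šidák formula, 1 − (1 − α)^(1/m). The assumption that all pairs among the 14 listed combinations were compared is ours and is not stated in the text, and `sidak_alpha` is a hypothetical helper name.

```python
from math import comb


def sidak_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level under the Sidak correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return 1.0 - (1.0 - alpha) ** (1.0 / n_comparisons)


# Combinations named in the text: RG (reference), KR, and the 12 others.
combos = ["RG", "KR", "MG", "KG", "KL", "SG", "RR",
          "IG", "RV", "KQ", "GG", "GR", "KT", "NR"]

# Assumption: every pair of combinations is compared once.
n_pairs = comb(len(combos), 2)
print(n_pairs, sidak_alpha(0.05, n_pairs))
```

Statistical packages usually report Sidak-adjusted p-values directly, so this threshold form is only one way of reading the correction.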
The global rate of the G204R and R203K combination rose from 23% in epiweek 10 to 81% in epiweek 30, dropping to 16% in epiweek 37. Figure 3 shows the occurrence of both aa changes in N by epiweek, globally and in the six studied geographical regions. The exponential regression (Figure 4), performed on all available nucleocapsid sequences, showed an overall decrease of the RG combination over time (b = −0.02, Y = 109.7 × e^(−0.0264 × epiweek), or ln(Y) = ln(109.7) − 0.0264 × epiweek, R² = 88.7%) and an overall increase of the KR combination over time (b = 0.03, Y = 19 × e^(0.0343 × epiweek), or ln(Y) = ln(19) + 0.0343 × epiweek, R² = 73.2%). Nevertheless, due to the overrepresentation of European sequences in the total set of N sequences (Table S3), the regional trend of this combination showed different behaviors in the other five regions (Figure 3), with important frequency fluctuations over time. To analyze whether these regional fluctuations were related to the country of origin of the sequences or to the uneven distribution of the sequences from each country along the epiweeks, the statistical average of the R203K and G204R combination was analyzed between epiweeks and countries. The number and frequency of sequences carrying the R203K and G204R combination in the N protein by epiweek in each geographic region and country are described in Table S4; the aa combinations for positions 203 and 204 in each region and epiweek used for the scatter plots are available in Table S5.
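An exponential model of the form Y = a × e^(b × epiweek) becomes a straight line after taking logarithms, ln(Y) = ln(a) + b × epiweek, so a and b can be estimated by ordinary least squares on ln(Y). The sketch below illustrates this under the assumption that the regression was fitted in this log-linear way, which the text does not state explicitly. The data are noise-free synthetic values generated from the reported RG curve, not the study's frequencies, and the function names are hypothetical.

```python
import math


def fit_exponential(epiweeks, freq_percent):
    """Estimate (a, b) in Y = a * exp(b * epiweek) by least squares on ln(Y)."""
    if any(y <= 0 for y in freq_percent):
        raise ValueError("frequencies must be positive to take logarithms")
    xs = [float(x) for x in epiweeks]
    ln_ys = [math.log(y) for y in freq_percent]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_ly = sum(ln_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, ln_ys))
    b = sxy / sxx                          # slope of the log-linear fit
    a = math.exp(mean_ly - b * mean_x)     # back-transformed intercept
    return a, b


def predict(a, b, epiweek):
    """Frequency (%) predicted by the exponential model at a given epiweek."""
    return a * math.exp(b * epiweek)


# Synthetic, noise-free points generated from the reported RG curve.
weeks = list(range(10, 38))
synthetic_rg = [predict(109.7, -0.0264, w) for w in weeks]

a_hat, b_hat = fit_exponential(weeks, synthetic_rg)
print(f"a = {a_hat:.1f}, b = {b_hat:.4f}")
```

With real, noisy frequencies the recovered coefficients would not match exactly, and fitting on the log scale weights low frequencies more heavily than a direct non-linear fit would.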
In North America, the frequency increased until epiweek 23 (48.7%), and then decreased until epiweek 26, stabilizing at a frequency of around 20%. As 92% of the North American N sequences belonged to the USA (Table S3), this regional curve probably describes what happened only in this country, where the R203K and G204R combination frequency reached ≈50% in epiweek 23, and then dropped to ≈20% in the following epiweeks, except for an isolated increase to 38% in epiweek 36. In Canada, the country with the second most sequences in this region, the aa combination had a median rate of 20% until epiweek 18, increasing to 73% in epiweek 21 (last epiweek with more than 10 sequences in Canada). The absence or low number of N sequences per epiweek in the remaining North American countries precluded a comparable complete analysis. However, when comparing data from the available epiweeks with more than 10 sequences, we also observed an increase in the frequency of the R203K and G204R combination in N sequences from Costa Rica (from 7.4% in epiweek 12 to 69.2% in epiweek 27) and Mexico (from 9.5% in epiweek 21 to 63.6% in epiweek 32); the overall frequency of that combination in these two countries was 28.1% and 19.3%, respectively (Table S4).
In Europe, the R203K and G204R combination steadily increased to ≈85% around epiweek 30, decreasing to 3.2% in epiweek 37, when most of the sequences belonged to Wales. Most of the European sequences (72%) belonged to the UK, mainly to England, where these aa changes increased to 89% by epiweek 31, decreasing to 57% in epiweek 36 (last epiweek that met our criteria). Similarly, in Wales, the frequency rose until epiweek 33 (91%) and dropped later (4% in epiweek 37), as in Scotland (93% in epiweek 32 and 29% in epiweek 36), but not in Northern Ireland (71% in epiweek 35). Although the available N sequences differed across European countries and epiweeks, this same increase–decrease tendency was observed in Italy, Denmark, and Switzerland, whereas in other countries the R203K and G204R combination frequency increased over time (as in Netherlands, Spain, and Sweden), or only in the last available epiweeks with 10 sequences (as in Germany). A steady increase in frequency was observed in Portugal, France, and Russia, whereas in Austria the frequency remained stable and in Belgium it varied greatly without a clear tendency (Table S4).
In Africa, the frequency increased from 14% (epiweek 12) to 94.5% (epiweek 32), although a drop in frequency was observed in epiweeks 33 and 34, when only sequences from South Africa and Egypt were available. Most African sequences belonged to South Africa (61%), followed by the DRC (12%), Egypt (7%), and Senegal (5%) (Table S3). South Africa showed a steady increase in the R203K and G204R combination frequency (≈90% frequency in epiweeks 30–35). In the DRC, the frequency varied widely between epiweeks, and in Egypt the combination was very infrequent, present in only 4% of the total sequences, which explains the drop in the regional rate (Table S4).
In Oceania, the R203K and G204R combination frequency steadily increased from epiweek 20 to epiweek 31. Of note, 96% of Oceania’s sequences belonged to Australia, where this combination was present in ≈90% of sequences from epiweek 26 onward. In New Zealand, it increased from 6% (epiweek 12) to 53% (epiweek 17), the only epiweeks with enough sequences for analysis (≥10), although its total frequency was lower than in Australia (11.8% vs. 68%) (Table S4).
In Asia, the combination increased over time, except for epiweeks 26 and 27 and epiweeks 31 and 32, when the frequency dropped. Most Asian sequences in these epiweeks (>50% in epiweek 26 and 100% in epiweeks 27 and 31) were from Singapore and South Korea, where the R203K and G204R combination was infrequent, explaining the frequency drop in these epiweeks. More than half of Asia’s sequences were from India (31.3%), China (12.7%), and Singapore (11.5%) (Table S3). The regional presence of the R203K and G204R combination in N varied greatly between countries (Table S4). Considering those countries with >100 sequences, the highest frequencies were found in Bangladesh (87%) and Oman (75%), and the lowest in Malaysia (6%), Singapore (8%), China (9%), and South Korea and Thailand (10%). In India (total rate 36%), the frequency of this combination slowly increased until epiweek 25 (56%), dropping to ≈20% in the following epiweeks, and rising again to >80% in the last epiweeks that met our criteria (epiweeks 29, 33, and 34). In China, although the total frequency of the R203K and G204R combination was low, most sequences were concentrated in the first epiweeks, when these changes were absent or extremely infrequent. However, in epiweek 14 (10 sequences) the frequency reached 40%, and in epiweeks 28–30 (the next epiweeks that met our criteria) this frequency increased to >80%, as happened in India. In Singapore, the R203K and G204R combination was absent until epiweek 28, after which it increased from 4% to 47% in epiweek 36.
Finally, in South America, although some epiweeks did not meet our criteria for the time analysis, the R203K and G204R combination in N increased until epiweek 22, with great variations in frequency in the last epiweeks. Of note, the bulk of the sequences available belonged to Brazil (52%), followed by Colombia (13%), Chile (12%), and Peru (8%) (Table S3). Brazil showed the highest frequency of this combination (88.7%, Table S4). In epiweeks 10–19 (those that met our criteria), the frequency of the R203K and G204R combination in Brazil rose above 90% from epiweek 15 onward; a drop was observed in the following epiweeks with <10 sequences, and the combination was absent from the nine N sequences available in epiweek 30. In Colombia, Chile, and Peru, an increasing tendency was also observed in epiweeks meeting criteria (with ≥10 sequences). The observed increase was from 13.6% (epiweek 12) to 18.2% (epiweek 15) in Colombia, from 24% (epiweek 11) to 67% (epiweek 18) in Chile, and from 37% (epiweek 12) to 79.2% (epiweek 27) in Peru. In South America, in epiweeks 26 (3.6% regional rate) and 29 (14.3%), most of the available sequences belonged to Suriname, where these changes were extremely infrequent (6.67% total frequency), resulting in a low regional rate in these epiweeks.
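The regional and country-level curves described above rest on one rule: the frequency of the K203 + R204 combination is computed per epiweek, and epiweeks with fewer than 10 sequences are left out of the time analysis. A minimal sketch of that rule follows, assuming each sequence has already been reduced to its epiweek and the residues at positions 203 and 204. The record layout, the function name, and the toy data are hypothetical.

```python
from collections import defaultdict

MIN_SEQUENCES = 10  # threshold stated in the text (>=10 sequences per epiweek)


def combo_frequency_by_epiweek(records, min_sequences=MIN_SEQUENCES):
    """Percentage of sequences carrying K203 + R204 in each qualifying epiweek.

    records: iterable of (epiweek, aa203, aa204) tuples, one per sequence.
    Epiweeks with fewer than min_sequences sequences are omitted.
    """
    totals = defaultdict(int)
    carriers = defaultdict(int)
    for epiweek, aa203, aa204 in records:
        totals[epiweek] += 1
        if aa203 == "K" and aa204 == "R":
            carriers[epiweek] += 1
    return {
        week: 100.0 * carriers[week] / totals[week]
        for week in sorted(totals)
        if totals[week] >= min_sequences
    }


# Toy records for illustration only: epiweek 13 has just 3 sequences.
toy_records = (
    [(12, "R", "G")] * 9 + [(12, "K", "R")] * 1
    + [(13, "K", "R")] * 3
    + [(14, "K", "R")] * 6 + [(14, "R", "G")] * 6
)

result = combo_frequency_by_epiweek(toy_records)
print(result)
```

Applying the function to the records of a single country, and then to the pooled region, makes visible how an epiweek dominated by one country (as with Suriname in epiweeks 26 and 29) can pull the regional rate away from the trend seen elsewhere.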
In the regional analysis, the most frequent N mutations were the already mentioned D103Y, S194L, and S197L in Europe (3.2%, 2.8%, and 2.3%, respectively), S194L in North America (4.3%), P13L and S194L in Asia (15% and 6.4%, respectively), and P13L and S197L in Oceania (5.7% and 4.8%, respectively). Other frequent mutations not mentioned before were S202N in Asia and Africa (2.5% and 3.8%, respectively), I292T in South America (25%), Q384H in Africa (7%), and L230F in Oceania (2.7%).