Evolution of SARS-CoV-2 in Spain during the First Two Years of the Pandemic: Circulating Variants, Amino Acid Conservation, and Genetic Variability in Structural, Non-Structural, and Accessory Proteins

Monitoring SARS-CoV-2’s genetic diversity and emerging mutations in this ongoing pandemic is crucial to understanding its evolution and ensuring the performance of COVID-19 diagnostic tests, vaccines, and therapies. Spain has been one of the main epicenters of COVID-19, reaching the highest number of cases and deaths per 100,000 population in Europe at the beginning of the pandemic. This study aims to investigate the epidemiology of SARS-CoV-2 in Spain and its 18 Autonomous Communities across the six epidemic waves established from February 2020 to January 2022. We report on the circulating SARS-CoV-2 variants in each epidemic wave and Spanish region and analyze the mutation frequency, amino acid (aa) conservation, and most frequent aa changes across each structural/non-structural/accessory viral protein among the Spanish sequences deposited in the GISAID database during the study period. The overall SARS-CoV-2 mutation frequency was 1.24 × 10−5. The aa conservation was >99% in the three types of protein, being non-structural the most conserved. Accessory proteins had more variable positions, while structural proteins presented more aa changes per sequence. Six main lineages spread successfully in Spain from 2020 to 2022. The presented data provide an insight into the SARS-CoV-2 circulation and genetic variability in Spain during the first two years of the pandemic.


Introduction
Coronavirus disease (COVID- 19) was detected for the first time in December 2019 in Wuhan, China [1]. In January 2020, the responsible virus, acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was isolated and the complete viral genome was sequenced [2]. During the period covered in this study (February 2020 to January 2022), 10,122,981 confirmed COVID-19 cases and 93,839 deaths were declared in Spain, according to the Spanish Ministry of Health [3]. The first Spanish COVID-19 case emerged in late January 2020 [4]. However, this cannot be considered patient zero, since there were many independent SARS-CoV-2 introductions to the country at the beginning of the outbreak, with different successful lineages, probably favored by super-spreaders [4,5]. On 14 March 2020, the Spanish government implemented a national lockdown and a state of emergency until 21 June 2020 [6,7]. However, the national deconfinement plan started on 28 April 2020. This plan included four phases, with different degrees of restrictions for each Autonomous Community (AC) and phase. To contain the vast second wave of infection, during October 2020, a second state of emergency was implemented, affecting the national territory or certain Table 1. Proposed molecular functions of the twenty-six SARS-CoV-2 proteins.

Protein
Proposed Molecular Function

Results
A total of 88,248 Spanish SARS-CoV-2 complete and partial sequences collected from 24 February 2020 to 29 January 2022 (corresponding to 101 epiweeks) were downloaded from the GISAID database. After discarding those undated and/or incorrectly classified, more than 70,000 sequences for each SARS-CoV-2 protein were included in the study. The number of sequences available for each protein, period, and AC are described in Supplementary Table S1.

Nucleotide and Amino Acid Variability in the 26 Spanish SARS-CoV-2 Studied Proteins
Nucleotide substitutions between natural bases (guanine, adenine, cytosine, and thymine) were analyzed for each of the 26 SARS-CoV-2 proteins, revealing a total of 32,334 instances of polymorphisms across the SARS-CoV-2 Spanish genomes with available sequences in the GISAID database (Table 2).
Of the total instances of polymorphisms, 17,014 (52.6%) involved transition mutations and 15,320 (47.4%) transversion mutations (Supplementary Table S2), with a ratio of 1:0.90. The group of structural proteins presented more transversion than transition events (1:2.26). All the SARS-CoV-2 genes showed more transition than transversion events among the total instances of polymorphisms, except for the nsp12 (RNA polymerase) gene with a ratio of 1:1, and for nsp11, the Spike, the Nucleocapsid, and ORF7a genes with more transversions than transitions.
After the translation of nt sequences encoding each 26 SARS-CoV-2 protein, we analyzed the number of aa changes, deletions, stop codons, and completely conserved positions. We also calculated the percentage of conservation and mean aa changes per sequence in each protein considering only valid codons (Table 3). A total of 21,433 aa changes, 1579 deletions, and 593 stop codons were detected in the Spanish SARS-CoV-2 proteome, being nsp3 the protein with the highest number of deletions (423) and stop codons (134), followed by the Spike protein (397 and 123, respectively) (Figure 1c), being the largest proteins in the SARS-CoV-2 genome (nsp3,1,945 aa; S protein,1273 aa). Nsp3 encodes the papain-like protease (PLpro, domain within nsp3). In PLpro, we found five deleted residues in a total of eight sequences. The PLpro main catalytic residues, C111, H272, and D286 [109], were highly conserved, finding only two aa changes: C111Y in one sequence and D286N in three sequences of the total Spanish dataset.
The mean aa conservation was above 98% in all the analyzed proteins, with a global aa conservation of 99.69%, showing an average of 1.15 aa changes/deletions per sequence. The protein with the highest mean aa change/deletion frequency per sequence was the Spike (10.8), followed by the Nucleocapsid (3.79). The proteome's mean rate of variable aa positions was 84.06%. Variable aa positions ranged from 63.73% in nsp13 to 100% in ORF8. Twelve proteins presented more than 90% variable positions along their sequence: structural proteins N and S, non-structural proteins nsp1, 2, 3, and 7, and all accessory proteins (ORF 3a-10). The percentage of conserved positions for each protein is illustrated in Figure 1b. proteins (ORF 3a-10). The percentage of conserved positions for each protein is illustrated in Figure 1b.   Although the mutation frequency (Mf) and mean aa conservation were similar between the three groups of studied proteins, non-structural proteins presented the lowest mutation frequency (1.05 × 10 −5 ) and percentage of variable aa positions (79.19%) and the highest aa conservation (99.84%). Structural proteins presented the highest mutation frequency (1.60 × 10 −5 ) and mean aa changes/deletions per sequence (3.87). Meanwhile, accessory proteins showed the greatest percentage of variable aa positions (99.36%) and the lowest aa conservation (97.49%), but also the lowest number of aa changes/deletions per sequence (0.68). Among the structural proteins, the Nucleocapsid protein presented a higher rate of variable positions, followed by the Spike, the Envelope, and the Membrane proteins (Table 3). Our data revealed that nsp13 and nsp10 were the SARS-CoV-2 proteins with the lowest percentage of variable positions (63.73% and 64.03%, respectively), and N and ORF8 the highest, being 99.28% and 100%, respectively (Table 3). Within the Spike, the receptor-binding domain (RBD) region had a mean conservation of 98.89%, with 2.4 mean changes per sequence. All the aa changes detected in this region had a frequency below 10%, except for L452, T478, and N501, the three of them located in the receptor-binding motif.
The residues of the SARS-CoV-2 main protease (nsp5) involved in binding for remdesivir and Paxlovid (two antivirals recommended by the WHO for COVID-19 treatment in patients at risk of hospital admission) were highly conserved in our sequence dataset, with a percentage of mutated sequences below 0.2%. Some of these residues are involved in the binding of both drugs (C145, E166, H163, H164, Q192), while G143 is involved in Paxlovid binding, and other sites interact with remdesivir (H41, M49, Y54, F140, N142, S144, M165, L167, P168, H172, N187, R188, Q189, T190, and A191) [110,111]. We only found one sequence with a deletion in residue 192 (involved in the binding of both drugs) and no other aa changes in the rest of the sites that interact with Paxlovid, which showed complete conservation among the whole Spanish sequence dataset. As for other nsp5 residues that interact with remdesivir, we found aa changes in eight of them (H41Q, M49I/V, N142S, M165I, P168S, R188K/S, Q189K, and A191V/S). A191 was the most frequently mutated residue, presenting changes in 96 sequences, but with an overall very low variability frequency (0.11%). The main change was A191V (85 sequences), followed by A191S (11 sequences) and M49I (9 sequences). The rest of the changes appeared in less than five sequences of the complete dataset.

Amino Acid Variability in Spanish SARS-CoV-2 Structural Proteins
The Wu-Kabat protein variability coefficient (WK) was analyzed in the four structural proteins to study the susceptibility of an aa position to evolutionary replacements (Supplementary Table S3). The analysis showed the position-specific aa variations according to their frequency in the structural proteins and their main domains ( Figure 2).
In the Spike protein (Figure 2a   The Envelope protein had a Wu-Kabat coefficient of 1 in 16% of its 75 residues (Figure 2d). The maximum WK was 6 in positions 32 (A32G/R/I/E/P) and 34 (L34Y/A/F/ H/T), located in the transmembrane domain. The rate of sites with a coefficient of 1 was higher in the surface domain (30.77%), followed by the intravirion domain (14.63%), and the transmembrane domain (9.52%). The PDZ-binding domain (positions 72-75) had a median Wu-Kabat coefficient of 3.5, with all its residues presenting a WK > 1.

Most Prevalent aa Changes and Deletions in the Spanish Sequences
We identified all aa changes and deletions present in ≥10% of the total Spanish sequences per protein in the whole SARS-CoV-2 sequence set, finding 57 changes present in 13 proteins: non-structural proteins 3, 4, 6, 12, 13, and 14; structural proteins Spike, Membrane, and Nucleocapsid; and accessory proteins 3a, 7a, 7b, and 8. Their locations in the genome and prevalence are described in Figure 3. More than half of these changes (56%) were located in structural proteins, 30% were found in nsp, and 14% in accessory proteins. The Spike protein presented the greater number of changes present in ≥10% of the sequences (22), followed by the Nucleocapsid (9). The most frequent aa change in the Spanish dataset was D614G (98.01%) in the Spike protein, followed by P323L (97.50%) in nsp12 and T478K (57.08%) also in the Spike protein.

Most Prevalent aa Changes and Deletions in the Spanish Sequences
We identified all aa changes and deletions present in ≥10% of the total Spanish sequences per protein in the whole SARS-CoV-2 sequence set, finding 57 changes present in 13 proteins: non-structural proteins 3, 4, 6, 12, 13, and 14; structural proteins Spike, Membrane, and Nucleocapsid; and accessory proteins 3a, 7a, 7b, and 8. Their locations in the genome and prevalence are described in Figure 3. More than half of these changes (56%) were located in structural proteins, 30% were found in nsp, and 14% in accessory proteins. The Spike protein presented the greater number of changes present in ≥ 10% of the sequences (22), followed by the Nucleocapsid (9). The most frequent aa change in the Spanish dataset was D614G (98.01%) in the Spike protein, followed by P323L (97.50%) in nsp12 and T478K (57.08%) also in the Spike protein. Most of the other 13 proteins showed a low frequency in their most prevalent aa changes among the available Spanish sequences during the study period: seven of them under 1% (nsp7, nsp8, nsp10, nsp11, nsp15, nsp16, ORF6), and two of them below 2% (nsp1 and nsp9). Four proteins presented changes slightly more prevalent among their sequences: V485I in nsp2 (5.13%), V30L in ORF10 (6.12%), P132H in nsp5 (8.19%), and T9I Most of the other 13 proteins showed a low frequency in their most prevalent aa changes among the available Spanish sequences during the study period: seven of them under 1% (nsp7, nsp8, nsp10, nsp11, nsp15, nsp16, ORF6), and two of them below 2% (nsp1 and nsp9). Four proteins presented changes slightly more prevalent among their sequences: V485I in nsp2 (5.13%), V30L in ORF10 (6.12%), P132H in nsp5 (8.19%), and T9I in E (8.22%).
All aa changes, deletions, and stop codons found in each protein and their frequency are described in Supplementary Table S4.
The frequency of these 57 changes and deletions was further analyzed by epiweeks, grouped into the six previously described periods, in line with the Spanish epidemic curve: 1 (24 February to 20 June 2020), 2 (21 June to 5 December 2020), 3 (6 December 2020 to 13 March 2021), 4 (14 March to 19 June 2021), 5 (20 June to 16 October 2021), and 6 (17 October to 29 January 2022). According to the frequency difference (∆) of these aa changes between consecutive periods, the aa substitutions were grouped into five categories ( Figure 4).
In the first category (rows 1-2 in Figure 4), we grouped the aa changes that became predominant early in the pandemic and fixated in the genome, being present in >99% of the sequences in the following periods: D614G in the Spike protein and P323L in nsp12. Both changes could be detected since period 1.2. The second category (row 3) had Spike's aa change A222V, which increased only in period B (80% of the Spike's sequences) but decreased in the following periods. The third category (rows 4-24) lists the 21 aa changes that increased their frequency during the third wave or period 3, with a maximum prevalence during period 4, decreasing in the next period. Within this group, seven changes (2 in nsp6 and 5 in the Spike) show an increasing tendency in the last period of this study. The fourth category (rows  corresponds to the 22 aa changes that increased in period 5, slightly decreasing their frequency in the next epidemic wave. Finally, the last category (rows [47][48][49][50][51][52][53][54][55][56][57] shows the aa changes that became more frequent in the last period. To detect aa changes or deletions with a significant prevalence in the AC, regardless of their frequency in the total Spanish available sequences during the study period, changes with a frequency ≥10% in each of the 18 AC were analyzed. After discarding the changes that coincided with the 57 most frequent aa changes previously described in the complete set of Spanish sequences (Figure 3), 79 additional changes were found (Supplementary Table S5). Most of these changes were located in the Spike (30). The AC harboring the highest number of changes was Galicia (36), followed by Catalonia (35), and La Rioja and Madrid (34) ( Figure 5). Two AC, Castile La Mancha and Navarre, showed no additional changes. Among the AC, most of these changes were present in the Spike, nsp3, nsp6, and the Nucleocapsid.
Three aa changes were present with a frequency ≥25% in at least one AC: nsp13 K460R in Cantabria (33.13%), ORF10 V30L in Aragon (28.10%), and Nucleocapsid A220V in Aragon and Madrid (31.80% and 26.18%, respectively). These changes were further analyzed to detect if they could be allocated to one or more periods among the six studied. K460R was present throughout the third period and first half of the fourth. V30L was detected mainly in period 2, although it persisted until period 4. A220V was detected earlier in Madrid, in period 1.2, while in Aragon, its detection was delayed until period 2.1, although it reached a greater frequency during the third period in both AC.
The frequency of these 57 changes and deletions was further analyzed by epiweeks, grouped into the six previously described periods, in line with the Spanish epidemic curve: 1 (24 February to 20 June 2020), 2 (21 June to 5 December 2020), 3 (6 December 2020 to 13 March 2021), 4 (14 March to 19 June 2021), 5 (20 June to 16 October 2021), and 6 (17 October to 29 January 2022). According to the frequency difference (Δ) of these aa changes between consecutive periods, the aa substitutions were grouped into five categories (Figure 4).

SARS-CoV-2 Lineages Circulating in Spain during the First Year of the Pandemic per Study Period
After performing the sequence quality control as described in the Methods section, a total of 82,655 sequences were successfully assigned to a lineage according to the Pangolin COVID-19 Lineage Assigner. The complete classification is available in Supplementary Table S6. Figure 6 illustrates the main SARS-CoV-2 lineages per period in Spain after analyzing all available Spanish sequences deposited in GISAID. The figure also includes the epidemiological curve according to the RENAVE Spanish COVID-19 incidence data for each epidemiological week. The Spanish incidence and mortality information retrieved from RENAVE in the study period can be found in Supplementary Table S7 and Figure  S1.

SARS-CoV-2 Lineages Circulating in Spain during the First Year of the Pandemic per Study Period
After performing the sequence quality control as described in the Methods section, a total of 82,655 sequences were successfully assigned to a lineage according to the Pangolin COVID-19 Lineage Assigner. The complete classification is available in Supplementary Table S6. Figure 6 illustrates the main SARS-CoV-2 lineages per period in Spain after analyzing all available Spanish sequences deposited in GISAID. The figure also includes the epidemiological curve according to the RENAVE Spanish COVID-19 incidence data for each epidemiological week. The Spanish incidence and mortality information retrieved from RENAVE in the study period can be found in Supplementary Table S7 and Figure S1.
During the first study period (24 February to 20 June 2020), before the national lockdown (period 1.1, 24 February to 14 March 2020), a total of 11 lineages were circulating in Spain among the sequences available in GISAID. A lineages predominated over B lineages (60.49% vs. 39.51%), with 53.09% of the sequences belonging to lineage A.2 and 7.28% to lineage A.5. After the lockdown, the presence of B lineages increased to 77.86% before the deconfinement plan (period 1.      (112/146). The Gamma and Beta VOC could still be detected in period 5 in low frequency (0.71% and 3.69%, respectively). In the last study period or period 6, the Delta VOC remained the most frequent variant (72.98%). Delta sublineages represented 74% of the circulating lineages and sublineages during this period (127). The Omicron VOC was introduced and quickly increased its frequency over the epidemiological weeks, representing 26.99% of the sequences circulating in Spain in period 6.
The number of available SARS-CoV-2 sequences in each AC was uneven (Supplementary Table S6). Figure 7 illustrates the SARS-CoV-2 lineages' evolution in each Spanish AC and study period, after including the AC with at least 10 sequences for each phase or period.
In period 1.1, the A.2 lineage predominated in Andalusia, Basque Country, La Rioja, Navarre, and the Valencian Community, whereas B lineages (B.1 followed by B) were the main circulating lineages in Aragon, Asturias, Balearic Islands, Catalonia, Extremadura, Galicia, and Madrid (Figure 7). In the next period (1.

Discussion
Monitoring SARS-CoV-2's genetic diversity and emerging mutations in this ongoing pandemic is essential to understand the evolutionary trend of this new coronavirus and to ensure the performance of new diagnostic tests, vaccines, and therapies against COVID- 19. Spain has been one of the European countries with the highest number of COVID-19 cases, according to the European Centre for Disease Prevention and Control (ECDC, https: //www.ecdc.europa.eu (accessed on 22 January 2022)). Previous studies have analyzed the epidemiology of SARS-CoV-2 in Spain [5,112], certain Spanish cities or AC [113][114][115][116][117][118], and variants [119,120]. However, as far as we know, this is the first study including all SARS-CoV-2 GISAID available sequences from the 17 AC and 2 Autonomous Cities since the beginning of the pandemic until the sixth epidemiologic wave, including the most prevalent mutations. This descriptive study reports not only on the Spanish circulating variants in the different study periods and AC, but also on the conservation, most frequent aa changes, mutation rate, and genetic variability across structural, non-structural, and accessory SARS-CoV-2 proteins in Spain during the first two years of the pandemic, discussing their possible structural and biological implications.
In the Spanish SARS-CoV-2 sequences downloaded until January 2022, the mean genome mutation frequency (Mf) was 1.24 × 10 −5 . In a previous study with SARS-CoV-2 global sequences collected until 21 August 2020, the overall point mutations took place at a frequency of 9.4 × 10 −6 [121]. Despite the difference regarding the geographical origin of the samples, this suggests an increase in the number of point mutations along the viral genome throughout the last few years.
The mutation frequency (Mf) and mean aa conservation were similar between nonstructural, structural, and accessory proteins, while the percentage of variable positions in the aa sequence and the mean changes per sequence showed greater differences among the three groups of proteins.
Non-structural proteins had the lowest Mf (1.05 × 10 −5 ), highest Ts/Tv ratio, highest conservation (99.84%), and least variable aa positions per sequence (1.25). The fact that many nsp are involved to a greater or lesser extent in the replication and transcription complex (Table 1) could explain why these proteins are more conserved and less mutation-tolerant. However, in Roy et al.'s analysis, nsp presented a much lower Mf (8.78 × 10 −6 ) [121], suggesting that, although highly conserved, point mutations have increased even in non-structural proteins.
Structural proteins presented the highest Mf (1.60 × 10 −5 ), being the only group of proteins with more transversion than transition events (Ts/Tv ratio 1:2.26). In Roy et al.'s study, all the SARS-CoV-2 genes had transition:transversion ratios greater than 1, although a considerable number of transversions were detected, highlighting the fact that these mutations are less likely to maintain the structural properties of the original amino acids [121]. Nevertheless, Roy et al.'s study was performed before the circulation of more heavily mutated VOC, while, in our study, there was a large proportion of sequences belonging to VOC lineages. The fact that VOC harbor a significant number of mutations in the Spike protein [122][123][124] would correlate with the greater number of mean aa changes per sequence in the structural proteins (3.87), mainly in the Spike (10.80), found in our dataset.
Despite these results, accessory proteins presented lower aa conservation and a greater number of variable aa positions compared to the structural proteins (99.36% vs. 99.42% and 97.49% vs. 86.49%, respectively), indicating that, in accessory proteins, the mutations affect more residues along the length of the protein, while, in the structural proteins, mutations are concentrated in certain positions of the protein gene. Many accessory proteins influence the host immune response and participate in viral virulence (Table 1). This fact could be related to the greater variability detected in these proteins, given that, in the context of the adaptation of SARS-CoV-2 to the human host throughout the pandemic, the virus has been progressively exposed to natural or vaccine-induced antibodies.
In Orthocoronavirinae, the sections of the genomes that show the largest divergence in protein domains are located in the proteins encoded in the N-terminal end of the ORF1ab, the Spike, and mainly in the accessory proteins, where each subgenus possesses an almost subgenus-specific set of accessory proteins [17]. On the other hand, the other structural proteins and the nsp implicated in the RTC, such as 3C-like protease (nsp5), RNA-dependent RNA polymerase (RdRp, nsp12), and Helicase (nsp13), show stable domain architectures across all Orthocoronavirinae [17]. In our Spanish sequence set, accessory proteins showed the lowest percentage of conserved positions (Figure 1b). Among them, ORF8 presented changes in all its positions. ORF8 has been described as a rapidly evolving accessory protein, proposed to interfere with immune responses [88], mediating the immune evasion of SARS-CoV-2 [86], which can explain why ORF8 harbored changes in 100% of its residues. In contrast, the proteins with the highest number of conserved positions were the nsp, specifically nsp5, nsp10, and nsp13 (>35% of conserved positions, Figure 1b).
Among the four structural proteins, the Nucleocapsid and the Spike genes showed more transversion than transition events, being N the most mutation-prone gene with the highest mutation frequency (Table 2), a trend previously observed in other studies performed in Spain and other countries, such as Canada and South Africa [125]. The Spike presented the highest mean aa change/deletion frequency per sequence, while presenting more conserved positions along its structure than N, pointing again to the high presence in the total sample of Spike heavily mutated VOC. When examining the aa conservation, it was over 99% in the four structural proteins, slightly higher in the Envelope (99.84%), which also presented fewer mean changes per sequence. The Membrane was the protein with a lower Mf (9.14 × 10 −6 ), lower percentage of variable positions, and more sites with a WK of 1 (31.53% of its positions). The lower variability of M and E is in line with other studies including worldwide sequences [125][126][127]. However, the E gene has shown signatures of positive selection along with S in previous studies [128].
Analyzing the position-specific aa variability in the structural proteins can point to which regions or domains of the protein are most and least conserved. This can be useful to put into context the performance of current real-time reverse transcriptase-polymerase chain reaction (RT-PCR)-based diagnostic tests or for a more rationale design of new diagnostic tests and vaccines. The introduction of the Alpha variant revealed that the failure of some RT-PCR-based diagnostic tests to detect the S gene (S gene dropout) could be used for its diagnosis [129,130]. This method was widely used to detect this variant in Spain during the first few months after its introduction [131]. Although newer RT-PCR tests have introduced many other targets, S gene dropout could still be useful to detect the Omicron variant with some of them [132,133].
In the Wu-Kabat analysis, 132 of the Spike's positions had a WK of 1, indicating no aa variability, most of them within the S2 subunit. Compared to a similar analysis performed on worldwide Spike sequences retrieved until June 2020 by Rahman et al., our results showed a much lower rate of invariable positions (48% vs. 10%) [134], indicating a greater number of Spike's sites prone to aa changes in the last few years, compatible with viral evolution. However, 76% of these completely conserved sites were the same positions as in Rahman's et al. study, most of them (86%) located in the S2 subunit. As for highly variable sites, our analysis showed a 25-times increase in sites with WK > 4 (472 vs. 19 sites) [134]. The RBD within the S protein is the primary target of neutralizing antibodies in naturally acquired or vaccine-elicited humoral immunity [135]. In our results, the RBD had a median WK of 4 with a maximum coefficient of 11.53 in site 484, followed by site 501, both located within the receptor-binding motif. Changes in these sites have been reported in several VOC variants and have been related to the neutralization escape of antibodies [136].
A similar variability increase was observed when comparing the N results to another study by the same author performed on global Nucleocapsid sequences (retrieved until July 2020), which showed lower variability [137]. The N protein presented a WK of 1 in 24% of the sites vs. 3.10% in our study, with only seven positions coinciding, together with more highly variable positions with a WK > 4 (64% vs. 27%) [137]. The N region with greater variability was the SR linker (WK 7), in line with previous global studies [126]. This region forms a phosphorylation-dependent binding domain for protein 14-3-3, a signaling molecule involved in various cellular processes, such as cell cycle, survival, and death [138].
On the other hand, Rahman et al.'s Wu-Kabat analysis of the Envelope, performed on global sequences retrieved until August 2020 [127], showed similar results to our analysis, finding the same percentage of E sites with a WK of 1 (16%) and almost the same number of variable sites with a WK > 4 (13 vs. 14 in our study). This suggests that, disregarding the geographical origin and the year of sampling, E is highly conserved among SARS-CoV-2 variants.
It has been reported that multiple introductions of SARS-CoV-2 to Spain took place at the beginning of the pandemic [5,112]. In the first study period (24 February to 20 June 2020), we detected 44 different lineages and sublineages circulating in Spain. In period 1.1, before the national lockdown, more than 60% of the Spanish sequences belonged to A lineages, in contrast to the predominance of B lineages found in other European countries during this period [139]. The two main A sublineages circulating in Spain at that moment were A.2 and A.5, both classified as endemic Spanish lineages [5,112]. However, B sublineages were more frequent in some AC (Figure 7), especially B.1, the second most frequent lineage after A.2 in our dataset. The B.1 lineage corresponds to a large European lineage whose origin is related to the Northern Italian outbreak early in 2020 [140], and it became the main circulating variant during the rest of period 1. The national lockdown effectively reduced the reproductive number and COVID-19 incidence [141,142]. The reduction in SARS-CoV-2 variants' diversity between period 1.2 and period 1.3 suggests that the national lockdown was also effective in reducing the import of SARS-CoV-2 lineages during this period.
Two aa changes, D614G in the Spike protein and P323L in the RNA-dependent RNA polymerase (RdRp, nsp12), increased in frequency during period 1 and became dominant in the rest of the study periods ( Figure 4). The success of the D614G mutation early in the pandemic was noted worldwide [126] and has been related to an increase in viral fitness [143,144]. This change was usually accompanied by RdRp P323L mutation, previously known as the "G clade" by GISAID nomenclature [144]. Although P323L is not located in the RdRp catalytic site, due to the RdRp's key role in viral replication, any changes in its structure are of concern. It has been suggested that this change could alter RdRp's interaction with its cofactors and anti-viral drugs [145]. Both changes have also been related to increased COVID-19 severity [146], but P323L's effect on viral fitness remains unclear.
During period 2 (21 June to 5 December 2020), B.1.177 became the most successful lineage in Spain and the main lineage in most AC (9 AC in period 2.1 and 12 AC in period 2.2). This lineage has been related to the opening of borders within Europe during the summer of 2020, which allowed the rapid spread of the B.1.177 variant from Spain to other European countries [147]. Among the predominant aa changes detected during this period and related to this lineage (Figure 4), A222V Spike mutation had been detected in March in Tunisia and Iran, with a low mutation rate that increased in Spain in June 2020, similarly to A220V Nucleocapsid mutation [148]. In contrast to D614G, none of them have proved to confer increased transmissibility to the virus [147]. Therefore, the success of this lineage could be more directly linked to a lack of epidemiologic control in the viral spread than to an increase in viral fitness. The greater variant diversity detected in Spain during this period, most related to European countries but also from other continents, suggests that, despite the efforts to avoid SARS-CoV-2 spread between countries, travel restrictions during the summer of 2020 were not sufficient.
In the following periods, two VOC spread successfully after their introduction in Spain: Alpha and Delta. Both VOC have been associated with an increase in transmissibility and disease severity [149][150][151], but only the Delta VOC has shown evidence of an impact on immunity [152][153][154]. The Alpha VOC (B.1.1.7) was detected in our dataset for the first time in eight sequences collected in the Valencian Community in period 2.2 (October-November 2020). Its frequency increased during period 3 (December 2020-March 2021), representing almost half of the studied sequences, becoming the main circulating lineage in period 4 (March-June 2021). In the Spanish National Health report on circulating variants published on 26 March 2021, the Alpha variant represented >50% of the sequences in most AC and >70% in eight AC in the random sampling for epidemiological surveillance [155], becoming the dominant variant in June 2021 [156].
The Delta variant (B.1.617.2/AY) was detected for the first time in our dataset during period 4 in 11 AC, becoming the main circulating lineage of the following epidemic waves (June 2021-January 2022, periods 5 and 6). In the Spanish National Health report on circulating variants published in August 2021, the Delta variant increased its incidence during the summer of 2021, accounting for 47 to 96% of the COVID-19 cases across the different AC in July 2021 [157]. We found great diversity in the circulating Delta clusters detected in our sequence set. The ones detected during periods 4 and 5 were European, circulating mainly in Spain, the United Kingdom, Germany, and France. During the last period, Delta clusters common outside Europe, circulating in the United States of America, increased their frequency.
The rapid and efficient spread of these VOC suggests an increase in SARS-CoV-2's viral fitness promoted by specific aa changes. In our analysis of the most frequent changes throughout the six epidemic waves (Figure 4), many of them were associated with certain periods in which one of the mentioned VOC prevailed. Spike mutations were the most abundant mutations, present in ≥10% of the total sequence dataset. Three of these mutations were located in the Spike RBD: L452R, T478K, and N501Y. L452R increased its frequency during periods 5-6. It is present in the Delta, Epsilon, and Kappa variants and has been related to immune escape [158,159]. T478K was mainly present in the last two periods. It can be found in the Delta and Omicron VOC and has been associated with increased ACE2 affinity and immune escape [160]. N501Y increased in periods 3-4 and 6. This aa change is present in the Alpha, Beta, Gamma, and Omicron VOCs, being Alpha the main lineage in period 4 and Omicron an increasing lineage in period 6. N501Y has been associated with greater ACE2 affinity and increased viral replication in human upper airway cells [161][162][163]. Several highly prevalent aa changes were located in the Spike's S1 subunit (outside the RBD), five of them being deletions. H69del, V70del, and Y145del increased during periods 4 and 6. They are present in the Alpha and Omicron VOC. H69del and V70del have been associated with increased infectivity in Spike proteins that have acquired immune escape mutations that carry an infectivity cost [164,165]. Y145del has been described to impact immunity [166,167]. Deletions in sites 157-158, together with E156G (periods 5-6), have been associated with higher infectivity and reduced sensitivity to neutralization [152,158] and are present in the Delta VOC. The Spike site 618 is located next to the SARS-CoV-2 furin cleavage site. Two aa changes, P681H/R, were found in this site, mainly in periods 4 and 5, respectively. P681H is present in the Alpha and Omicron VOC and may increase the rate of Spike protein cleavage [163,168], although this mutation has not been proven to impact viral entry or spread [169]. P681R, also present in the Delta VOC, has been reported to have a similar effect as that described in P681H, increasing furin-mediated cleavage [170]. During period 4, A570D (end of S1 subunit) and S982A (S2 subunit) frequency increased. These mutations are present in the Alpha variant and may enhance cleavage into the S1 and S2 subunits by reducing the intermolecular stability of Spike protein subunits [163]. Fewer highly prevalent mutations were located in the Spike S2 subunit. Among them, D1118H (present in the Alpha variant) increased in period 4. This mutation has been suggested to impact trimer assembly [171], but its implications are not well known.
Another structural protein where several highly prevalent mutations could be found was the Nucleocapsid. Among these changes, three were located in the SR linker: R203K/M and G204R. R203K and G204R increased in period 4. It is a double aa change observed in global sequences [126]. It has been reported that these changes have arisen by homologous recombination rather than stepwise mutation and that viruses harboring these aa changes may also have increased expression of sub-genomic RNA from other open reading frames [172]. The Nucleocapsid protein's main functions involve RNA binding, the replication and transcription of viral RNA, and the formation and maintenance of the ribonucleoprotein complex (see Table 1). However, it also participates in type I IFN inhibition [35,36] and the upregulation of subgenomic RNA and protein levels of the N protein have been observed in the Alpha variant, leading to enhanced immune evasion [173].
As for the other proteins, highly prevalent changes were found in non-structural proteins 3, 4, 6, 12, 13, and 14, and the accessory proteins ORF3a, ORF7a/b, and ORF 8. Most of these aa changes' biological implications are not well known. However, nsp3, nsp6, ORF7a, and ORF 8 have been associated with host immune response evasion, as described in Table 1. The nsp6 deletions 106-108 are present in the Alpha, Beta, Gamma, Eta, Iota, Lambda, and Omicron variants, and could play a role in IFN-I evasion [161] due to nsp6's role in antagonizing the type I interferon (IFN-I) response [43]. ORF8 has also been related to the modulation of the immune response (Table 1), and although the implications of R52I and Y73C are unknown, they may impact the ORF8 structure. R52 establishes two hydrogen bonds that stabilize its structure, and Y73 is part of a motif responsible for stabilizing an extensive noncovalent dimer interface [88]. ORF7a has been less studied, but this protein is involved in type I INF inhibition [43], NF-κB activation [80], JNK and IL-8 activation [80], and modulation of the inflammatory response (see Table 1). Further studies should be performed to clarify the impact of accessory protein mutations in SARS-CoV-2 host immune evasion. Nsp3 also plays an important role in other crucial functions such as polyprotein processing and viral spread (Table 1). Nsp12 (RNA-dependent RNA polymerase), 13 (Helicase), and 14 (Exonuclease) have major implications in the RTC ( Table 1).
The Beta (B.1.351) and Gamma (P.1) VOC were also detected in Spain after period 3 but in low frequency (<1%). According to the official reports, these VOC were present in Spain in the following months but in a small proportion [156]. In our analysis, the Gamma VOC reached greater prevalence during period 4 (4.44%) and the Beta VOC during period 5 (1.61%).
The first sequences belonging to the Omicron VOC in our dataset were detected in the last period (October 2021-January 2022) in 12 AC, representing 26.99% of the Spanish sequences circulating in period 6. The Omicron variant was first reported to WHO from South Africa on 24 November 2021 and later declared a VOC. This variant is the most mutated SARS-CoV-2 variant to date and has been associated with an increase in infectivity and transmissibility and immune escape, but not with greater COVID-19 severity, and even with milder symptoms [124,174]. However, it exhibits significant resistance to the neutralizing activity of current vaccines [174,175]. According to the January 2022 Spanish National Health report, the Omicron variant was introduced into Spain in late November 2021, increasing its incidence progressively until it surpassed Delta in mid-December 2021, accounting for 70-90% of the cases in the different Spanish AC [176]. Among the Omicron sequences studied, 99.6% were BA.1 sublineages, with only 22 sequences belonging to the BA.2 sublineage. Both BA.1 and BA.2 are considered Omicron VOC, although they differ in their genetic sequence [108,174]. In a later report, published in February 2022, there was an increasing tendency in sublineage BA.2 cases [177]. As of May 2022, BA.2 is the predominant sublineage in Spain [178].
Spike mutations have been studied in more depth than other SARS-CoV-2 protein mutations, mainly due to the protein's major role in infection, vaccine development, and antibody escape that some of these mutations may elicit [179][180][181]. However, it is essential to study nsp and accessory protein mutations and their implications, as many highly successful variants share mutations in proteins other than the Spike that could impact the host immune response or viral fitness. For this, randomized sequencing should be continued even in low-incidence settings. Moreover, SARS-CoV-2 sequencing should be encouraged in low-income countries by implementing international collaborations when possible.
To date, 75% of the European population has received at least one dose of COVID-19 vaccine, according to the ECDC. In Spain, the COVID-19 vaccination campaign began in December 2020 and was developed in stages, prioritizing certain population groups after the evaluation of their risk of exposure, transmission, and serious disease, as well as the socioeconomic impact of the pandemic, mainly healthcare workers and elders [182]. To date, 93% of the population over 12 years of age has received full vaccination and 80% of them at least one booster dose, while more than 40% of the pediatric population has received the complete vaccination schedule [182].
The Spike protein is the main protein used as a target in COVID-19 vaccines. Vaccineinduced neutralizing antibodies (nAbs) can target the S protein to inhibit virus infection at multiple stages during the virus entry process, being the RBD the major target for nAbs interfering with viral receptor binding [183,184]. Furthermore, the S protein is also a target for T-cell responses [185]. There has been some controversy regarding whether vaccination can be a source of SARS-CoV-2 mutations [186], and two antibody-disruptive co-mutations in the Spike (Y449S and N501Y) have been described as a new vaccine-resistant transmission pathway [187]. However, it has also been stated that vaccines can prevent their emergence [188,189]. SARS-CoV-2's main mechanism of evolution is natural infectivitybased selection [187], where a high number of infections and high viral load within the host would facilitate the emergence of a wider range of mutations. Current vaccines have proven effective in reducing the number of infections and hospitalizations [190,191]. Even in the presence of VOC with mutations that alter vaccine efficacy, full vaccination is effective against severe COVID-19 caused by non-Omicron variants [192], resulting in a milder and shorter course of COVID-19, while booster doses have proven to improve neutralization against Omicron [175]. Furthermore, intra-host viral evolution during persistent infections leading to SARS-CoV-2 mutations identified in immune escape variants has been observed in immunocompromised patients [193,194]. In spring 2022, the mandatory use of face masks was repealed in Spain [195]. Although Omicron infections are generally milder, given the increasing incidence of COVID-19 in Spain, a second booster dose for elders and immunocompromised patients who are at risk of hospitalization should be considered. Meanwhile, the development of vaccines that include Omicron mutations should be encouraged. Currently, Pfizer and Moderna are evaluating Omicron-based vaccines [196,197].
However, emerging SARS-CoV-2 variants presenting a large number of mutations in the Spike protein may interfere with vaccine efficacy, as has been observed with the Omicron variant [174,175], and other targets should be considered for vaccine development. Within the Spike, according to our data, the S2 subunit is more conserved than the S1 subunit. S2 can also be a potential target for nAbs that interfere with the structural rearrangement of the S protein and the virus-host membrane fusion [198,199], and it would be interesting to include it in vaccine design together with other SARS-CoV-2 protein targets. However, this subunit contains more extensive N-glycan shielding and is less immunogenic than S1 [200].
According to our data, the E and M proteins are highly conserved among variants, which would make them suitable candidates for vaccine development. These proteins have already been proposed as vaccine targets [201,202]. However, the M and E proteins are poorly immunogenic [203], although they present T-cell epitopes in SARS-CoV and MERS-CoV [204]. Therefore, similarly to the Spike S2 subunit, these two proteins could be useful to broaden vaccine protection if included together with the S1 Spike subunit as an optimization strategy.
Another suitable option to avoid vaccine inefficacy due to emerging mutations is the use of inactivated or attenuated vaccines that contain the complete virus, which would theoretically induce broader antibody and T-cell responses, being less likely to become ineffective in the context of new SARS-CoV-2 variants. Currently, there is one live attenuated virus and nine inactivated virus vaccines in phases III and IV of clinical evaluation according to the WHO [205].
Nevertheless, considering currently available vaccines, a large part of the world population remains unvaccinated, with the global risk that this poses. For this reason, the WHO and other entities have created the Multilateral Leaders Task Force on COVID-19 Vaccines, Therapeutics, and Diagnostics (www.covid19taskforce.com (accessed on 12 May 2022)), whose aim is to vaccinate 40% of each country's population by the end of 2021 and 60% by mid-2022, an aim that has yet to be met in most African countries. Access to vaccines in developing countries is a major concern, and European and other developed countries should promote these objectives to the best of their ability.
As for therapeutical approaches, at the beginning of the pandemic, the high mortality and lack of effective treatment options encouraged the use of repurposed drugs such as chloroquine or lopinavir [206,207], lacking robust clinical evidence of their efficacy and no longer recommended by the WHO [208]. Clinical trials are still under development for various monoclonal antibodies, although many have been ceased due to futility [209]. Remdesvir (Veklury by Gilead Sciences), a broad-spectrum antiviral originally developed to treat other viruses such as Ebola, was the first repurposed drug approved by the FDA for the treatment of hospitalized people aged 12 years and older with COVID-19 [210,211]. This drug inhibits viral RNA-dependent RNA polymerase (RdRp, nsp12) while evading proofreading by viral exoribonuclease, which leads to premature termination of RNA transcription [212]. An initial WHO conditional recommendation made in November 2020 suggested not to use remdesivir for patients with COVID-19, regardless of illness severity. However, in the tenth iteration of the guideline, a new WHO recommendation was made for the use of remdesivir for patients with non-severe illness at highest risk of hospitalization [208]. The recommendation for patients with severe or critical COVID-19 is currently under review and it will be updated shortly. Using computational approaches, it has been proposed that remdesivir binds to more than one target of SARS-CoV-2, showing strong binding affinity with the M protein, RdRp, and np5 or 3CLpro [110].
Regarding new drugs to be developed, non-structural proteins are good candidates considering their lower Mf and highest conservation according to our data, together with their critical role in the replication and transcription complex (Table 1). Among them, some proteins can be interesting drug targets, such as the previously mentioned 3-chymotrypsinlike protease (3CLpro or nsp5), the protein with the lowest Mf in our dataset (7.73 × 10 −6 ), involved in polyprotein processing (see Table 1). Indeed, 3CLpro inhibitor molecules have proven to increase survival in infected mice [213] and have been considered as candidates to inhibit SARS-CoV-2 [214].
Currently, an elective inhibitor of the SARS-CoV-2 3CLpro developed by Pfizer (paxlovid, nirmatrelvir-ritonavir) has reached a phase III trial [209], and the WHO has set a strong recommendation for its use in non-severe COVID-19 patients at highest risk of hospitalization, considering it the best therapeutic choice for high-risk patients to date [215]. Pfizer's oral antiviral drug paxlovid (PF-07321332 + ritonavir) reduces hospital admissions and deaths among people with COVID-19 who are at high risk of severe illness (with a reported reduction of 89% within three days of symptom initiation) when compared with a placebo [214,216]. PF-07321332 is a reversible covalent inhibitor that targets SARS-CoV-2 3CL-pro, forming a covalent bond to the catalytic nsp5 residue C145, being further stabilized through a network of hydrogen bonds and hydrophobic interactions, which enhance its binding to the active site of 3CL-pro, involving another five residues [111]. In our variability analysis of the sites involving paxlovid and remdesivir binding, we found high conservation in all the residues. No aa changes were detected in C145 in our dataset, and only a single sequence presented a deletion in one of the residues (Q192) that enhanced nsp5 binding to paxlovid. However, future analysis should be conducted to survey the emergence of 3CL-pro mutations in the context of SARS-CoV-2 treatment regarding these drugs.
Other nsp proteins that were highly conserved in our results were the helicase (nsp13), with the third lowest Mf (Table 2), and the 2 -O-Methyltransferase (nsp16). However, many factors, including toxicity, bioavailability, and effective delivery, must be considered for drug development. Clinical trials are complex and expensive, and the fall in mortality due to vaccination, higher preparedness in hospitals, and preventive measures such as face masks and hand-washing may reduce the efforts devoted to the development of new drugs against COVID-19. Nevertheless, similarly to new vaccine development, in the context of the increasing SARS-CoV-2 variability detected in many of its proteins in this study, the continuous emergence of new variants across the globe, and the risk of the future reemergence of this virus or other coronaviruses, drug development should be encouraged and pursued.
The main limitation of this study is the uneven number of SARS-CoV-2 sequences across AC and study periods available in the GISAID database, especially at the beginning of the pandemic. This was due to many factors, such as technical and economic availability for SARS-CoV-2 sequencing across the Spanish hospitals, variable incidence among periods and AC, and differences in the diagnostic protocols between AC.
Although the present study only focuses on the SARS-CoV-2 evolution in one country during the first year of the pandemic, these data can be of high interest since Spain has been one of the main epicenters for COVID-19, reaching the highest number of cases and deaths per 100,000 population in Europe at the beginning of the pandemic. Furthermore, the fact that Spain is one of the leading European tourist destinations can favor the spread of new SARS-CoV-2 variants and could explain the high diversity of circulating variants observed in our study, mainly after the lockdown.

Materials and Methods
SARS-CoV-2 sequences were downloaded in nucleotides (nt) from the publicly available GISAID repository (https://www.gisaid.org/ (accessed on 02 February 2022)). We selected those sequences classified as human hosts, located within Europe/Spain and ascribed to an Autonomous Community (AC), submitted until 2 February 2022, and collected from 24 February 2020 to 29 January 2022. We then classified the sequences according to the epidemiological week (epiweek) by collection date. Epiweeks are a standardized method of counting weeks to allow for the comparison of epidemiological data. By definition, the first epiweek of the year ends on the first Saturday of January, as long as it falls at least four days into the month. Each epiweek begins on a Sunday and ends on a Saturday. The present study included SARS-CoV-2 sequences collected from 2020 epiweek 9 (24 February 2020) to 2022 epiweek 4 (29 January 2022).
To contextualize the changes in the virus throughout the pandemic, epiweeks were grouped into six main periods adjusted to the Spanish epidemic curve, as informed by the National Epidemiological Surveillance Network (RENAVE) [217]. Period 1 was further divided into three phases according to the Spanish government's measures implemented to prevent the spread of the virus: period 1.1 before the national lockdown, period 1.2 during the national lockdown until the beginning of the national deconfinement plan, and period 1.3 until the end of the first epidemic wave. Period 2 was subdivided into two periods according to the two peaks of incidence in this second epidemic wave, one after summer 2020 with a rise in the instantaneous basic reproductive number (Rt) at the beginning of July included in period 2.1, and a second peak before winter 2020 with another rise in the Rt in mid-October covered by period 2.2. The time span and major events of each study period are described in Table 4. Wuhan SARS-CoV-2 was taken as the reference sequence (NCBI accession number NC 045512.2) to identify the nt mutations and aa changes in the annotated proteins. Sequence analysis was performed with an in-house bioinformatics tool (EpiMolBio) previously designed and used in our laboratory for HIV genetic variability analysis and recently updated for SARS-CoV-2 sequence study [126,[218][219][220][221][222]. This tool is programmed in JAVA OpenJDK version 11.0.9.1 using IDE NetBeans version 12.2 and allows the simultaneous analysis of a high number (>650,000) of sequences. Functions related to protein tracking, trimming, and aligning were tested with Mega X, and functions related to aa change identification were tested manually and using Excel 2019 version 19.0. Using EpiMolBio tool, the complete nt sequences from 26 structural, non-structural (nsp), and accessory viral proteins were cut, aligned, and translated into amino acids (aa). The final analysis included nsp1-10 (polyprotein1a and 1ab), nsp11 (polyprotein 1a), nsp12-16 (polyprotein1ab), structural proteins Spike (S), Nucleocapsid (N), Membrane (M), and Envelope (E), and accessory proteins 3-10, according to NCBI 045512.2 annotation. This program detects any nt/aa in the sequence set different to the reference one for each position and calculates the number and frequency of nt/aa changes for that site, ignoring unidentified nt, nonsense mutations, and unknown amino acids that could be present due to the low quality of some regions of the original sequences, failing to attribute a nucleotide with certainty. EpiMolBio tool allows the analysis of partial or low-quality genomes as long as the residue of the studied position is present, enabling a much larger set of sequences to be studied.
The number of polymorphisms in the SARS-CoV-2 pan-genome and each studied protein was calculated, as well as the ratio of transitions (nt changes between the two purines A and G or between the two pyrimidines C and T) and transversions (nt changes between a purine and a pyrimidine). We also calculated the frequency of base mutations (Mf) or mutation frequency according to the following formula: Mf = P i/(Ln × N) [121], being Pi the number of instances of polymorphism detected, Ln the nucleotide length of the genome or locus, and N the number of sequenced entities present in the dataset.
In each of the 26 SARS-CoV-2 proteins, we calculated and compared the mean aa conservation, the number of aa changes, deletions, or stops, and the number of conserved and variable positions within each protein. We also identified the presence of mutations and the variability of the SARS-CoV-2 main protease (nsp5 or 3CL-pro) residues involved in binding with two of the current WHO-recommended drugs for COVID-19 [208], remdesivir and nirmatrelvir-ritonavir (sold under the name Paxlovid).
The Wu-Kabat protein variability coefficient (WK) was calculated and analyzed in the context of the proteins' domains and relevant functional sites, according to the Uniprot database (https://www.uniprot.org (accessed on 15 February 2022)) annotation. This coefficient allows the study of the susceptibility of an aa position to evolutionary replacements [223]. It is calculated using the following formula: Variability = N × k/n, where N is the number of sequences in the alignment, k is the number of different amino acids at a given position, and n is the absolute frequency of the most common amino acid at that position. Therefore, a WK of 1 indicates that the same aa was found for that position in the entire sequence set, whereas a WK > 1 indicates the relative variability of the respective site, with greater diversity as the WK value increases.
The frequency of the aa changes and deletions was calculated in all the Spanish sequences. Those changes present in ≥10% of the sequences were further studied. To detect the behavior of these changes in time (increase or decrease in their frequency), they were analyzed in the six main periods previously described, calculating the frequency difference between periods (∆) and comparing it. A second analysis was carried out considering the aa changes and deletions different to those previously detected in the complete Spanish dataset and present in ≥10% of each of the 17 AC and two Autonomous Cities. This allowed us to detect any relevant aa change limited to a particular AC, given that the course of the pandemic and the containment measures established since the deconfinement plan differed between AC. The changes were located in each protein, compared between AC, and those with a high prevalence (≥25%) were also analyzed by period. The 17 Spanish AC are Andalusia, Aragon, Asturias, the Balearic Islands, Basque Country, the Canary Islands, Cantabria, Castile La Mancha, Castile and Leon, Catalonia, Extremadura, Galicia, La Rioja, Madrid, Murcia, Navarre, and the Valencian Community. The two Autonomous Cities were grouped into an 18th AC to simplify the analysis comprehension; these are Ceuta and Melilla, both located in the North of Africa.
For the correct assignment of the SARS-CoV-2 variants, the quality of all the downloaded sequences was checked using Nextclade v.1.11.0 (https://clades.nextstrain.org/ (accessed on 10 May 2022)), and the sequences classified as "bad" quality were removed from the analysis. This tool performs quality control based on a score that considers missing data, mixed sites, private mutations, mutation clusters, stop codons, and frameshifts. The remaining sequences were assigned to the genetic lineages according to Pangolin COVID-19 Lineage Assigner v 4.0.6 (https://pangolin.cog-uk.io/ (accessed on 10 May 2022)) to contextualize the aa changes found in the different phases and the evolution of the pandemic in Spain during the study period. For this analysis, we also included 417 additional sequences from the Canary Islands that had not been classified according to the location criteria previously described in this section after confirming their geographical origin. The Pangolin COVID-19 Lineage Assigner software assigns lineages using a nomenclature based on a hierarchical system [104] and is the one currently used in Spain for the epidemiological monitoring of COVID-19. The Pangolin lineage list (https://cov-lineages.org/lineage_list.html (accessed on 10 May 2022)) was used to locate the main countries of origin of the detected Spanish lineages. Funding: This research was partially supported by FONDOS FUR 2020/0285 and by Fundación Familia Alonso. This study is also included in the "Subprograma de Inmigración y Salud" from CIBERESP (Spain). P.T. was funded by ISCIII-Programa Estatal de Promoción del Talento-AES Río Hortega exte. CM19/00057. R.R. was funded by FONDOS FUR 2020/0285 and Fundación Familia Alonso. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Sequences publicly available in the GISAID database https://www. gisaid.org (accessed on 2 February 2022).