Comparative Analysis and Classification of SARS-CoV-2 Spike Protein Structures in PDB

Abstract: The Spike (S) protein of the SARS-CoV-2 virus, which causes COVID-19, is considered the most important target for vaccine, drug and therapeutic research because it binds to the ACE2 receptor of host cells and enables viral entry. Analysis and classification of newly determined S protein structures of SARS-CoV-2 are critical to understanding their functional, evolutionary and architectural relatedness to already known protein structures. In this paper, first, a comparative analysis of SARS-CoV-2 S protein structures is performed: the S protein structures in the PDB (Protein Data Bank) are compared, on various parameters, not only with each other but also with the structures of other viruses. Second, the S protein structures in the PDB are classified into different variants, and the associated published literature is studied to investigate which therapeutics (antibodies, T-cell receptors and small molecules) are used on the structures. This is the first study that classifies the S protein structures of SARS-CoV-2 in the PDB into variants, and the comparative analysis results could be beneficial to the research community in general, and to crystallographers and health workers in particular.


Introduction
The SARS-CoV-2 (Severe Acute Respiratory Syndrome Coronavirus 2) virus [1], which causes COVID-19, was first identified in late 2019. This virus belongs to a family of single-stranded RNA viruses. They are spherical to pleomorphic in shape, with a diameter between 80 and 160 nm. The genome of this virus contains 6 to 12 open reading frames (ORFs). ORF1ab, the main reading frame, produces the pp1a and pp1ab polyproteins, which are cleaved by PLpro (papain-like protease) and 3CLpro (3C-like cysteine protease); 3CLpro is also called Mpro (main protease) [2–4]. The rest of the genome encodes four structural proteins (Envelope (E), Membrane (M), Nucleocapsid (N) and Spike (S)) and some non-structural proteins (Figure 1). The Spike protein is a trimeric class I glycoprotein located on the surface of the viral envelope and contains the S1 and S2 subunits. S1 contains binding domains, such as the receptor-binding domain (RBD) and the N-terminal domain (NTD), and S2 contains a fusion peptide and two heptad repeat (HR) domains, HR1 and HR2, that contribute to the virus's fusion [5].
SARS-CoV-2 can enter the host cell when the S protein interacts with a receptor present on the host cell surface called "angiotensin-converting enzyme 2" (ACE2) [6–10]. Thus, the role of the S protein is fundamental in the pathogenesis, transmission and virulence of SARS-CoV-2. Compared to SARS-CoV, the S protein of SARS-CoV-2 binds to ACE2 with a higher affinity [11]. After binding, the entry of SARS-CoV-2 depends on the S protein priming process. This process is carried out by TMPRSS2 [8] (Figure 1), which is a type 2 serine protease present on the host cell surface. Blocking or preventing the binding of S proteins with ACE2 receptors is considered the first and most important [...]
The S protein interaction with the ACE2 receptor can be studied using structural biology techniques, such as Cryo-EM (electron microscopy) and X-Ray crystallography. These techniques elucidate the 3D (three-dimensional) structures of proteins and their conformational changes. X-Ray crystallography provides high-resolution structural information about macromolecules. It is relatively cheap and simple, and can produce a high atomic resolution. The Cryo-EM technique provides macromolecular structures at near-atomic resolution.
Cryo-EM has certain advantages over X-Ray crystallography, such as avoiding the crystallization bottleneck and requiring lower protein amounts. Since the emergence of the SARS-CoV-2 virus, its protein structures have been deposited rapidly in online databases. Well-known databases include the Protein Data Bank (PDB) [12] and the Electron Microscopy Data Bank (EMDB) [13]. As of 5 November 2022, more than 195,000 protein structures had been deposited in the PDB. These online databases allow researchers to analyze any particular viral structure, its function and its molecular basis. These structures are used for (1) validating drug targets, (2) assessing the ability of the target drug, (3) characterizing the ligands, small molecules and other tool compounds that bind to the drug targets, (4) guiding the medicinal chemistry optimization of binding affinity and (5) overcoming the challenges in pre-clinical drug development.
Almost all proteins share structural similarities with each other because of the principles of chemistry and physics, which limit the number of ways a polypeptide chain can be folded into a compact globule. Comparing and classifying new 3D protein structures can reveal hidden and biologically interesting similarities. These insights can contribute to a better understanding not only of evolution but also of structural architecture and protein function, particularly for proteins that share little sequence identity with well-characterized proteins. These insights are important for drug discovery, for identifying new protein folds and for the phylogenetic analysis of the proteome. Considering the large number of S protein structures currently available in the PDB, we focus on the comparative analysis and classification of the S protein structures of SARS-CoV-2. More specifically, the S protein structures of SARS-CoV-2 in the PDB are analyzed and compared not only with each other but also with the protein structures of other viruses through structural alignment, similarity matrix, structure superimposition and phylogeny. Moreover, the S protein structures of SARS-CoV-2 in the PDB are classified into different variants. The associated literature published in the last two and a half years that reports structures for the S protein of SARS-CoV-2 is reviewed to study the therapeutics (antibodies, T-cell receptors and small molecules) used on the structures. We believe that the results obtained by comparative analysis and structure classification could be beneficial to the research community in general, and to crystallographers, biochemists and health workers in particular.
COVID 2023, 3 454

Methods
The DALI (Distance matrix alignment) server [14,15] is used for the comparison and analysis of SARS-CoV-2 S protein structures. DALI is widely used for protein structure comparison and can compare new structures with the existing structures in the PDB.
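To maximize the DALI score, a set of one-to-one correspondences between the protein (sub)structures A and B is optimized:

$$S(A,B) = \sum_{i=1}^{L_{\mathrm{ALI}}} \sum_{j=1}^{L_{\mathrm{ALI}}} \left( \theta - \frac{2\,\left| d^{A}_{ij} - d^{B}_{ij} \right|}{d^{A}_{ij} + d^{B}_{ij}} \right) \exp\left( -\left( \frac{d^{A}_{ij} + d^{B}_{ij}}{2D} \right)^{2} \right) \tag{1}$$

where $L_{\mathrm{ALI}}$ (LALI) represents the number of residue pairs that are aligned, D = 20 Å, θ = 0.2, and $d^{A}_{ij}$ and $d^{B}_{ij}$ represent the intra-molecular Cα-Cα distances in structures A and B, respectively. The expected DALI score (Equation (1)) for a random pairwise comparison increases with the number of amino acid residues in the proteins that are being compared.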
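As an illustration of how such a score behaves, the following minimal Python sketch evaluates a DALI-style score for two Cα traces that are assumed to be already aligned residue by residue. It assumes the commonly cited elastic form of the score (relative deviation of intra-molecular distances, damped by a Gaussian envelope, with θ = 0.2 and D = 20 Å); the function names `ca_distance_matrix` and `dali_score` are hypothetical, and the actual DALI server additionally searches over alignments, which is not shown here.

```python
import math


def ca_distance_matrix(coords):
    """Pairwise intra-molecular distances for a list of (x, y, z) C-alpha coordinates."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def dali_score(coords_a, coords_b, theta=0.2, big_d=20.0):
    """DALI-style elastic score for two traces where coords_a[i] is aligned to coords_b[i].

    Assumption: this is only the scoring step for a fixed alignment,
    not the alignment optimization performed by DALI itself.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("aligned traces must have the same length")
    da = ca_distance_matrix(coords_a)
    db = ca_distance_matrix(coords_b)
    n = len(coords_a)
    score = 0.0
    for i in range(n):
        for j in range(n):
            mean_d = 0.5 * (da[i][j] + db[i][j])
            if mean_d == 0.0:
                # Diagonal (or coincident atoms): zero deviation, full weight.
                score += theta
                continue
            deviation = abs(da[i][j] - db[i][j]) / mean_d
            score += (theta - deviation) * math.exp(-((mean_d / big_d) ** 2))
    return score
```

With this form, pairs whose intra-molecular distances differ by more than 20% contribute negatively, and distant residue pairs are down-weighted by the exponential envelope.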
Z-Score: The statistical significance of a pairwise comparison score is described by the DALI Z-score ($Z_{AB}$):

$$Z_{AB} = \frac{S(A,B) - m(L)}{\sigma(L)} \tag{2}$$

where $S(A,B)$ is the DALI score of the pair and $L = \sqrt{L_A L_B}$ is the (geometric) mean length of A and B. A large set of random pairs of structures is used to empirically derive the relation among m (mean score), L and σ (standard deviation). A polynomial fit gave the following approximation:

$$m(L) = \begin{cases} 7.9494 + 0.70852\,L + 2.5895 \times 10^{-4}\,L^{2} - 1.9156 \times 10^{-6}\,L^{3}, & L \le 400 \\ m(400) + (L - 400), & L > 400 \end{cases} \tag{3}$$

The empirical estimate for σ was σ(L) = 0.5 × m(L). For every possible pair of domains, the Z-score is computed and the highest value is taken as the Z-score for a protein pair. Thus, the Z-score reported by DALI is derived from an optimal similarity score that sums, over equivalent residue pairs, the agreement between the intra-molecular Cα-Cα distances of the two proteins. In other words, the 3D coordinates of proteins are used in the calculation of the Z-score to measure the degree of similarity among proteins. For two proteins, a larger Z-score indicates more similarity; it corresponds to the optimized set of residue equivalences discovered through Monte Carlo optimization over permutations of equivalent structural patterns. A Z-score below 2 is considered a spurious similarity and can be ignored.
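The rescaling from a raw score to a Z-score can be sketched as follows. This is a minimal Python illustration, assuming the polynomial coefficients for m(L) reported in the DALI literature (they should be checked against the original references [14,15] before being relied on) and σ(L) = 0.5 × m(L) as stated above; the function names and the `is_spurious` helper are hypothetical.

```python
import math


def mean_dali_score(length):
    """Empirical mean score m(L) for random pairs (coefficients assumed from the DALI literature)."""
    x = min(length, 400.0)
    m = 7.9494 + 0.70852 * x + 2.5895e-4 * x ** 2 - 1.9156e-6 * x ** 3
    if length > 400.0:
        # Beyond L = 400 the fit is assumed to continue linearly with unit slope.
        m += length - 400.0
    return m


def dali_z_score(raw_score, len_a, len_b):
    """Z-score of a raw DALI score for proteins with len_a and len_b residues."""
    length = math.sqrt(len_a * len_b)  # geometric mean length L
    m = mean_dali_score(length)
    sigma = 0.5 * m                    # sigma(L) = 0.5 * m(L)
    return (raw_score - m) / sigma


def is_spurious(z_score, threshold=2.0):
    """Apply the rule of thumb from the text: Z-scores below 2 are ignored."""
    return z_score < threshold
```

For example, a raw score equal to the mean m(L) gives Z = 0, and a raw score of 1.5 × m(L) gives Z = 1, so both would be discarded by the Z < 2 rule.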
Root Mean Square Deviation (RMSD): RMSD is a numerical measure of the difference between two structures and shows how well they align. For two protein structures A and B, RMSD can be calculated as follows:

$$\mathrm{RMSD}(A,B) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d(a_i, b_i)^{2}} \tag{4}$$

where N is the number of Cα atoms and $d(a_i, b_i)$ is the distance between the positions of atom i in the two molecules.
DALI supports three types of database searches: (1) PDB, (2) PDB25 and (3) AlphaFold (AF). DALI uses two strategies, pairwise and all-against-all, for structure comparison. The S protein structure of SARS-CoV-2 with PDB ID (PID) 6VSB [11] (deposited to the PDB on 10 February 2020) is used as the query structure. 6VSB is a 3.5 Å Cryo-EM structure of the SARS-CoV-2 S protein in the prefusion conformation. The main reason for selecting 6VSB as the query structure is that it is one of the earliest S protein structures deposited in the PDB.
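The RMSD definition translates directly into code. The sketch below is a minimal Python illustration that assumes the two coordinate lists are already superimposed and paired atom by atom; it does not perform the optimal superposition (for example, a Kabsch-style fit) that structure comparison tools apply before reporting RMSD, and the function name `rmsd` is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired (x, y, z) C-alpha coordinates.

    Assumption: coords_a[i] and coords_b[i] are equivalent atoms and the
    structures have already been superimposed.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

If every atom of one structure is displaced by the same vector of length 5 Å relative to the other, every term in the sum is 25 and the RMSD is 5 Å.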
The COVID-19 coronavirus resource on the PDB website (rcsb.org) offers a quick link (www.rcsb.org/news/feature/5e74d55d2d410731e9944f52, accessed on 19 November 2022) to the submitted (1) main proteases, (2) Spike proteins and RBDs and (3) papain-like proteinases of SARS-CoV-2. More than 1000 S protein structures and more than 500 Mpro structures of SARS-CoV-2 are present in the PDB. Each structure in the PDB is linked to the paper that reported the deposited structure. For the S protein structures, the associated literature was collected and read to determine the variant types and to see what kind of antiviral drugs were used on the S protein structures. The S protein structures and the published literature from the PDB were collected in the period from 15 March 2022 to 20 September 2022.
The published literature related to the S protein structures of SARS-CoV-2 was also retrieved from the Web of Science (WoS) Core Collection database. The keywords used for collecting the articles were "Cryo-EM", "SARS-CoV-2", "COVID-19" and "Spike". In total, WoS found 158 articles. After excluding the review articles, 143 articles remained. The temporal distribution of the collected articles is listed in Table 1. We found that half of the S protein structures were deposited to the PDB in 2021, and the other half in 2020 and 2022. The top five most cited papers [6,11,16–18] each determined S protein structures.

Results
The query structure 6VSB is compared with other protein structures in the PDB using DALI with four search subsets: PDB25, PDB50, PDB90 and PDB (or PDBall). DALI uses these subsets because the PDB is highly redundant: the 3D structures of some proteins and their mutants have been determined under various conditions. PDBall contains all structures in the PDB, whereas PDB25, PDB50 and PDB90 are non-redundant subsets of PDB structures. The PDB25 subset contains those PDB structures that are less than 25% identical in sequence. Likewise, the PDB50 and PDB90 searches return PDB structures that are less than 50% and 90% identical in sequence, respectively. The three sequence identity levels are derived from the PDB using CD-Hit [19]. For 6VSB, DALI returned 46, 91, 178 and 2115 similar PDB structures in PDB25, PDB50, PDB90 and PDBall, respectively. Table 2 lists the structures most similar to 6VSB on the basis of the Z-score. The percent identity of aligned amino acids is also listed along with the Z-score. We also used BLAST [20] to retrieve the first 100 structures most similar to the query structure. The S protein structure of SARS-CoV-2 with PDB ID 7LQV has 100% identity, while 6ZPO has 95% identity. Interestingly, all 100 similar structures returned by BLAST are S protein structures of SARS-CoV-2. In the exhaustive PDB25 search results, only one structure (PID: 7E9T, Z-score of 10.8) belongs to the S protein of SARS-CoV-2, and eight structures are S protein structures of other viruses (listed in Table 3). 7E9T is the structure of the SARS-CoV-2 S protein in the post-fusion state, which provided insights into the design of a viral entry inhibitor [21]. The remaining 37 structures belong to other species and proteins/enzymes. Similarly, the PDB50 and PDB90 searches found 91 and 178 similar structures, of which 10 and 22 structures, respectively, belong to the S protein of SARS-CoV-2 and 78 and 108 belong to the other.
The majority of the structures in Tables 2 and 3 belong to the RBD region of S, with some belonging to the NTD region. The structures 7N9E, 7N9T and 7N9C belong to a group of nanobody-bound RBD structures of the S protein in which the nanobodies target epitopes of three different classes [22]. Similarly, 7RA8 is the structure of SARS-CoV-2 S RBD antigenic site II in complex with the human monoclonal antibody (MAb) S2X259 [23]. In Table 3, the structures 6NB3 and 6NB4 are S protein structures for MERS, and 6NB6 and 6NB7 are S protein structures for SARS-CoV, with the neutralizing antibodies (NAbs) LCA60 and S230, respectively, in different states [24]. 7AKJ is the SARS-CoV S protein structure with the NAb 47D11.
In PDB25 and PDB50, the percentage of the similar protein structures that belong to the S protein of SARS-CoV-2 is approximately 2.17% and 3.29%, respectively, whereas in PDB90 the percentage is 27.5%. In the PDBall search, 2115 similar structures are found. Note that similar structures found in PDB25 are present in PDB50, structures found in PDB50 are present in PDB90, and so on. From Table 3, we can see that the S protein structures of SARS-CoV-2 are similar to those of other coronaviruses (SARS-CoV, bat coronavirus and MERS-CoV, particularly) and to those of viruses from other animals, such as pangolin, bat and mouse. Indeed, earlier genome sequencing also shows that SARS-CoV-2 is 79% similar to its predecessor (SARS-CoV-1), 50% similar to MERS-CoV and 96% similar to RaTG13 (bat coronavirus) [1,7]. In PDB50, the S structure of SARS-CoV-2 with PID 7QO9 is for the Omicron variant; it was deposited to the PDB within 1 month of the first emergence of the Omicron variant on 24 November 2021. The RMSD value of 7QO9 against the wild-type S (6VSB) structure calculated in [25] was 2.4, which includes particular modifications to the surfaces and local conformations involved in antibody recognition. DALI's RMSD value for these structures is 2.5. In PDB90, the structures that belong to the various variants are Alpha (7KE8, 7R15, 7KRQ), Beta (7RA8, 7Q9P, 7VXF, 7WEV, 7N9T, 7E7X), Gamma (7SBT), Delta (7TPH, 7TOU), Omicron (7TM0, 7THK, 7WVO, 7WP9, 7TGE, 7T9J, 7TL1, 7QO9) and Kappa (7SBR). The similar protein structures given in Table 3 for PDBall are those that are present in the first 1000 similar structures. For the three search subsets (PDB25, PDB50 and PDB90), the percent identity computed by DALI differs from that computed by BLAST. Overall, BLAST returned higher similarity for structures than DALI. We perform further analysis using the pairwise comparison in DALI to examine the similarity of the S protein structures of SARS-CoV-2 with the protein structures of other viruses.
The 10 most similar structures (7CN8, 6CS2, 6NZK, 6NB4, 6JX7, 5I08, 6CV0, 6M15, 7E9T and 3AFK) are aligned with the query structure in DALI. More than 970 amino acids (AAs) (Figure 2) and secondary structure elements (SSEs) (Figure 3) in the 11 structures were aligned. Among the 10 structures, only 7E9T is an S protein structure of SARS-CoV-2. Eight structures (listed in Table 2) are S protein structures of other viruses: pangolin coronavirus (7CN8), SARS-CoV (6CS2), human coronavirus OC43 (6NZK), MERS-CoV (6NB4), human coronavirus HKU1 (5I08), avian bronchitis coronavirus (6CV0), Rhinolophus bat coronavirus HKU2 (6M15) and Feline infectious peritonitis (FIP) virus (6JX7). 3AFK is the crystal structure of Agrocybe aegerita lectin AAL. The structures at the top of the alignment are the similar structures that share many frequent AAs at various positions. The most frequent AA is Leucine (L), followed by Valine (V), Threonine (T), Alanine (A) and Serine (S). The structures at the bottom are less similar, as they share fewer identical AAs. The pangolin-CoV S structure (7CN8) is the most similar to the query structure, followed by 6CS2 and 6NZK.
In the secondary structure, the two most common elements of a protein are α-helices and β-sheets (which consist of β-strands). Traditionally, the secondary structure of a protein is characterized with three general states: (1) coil (C), (2) helix (H) and (3) strand (E). The Dictionary of Secondary Structure of Proteins (DSSP) [26] offers a finer classification of the secondary structures with eight states: (1) α-helix (H), (2) 3₁₀-helix (G), (3) π-helix (I), (4) β-strand (E), (5) bridge (B), (6) turn (T), (7) bend (S) and (8) others (C). From Figure 3, the most frequent SSE is L (loop, i.e., coil), followed by H and E. Figure 4 also confirms that the structures considered for superimposition with the query structure have more L, followed by H and E.
Figure 2. Structural alignment of AAs in 10 protein structures against the query structure. The most frequent AA type in each column is colored: purple residues (S, T, N and Q) are uncharged, brown residues (R and K) are polar and positively charged, blue residues (D and E) are polar and negatively charged, dark green residues (F, Y, W and H) are aromatic (non-polar except histidine, which is positive), light green residues (A, V, I and L) are non-polar and the yellow residue (C) is a special case. Bold uppercase letters show conserved, structurally equivalent positions with the query structure, and uncolored regions are variable regions.
Next, some protein structures from Tables 2 and 3 are compared against the query structure in the 3D structure view. The superimposition view provides sequence conservation and structure conservation. In the 3D superimposition view, the S protein structures for pangolin (Figure 4a), SARS-CoV-2 Omicron (Figure 4b), SARS-CoV (Figure 4c), human coronavirus OC43 (Figure 4d) and FIP virus (Figure 4e) are superimposed on the query structure (Figure 4f,g). DALI also computes traditional measures such as RMSD and LALI (the number of equivalent residues). In 3D superimposition, the RMSD value shows the average deviation in distance among the aligned Cα atoms. The RMSD value is 1.0 for those sequences that are half (50%) identical. Against the query structure, 6CS2 has the lowest RMSD value (1.9), followed by 7QO9 (2.5), 6NZK (3.4), 7CN8 (3.6) and 6JX7 (5.7). 7CN8, 6NZK and 6CS2 have 965, 876 and 796 aligned Cα traces (a reduced-complexity representation that shows only the α carbons of the AAs), respectively, followed by 6JX7 (558) and 7QO9 (487). 6JX7 has the most equivalent AA residues (1245), followed by 6NZK (1175), 7CN8 (1125), 6CS2 (891) and 7QO9 (588). 7CN8 has the highest percentage of aligned AAs (94%), followed by 6CS2 (78%), 7QO9 (73%), 6NZK (31%) and 6JX7 (29%). Thus, the FIP virus structure is the most dissimilar to the query structure. In their secondary structures, all the proteins have more coil (loop), followed by helix and strand.
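The relationship between the eight DSSP states and the three general states described above can be illustrated with a short Python sketch. The specific reduction used here (H, G and I to helix; E and B to strand; everything else to coil) is one common convention and is an assumption, since the text does not state which mapping was applied; the names `DSSP8_TO_3`, `reduce_dssp` and `sse_frequencies` are hypothetical. Note that the figures use L (loop) for the coil state, which this sketch writes as C.

```python
from collections import Counter

# Assumed reduction of the eight DSSP states to three general states.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10- and pi-helix -> helix
    "E": "E", "B": "E",             # beta-strand and bridge -> strand
    "T": "C", "S": "C", "C": "C",   # turn, bend and others -> coil
}


def reduce_dssp(states):
    """Map a string of 8-state DSSP codes to 3-state codes (unknown codes become coil)."""
    return "".join(DSSP8_TO_3.get(code, "C") for code in states)


def sse_frequencies(states):
    """Count how often each of the three general states occurs."""
    return Counter(reduce_dssp(states))
```

Counting the reduced states per structure is one simple way to check a statement such as "more coil, followed by helix and strand" for a given DSSP assignment.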
The similarity matrix (Figure 5a), based on the Z-scores of 6VSB and 61 other structures, and the dendrogram (Figure 5b) show that 6VSB is closest to the pangolin coronavirus structures (7CN8 and 7BBH) and occupies a dominant position in the fold cluster. In the literature, various studies have suggested that bat and pangolin species are the most likely natural reservoirs of SARS-CoV-2. Moreover, Zhang et al. [27] reported the structures of the pangolin (PCoV_GX, PID: 7CN8) and bat (RaTG13, PID: 7CN4) S proteins. They found that, in the overall structure, the S RBDs of PCoV_GX and RaTG13 are very similar to the RBD of SARS-CoV-2. The structural comparison of PCoV_GX, RaTG13 and SARS-CoV-2, their binding strength with ACE2 and their efficiency in facilitating pseudovirus cell entry suggest the cross-species transmission and evolution of SARS-CoV-2. Note that the bat RaTG13 (PID: 7CN4) structure is not included in the similarity matrix, as DALI did not report it as similar to the query structure in the search subsets (PDB25, PDB50 and PDB90).
Figure 5. [...] (Table 3). Another small cluster, containing PIDs 6N3R, 6L6A, 4XZP, 4HLO, 2WSU, 3WUC, 3AFK and 1UL9, at the right middle mostly contains protein structures for Lectin and S-type Lectins (also known as Galectin) that are sugar-binding proteins of type E. coli. The third cluster (PIDs: 4ASM, 4AWD, 5OCQ, 4WZF, 5NDL, 5GMT, 6KCV, 7C8F, 1MVE, 1KIT and 5MQR) at the top right is for the structures of enzymes that belong to Lyase and Hydrolase. The phylogeny, which shows the evolutionary relationship of the 62 proteins as a dendrogram, is shown in (b). Labels are linked to structural summaries.
Average linkage clustering of the 62 structures based on the DALI Z-score is used to generate the phylogeny. The dendrogram groups the most strongly similar structures quite well. Again, one can see that the 18 structures for the S protein of other viruses form a clade of closely related structures. Similarly, Lectin (Galectin) and Lyase (Hydrolase) form their own clades. Some structures, such as the bottom four (6YSJ, 6YEJ, 7BTX and 4ZEL), which have low Z-score values, form the outgroup, as they are less related to 6VSB. Note that the branching order becomes more or less arbitrary nearer the root.
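To make the clustering step concrete, the following minimal Python sketch performs average linkage clustering on a small symmetric similarity matrix, merging at each step the two clusters with the highest mean pairwise similarity. It is an illustration under stated assumptions, not the DALI server's implementation: the function name `average_linkage` is hypothetical, the matrix is assumed to be symmetric, and any labels and scores passed to it in an example are made up rather than taken from the 62 structures.

```python
def average_linkage(labels, sim):
    """Agglomerative average-linkage clustering on a similarity matrix.

    labels: list of names; sim[i][j]: symmetric similarity (higher = more similar).
    Returns the merges in order as (members_a, members_b, mean_similarity).
    """
    clusters = [[i] for i in range(len(labels))]
    merges = []

    def mean_similarity(c1, c2):
        # Average of all pairwise similarities between the two clusters.
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = mean_similarity(clusters[a], clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((
            sorted(labels[i] for i in clusters[a]),
            sorted(labels[i] for i in clusters[b]),
            s,
        ))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```

The order of the merges corresponds to reading a dendrogram from the leaves towards the root: tightly related groups merge first at high similarity, and the final merges near the root join groups with low mutual similarity, which is why the branching order there is less reliable.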
The scatterplot (Figure 6) shows the relationship of RMSD with Z-score. As the structures are compared against 6VSB, the cluster for the S protein structures is far from the other two on the basis of high Z-scores. For the S protein structures, the outliers (6VV5, 6U7K, 7E9T and 6JX7) occurred because of low Z-scores, while for the other two structure clusters, the outliers (2WSU, 5MQR and 5WRU) occurred because of high RMSD values. Some S protein structures, such as 6U7H, 6B7N and 6NB7, have high RMSD values (≥ 5). Rhinolophus bat coronavirus HKU2 (6M15) is not present, as its Z-score and RMSD values against all structures were 0. The SARS-CoV-2 S protein structure (7E9T) is far from its respective cluster and close to the other two clusters because of its low Z-score. The two hydrolase structures (5MQR and 5WRU) at the top left have high RMSD values and low Z-scores. Using either of the two similarity measures alone, particularly RMSD, cannot provide good clusters and comprehensive results for similarity. The correlation found between structures through the two measures is consistent, with small differences, with the results obtained with the similarity matrix based on Z-score. In fact, the correlation results provide more detailed information, suggesting that it is better to use several similarity metrics than to use each similarity measure alone. Pairs of structures that are very close to each other but belong to different clusters in Figure 6, such as the pair 6VV5 and 6W4Q, the pair 2WSU and 1KIT and the pair 3WUC and 1MVE, have very low sequence identity against the query structure. BLAST returned a percent identity of 33.89% for 6VV5 and no significant sequence similarity for 6W4Q, 2WSU, 1KIT, 3WUC and 1MVE against the query structure.
Figure 6. Relationship of two measures (Z-score and RMSD) for protein structures' similarity. The structures that belong to the three clusters are colored (yellow for S protein structures, green for Lectin (Galectin) and purple for Lyase (Hydrolase)). Structures that do not belong to the three clusters are colored red. Lyase (Hydrolase) and Lectin (Galectin) are very close to each other on the basis of the correlation between RMSD and Z-score.

Structures Classification and Their Description
The S protein structures of SARS-CoV-2 in the PDB belong either to the original Wuhan strains or to different variants. We classified them into different variants (Table 4). Among the variants of concern (VOCs), Omicron has the highest share of S protein structures (28.63%), followed by Beta (24.41%), Delta (15.25%), Alpha (12.44%) and Gamma (7.51%). In the PDB, 7.74% of the S protein structures belong to the variant of interest (VOI) Kappa and 0.5% to Epsilon and Brisdelta. For the S protein, approximately one-third of the published literature in the PDB reported the structures to gain insight into this virus's functionalities and to understand how evolution affects the virus's transmissibility and immune evasion. The majority of this literature (80%) focused on the S protein structure in SARS-CoV-2 variants. The remaining 20% studied the S protein structures in the original SARS-CoV-2 strains. Omicron is now the focus, as it is the only variant remaining on the WHO's current list of VOCs. On the other hand, two-thirds of the published literature in the PDB for the S protein not only derived the structures but also investigated the therapeutic potential of antibodies, T-cell receptors (TCRs) and inhibitors for neutralizing SARS-CoV-2. For antiviral development, the S protein structures from the original strains are discussed most, followed by Beta, Omicron and Delta. The first structures of the SARS-CoV-2 S protein were deposited to the PDB in February 2020 by institutes in China [6,17] and the USA [11,16].

Drug Development
The S protein's vital role in viral infection makes it a target for antibody-blocking therapy, vaccine development and small molecule inhibitors against SARS-CoV-2. Table 5 summarizes the studies for S protein structures in complex with antibodies of SARS-CoV-2 variants. Yang et al. [28] examined the impact of mutations, particularly the N501Y mutation, on the Alpha variant by revealing the structural basis of two NAbs (RBD-chAb15 and RBD-chAb45) that bind to the RBD region of the S protein. Xu et al. [29] isolated nine MAbs from mice and investigated their neutralization potency against four variants. They identified three types of MAbs (S5D2, S5G2 and S3H3). S5D2 efficiently neutralized all variant pseudoviruses, but its potency against Delta decreased significantly. S5G2 exhibited comparable neutralization towards the variants, while S3H3 was able to neutralize all variants nearly equally. Liu et al. [30] extracted 674 MAbs from individuals infected with the Beta variant. Eighteen MAbs targeted the three mutations (E484K, K417N and N501Y) in the Beta RBD. Liang et al. [31] developed an mRNA-LNP vaccine and checked its efficacy against the Beta and Gamma variants as well as the wild type in the mouse. The protein structures were used to find the molecular basis for the potent MAb T6 Fab. Cao et al. [32] reported the immune response of the plasma and NAbs against Beta among individuals vaccinated with the inactivated vaccine (CoronaVac) or the RBD-subunit vaccine (ZF2001) and those infected with SARS-CoV-2. Nearly half of the identified anti-RBD NAbs displayed neutralization reductions, whereas the plasma from convalescents and CoronaVac vaccinees showed comparable neutralization reductions. For ZF2001, an extended interval between the second and third doses produced improved neutralizing activity and more tolerance towards Beta compared to the conventional three-dose treatment. Zhang et al. [33] extracted 165 antibodies from eight SARS-CoV-2-infected patients. They found that potent NAbs commonly have IGHV3-53/3-66 in their heavy chain. NAb P5A-3C8 showed high defensive efficiency in a SARS-CoV-2-infected hamster. However, the K417N mutation in the Beta RBD eliminated these antibodies' neutralizing activity.
Du et al. [34] characterized eight NAbs for the SARS-CoV-2 Beta variant. Five NAbs directly antagonized the binding, whereas the neutralization mechanisms of the remaining three do not depend on the blocking of ACE2. Two NAbs (BD-812 and BD-836) showed high neutralizing potency against Beta. Dejnirattisai et al. [35] reported that, compared to Beta, Gamma is noticeably less resistant to naturally occurring or vaccine-induced antibody responses. This suggested that mutations outside the RBD have an impact on neutralization. MAb 222 neutralized three VOCs. Wang et al. [36] found that the T478K mutation enhanced the interaction of Delta with ACE2. MAb 8D3 was identified as a neutralizing cross-variant antibody. Moreover, the five tested MAbs, which target the RBD, remained effective on Delta. Convalescent and vaccine sera were used by Liu et al. [37] to neutralize two variants (B.1.617.1 and B.1.617.2). It was found that the neutralization of both variants is reduced compared with the original strains. Moreover, the Gamma and Beta variant sera failed to neutralize the Delta variant, which suggests that these variants are antigenically divergent. Chi et al. [38] developed various NAbs for Omicron and other VOCs elicited by the Ad5-nCoV vaccine and examined the human antibody responses after vaccination. The ZWD12 MAb showed potent neutralization against six variants. ZWD12 and ZWC6 provided complete protection in the K18-hACE2 transgenic mouse model. Yin et al. [39] reported how the S protein of the Omicron variant maintains, and even strengthens, its binding to the human ACE2 receptor. The antibody JMB2002 effectively neutralized the four VOCs, but not Delta.
Wang et al. [40] found six groups of NTD-directed NAbs from convalescent individuals and/or individuals vaccinated with the mRNA vaccine. Two NAbs, C1520 and C1791, were able to recognize epitopes on the opposite faces of the NTD. They found that vaccination increases anti-NTD responses against Omicron. Zhou et al. [41] identified RBD-directed NAbs that neutralize the Omicron variant. Sera from individuals who had already received two or three doses of CoronaVac were examined [42] to see whether they could neutralize Omicron. The third dose of CoronaVac significantly boosted the geometric mean neutralizing antibody titer (16.5-fold) for Omicron. Moreover, a subset of MAbs, derived from memory B cells in individuals who had received three vaccine doses, neutralized all VOCs. Li et al. [43] identified two regions in the S RBD of Omicron that are recognized by NAbs. They produced BN03, a bi-specific single-domain antibody that can be delivered to the lung through inhalation. BN03 showed high neutralization efficiency in mice infected with SARS-CoV-2. Guo et al. [44] investigated how Omicron retains strong ACE2 interactions that help it evade certain NAbs, and used one anti-RBD NAb (5105A) to validate how Omicron escapes antibody neutralization. Nutalai et al. [45] investigated three sub-lineages of Omicron. The RBD of the BA.2 lineage has a higher affinity for ACE2 than the RBDs of the BA.1 and BA.1.1 lineages. On the other hand, the BA.2 lineage is more difficult to neutralize than the BA.1 lineage. Some mutations, such as S371F (in the BA.2 lineage) and R346K (in the BA.1.1 lineage), reduced the neutralization effect of the Vir-S309 antibody. The structure and function analysis of potent MAbs obtained from Pfizer-BioNTech vaccinated individuals revealed two core clusters for these MAbs within the RBD. McCallum et al. [46] discussed the properties of the S protein of the Epsilon variant. In the signal peptide, this variant has one mutation (S13I). 
A signal peptide is a short peptide (generally 16-30 amino acids long) located at the N-terminus of a protein that contains information for protein secretion. One mutation (L452R) in the RBD and one mutation (W152C) in the NTD were also detected. These mutations reduced the neutralizing activity of RBD- and NTD-directed MAbs.
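The substitutions named above follow the usual "wild-type residue, position, mutant residue" notation, so they can be assigned to S protein regions by position alone. The following is a minimal sketch of that bookkeeping, not a method used in the reviewed studies. The residue boundaries in `DOMAINS` are approximate values assumed here for illustration, and the helper names (`parse_mutation`, `domain_of`, `group_by_domain`) are hypothetical.

```python
import re

# Approximate S protein region boundaries (inclusive residue numbers).
# These ranges are assumptions for illustration; published boundaries
# differ slightly between annotation sources.
DOMAINS = {
    "signal peptide": (1, 13),
    "NTD": (14, 305),
    "RBD": (319, 541),
}

# A substitution label such as "N501Y": wild-type residue, position, mutant residue.
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation(label):
    """Split a substitution label into (wild_type, position, mutant)."""
    match = MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def domain_of(label):
    """Return the name of the region containing the mutated position."""
    _, position, _ = parse_mutation(label)
    for name, (start, end) in DOMAINS.items():
        if start <= position <= end:
            return name
    return "other"


def group_by_domain(labels):
    """Group substitution labels by the region they fall in."""
    groups = {}
    for label in labels:
        groups.setdefault(domain_of(label), []).append(label)
    return groups


# Mutations mentioned in the text for the Epsilon variant and the Beta RBD.
epsilon = ["S13I", "W152C", "L452R"]
beta_rbd = ["K417N", "E484K", "N501Y"]

print(group_by_domain(epsilon))
print(group_by_domain(beta_rbd))
```

With the assumed boundaries, S13I falls in the signal peptide, W152C in the NTD and L452R in the RBD, which matches the assignment given in the text.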
Besides B cells, T cells are also critical against SARS-CoV-2 as they form a long-term memory response. Two studies [47,48] provided structural information for T-cell receptors (TCRs). Nguyen et al. [47] determined three structures (PIDs: 7M8U, 7M8T and 7M8S) of S-derived peptides presented by three frequently expressed HLA (human leukocyte antigen) molecules. Wu et al. [48] resolved the structures of two TCRs, YLQ7 (PID: 7N1D) and RLQ3 (PID: 7N1C), from patients infected with SARS-CoV-2, in complex with HLA-A2 and two epitopes of the S protein, YLQ (PID: 7N1A) and RLQ (PID: 7N1B). These TCRs elicited universal CD8+ T-cell responses in HLA-A2 patients. Public TCRs were found in various unrelated individuals, whereas private TCRs were distinct among individuals.

Small Protein Inhibitors
Small and stable proteins that can block S binding to ACE2 were designed by Cao et al. [49]. The best mini-proteins bound with high affinity and prevented SARS-CoV-2 infection of mammalian Vero E6 cells. Unlike antibodies, the mini-proteins do not require expression in mammalian cells. Because of their small size and good stability, these proteins can be formulated for direct delivery to the nasal (or respiratory) system. Delivering high concentrations of viral inhibitors into the respiratory system through the nose generally provides not only prophylactic protection but also a therapeutic benefit in treating early infection. Furthermore, a structural study [50] of inhibitors showed that the NTD region of the SARS-CoV-2 S binds biliverdin and bilirubin with nanomolar affinity. In particular, biliverdin significantly reduced the reactivity of SARS-CoV-2 S. The study reported the structure of the SARS-CoV-2 S protein NTD with biliverdin (PID: 7B62). Potential drug targets include all of the enzymes involved in coronavirus replication. However, antiviral studies of small-molecule inhibitors focus on enzymes such as the proteases PLpro and Mpro and on RdRP (RNA-dependent RNA polymerase) [3,4,51-58].

Discussion
The S protein structures of SARS-CoV-2 in PDB are analyzed for similarity using DALI for various parameters. Generally, almost all proteins are structurally similar to other proteins. These similarities in protein structures arise from the principles of chemistry and physics. Finding the evolutionary relationships in protein structures can reveal surprising similarities. The comparative analysis of 3D S protein structures of SARS-CoV-2, obtained using cryo-EM or X-ray crystallography, reveals biologically interesting similarities that are hard to detect by comparing sequences alone. Protein structures are compared using sequence alignment, 3D structure superimposition, similarity matrices and dendrograms. We believe that the obtained results may help in inferring the functional properties of SARS-CoV-2 S proteins. The correlation results for the two similarity measures (Z-score and RMSD) showed that using both measures provided better results than using only one measure, i.e., RMSD. The earliest S protein structures of SARS-CoV-2 (PIDs: 6VSB, 6M17, 6VXX, 6VYB and 6M0J) were deposited to PDB in February 2020 [6,11,16,17]. These structures provided not only the basis but also important details for the development of therapeutics that target the interaction and binding of the S protein with the ACE2 receptor. The journals Science, Cell, Nature Communications, Cell Reports and Nature published the largest portion of studies on the S protein structures of SARS-CoV-2. Institutes from the USA (Harvard Medical School, Washington University and NIAID) published the highest number of papers, followed by China (CAS and Tsinghua University). Approximately 50% of the SARS-CoV-2-related studies were published in 2021.
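The relationship between the two similarity measures can be made concrete with a correlation coefficient. The sketch below computes a Pearson correlation between a list of Z-scores and a list of RMSD values. It is an illustration only: the text does not state which correlation statistic was used, so Pearson's r is an assumption, and the numbers in `z_scores` and `rmsds` are hypothetical placeholders, not results from this study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five structure pairs (NOT data from this study).
# A higher Z-score and a lower RMSD (in angstroms) both indicate a closer match.
z_scores = [45.2, 38.7, 30.1, 22.4, 15.8]
rmsds = [0.6, 1.1, 1.9, 2.8, 3.5]

r = pearson(z_scores, rmsds)
print(f"Pearson r between Z-score and RMSD: {r:.3f}")
```

Because a higher Z-score and a lower RMSD both indicate greater similarity, a negative coefficient is the expected sign when the two measures agree. A weak correlation would indicate that the measures capture different aspects of similarity, which is one reason to report both.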
Besides comparing the S structures, we also classified them into different variants and found that the majority of the studies focused on the S of SARS-CoV-2 variants. More than 50% of variant structures belong to Omicron and Beta. The majority of the structures belong to the RBD region of S. To date, almost all of the identified neutralizing antibodies/nanobodies for SARS-CoV-2 target one of two main regions of the S protein (RBD or NTD), which are the two main neutralization targets for antibodies. In particular, all MAbs that have been approved or are undergoing clinical testing are aimed at the RBD region of SARS-CoV-2 and target a number of overlapping and non-overlapping epitopes [29].
Generally, the neutralization potency of RBD-targeted MAbs correlates with their ACE2-blocking efficiency. On the other hand, the binding epitopes of NTD-directed neutralizing MAbs overlap highly, which results in the formation of an antigenic supersite. Other than NTD- and RBD-directed MAbs, neutralizing MAbs that bind to the S2 region of S were also reported. However, the potency of S2-directed MAbs is very low. The studies that provide structural information on NAbs obtained from COVID-19 patients and bound to the SARS-CoV-2 RBD, NTD or spike trimer offer a highly thorough picture of the B-cell response to the SARS-CoV-2 virus. For SARS-CoV-2, the NAbs may be short-lived. Moreover, they are not elicited in all patients. T cells also play an important role in combating this virus and form a long-term memory response. T cells tend to recognize those parts of a virus that do not mutate rapidly. However, no comprehensive structural information is available on SARS-CoV-2 TCRs bound to their peptide-MHC (pMHC) targets [49]. To date, the main focus in immunity surveillance and vaccine development has been on the role of NAbs, while less attention has been given to understanding the role of T cells and non-NAbs that can offer protection through several mechanisms, such as Ab-dependent cellular cytotoxicity and opsonization.
Since late 2019, the SARS-CoV-2 virus has evolved considerably. All variants, particularly VOCs, contain multiple mutations in the S protein. Some changes were also found in the RBD, the NTD and the furin cleavage site between the S1 and S2 regions. The variants of SARS-CoV-2 exhibit diverse mutation patterns in their S protein. These mutations not only improved the fitness of the virus but also made it more infectious [59] and enhanced its ability to evade the immune response. Because of this, SARS-CoV-2 shows increased resistance to antibody therapies and vaccines that were raised against the original strain [36,60] and against earlier VOCs. So far, the Omicron variant has accumulated the most mutations: it has more than 30 mutations in its S, 15 of which occur in the RBD. Omicron can nearly ablate the neutralizing effect of most FDA-approved antibody drugs, such as AZD1061, REGN10987, REGN10933, LY-CoV016 and LY-CoV555 [61]. The Omicron variant also reduces the effectiveness of approved vaccines, such as mRNA and inactivated virus vaccines [62]. Moreover, the WHO Solidarity clinical trials [63] indicate that drugs such as remdesivir, hydroxychloroquine and lopinavir have minimal to no impact on hospitalized COVID-19 patients, as determined by overall mortality, initiation of ventilation and length of hospital stay.
Thus, it is critical to continue the search for drugs and broadly neutralizing antibodies and to define their epitopes, both for the development of long-term and broad-spectrum antiviral therapies and to guide the development of an effective vaccine for SARS-CoV-2. Several drug administration agencies have authorized and advised a third booster dose for all adults to battle the current resurgence of the epidemic, since three doses can neutralize Omicron with a 40-fold drop in viral titer, whereas two doses are less effective [64]. We believe that effective vaccine development, drug repurposing and other treatment therapies that can offer long-term immunity against Omicron and all previous (and future) variants will be active areas of future research. This is also evident from the deposition of more S protein structures in PDB that belong to Omicron. Moreover, recent works have focused on finding antibodies for the Omicron S protein of SARS-CoV-2. However, more research is needed into the therapeutic implications of natural and vaccine-induced immunity in connection with the defense against infection and life-threatening illness. More contributions are also required on the role of T cells in the host immune response (T-cell immunity). This may allow early, comprehensive and durable protection from SARS-CoV-2, particularly from VOCs and from variants that emerge in the future.

Conclusions
Understanding the S protein structures of SARS-CoV-2 is critical for understanding this virus and its main functionalities and for developing effective drugs that block the interaction and binding of the S protein with the ACE2 receptor. In this work, we compared and analyzed the S protein structures of SARS-CoV-2 that were deposited in the PDB. Moreover, the studies that deposited the S protein structures of SARS-CoV-2 in PDB were reviewed to classify the structures into variants and drug types. We found that most of the work used NAbs, particularly MAbs, to neutralize S protein-mediated entry into the host cell. The SARS-CoV-2 mortality rate is low, but the virus is highly transmissible, and new variants of concern, particularly the current VOC Omicron, make this virus more resistant to neutralizing antibodies that are already authorized and in use. VOCs carrying mutations in the binding domains of S, such as the RBD and NTD, reduce the effectiveness of many NAbs and vaccines by evading neutralization. Thus, it is imperative to find and develop broadly neutralizing antibodies that can offer long-term and broad-spectrum antiviral therapies against VOCs, are less susceptible to resistance and can guide the design of an effective vaccine for SARS-CoV-2. Moreover, T-cell immunity and its importance in controlling SARS-CoV-2 may have been underestimated thus far and need more attention from academia.
For future work, one extension is to include the S protein structures of SARS-CoV-2 deposited to EMDB and to comparatively analyze the protein structures of the main protease (Mpro) of SARS-CoV-2. Moreover, performing a bibliometric analysis of the literature related to the SARS-CoV-2 S protein structures is also an interesting direction. The literature can be retrieved from online scholarly databases such as Web of Science (WoS) and Scopus, and software tools such as VOSviewer can be used for the bibliometric analysis. We are also interested in checking the effectiveness of known drug ligands on the S protein structures of SARS-CoV-2 variants using molecular docking and molecular dynamics simulations. This will allow us to investigate whether the performance of ligands is the same across different variants and their lineages.