Potential Therapeutic Target and Vaccines for SARS-CoV-2

Coronaviruses have attracted intense scientific interest because of the recent emergence of the deadly SARS-CoV-2. This study aimed to understand the behavior of SARS-CoV-2 through comparative genomic analysis with its closest relative among the seven coronavirus species that infect humans. The genomes of the coronavirus species that infect humans were retrieved from NCBI and then subjected to comparative genomic analysis using different bioinformatics tools. The study revealed that, among these species, SARS-CoV-2 is most similar to SARS-CoV. The core genes were shared by the two genomes, but some genes were found in only one of them, such as ORF8, which is found in SARS-CoV-2. The ORF8 protein of SARS-CoV-2 could be considered a good therapeutic target for stopping viral transmission, as it was predicted to be a transmembrane protein, which is responsible for interspecies transmission. This is supported by the molecular interaction of ORF8 with both the ORF7 protein, which contains a transmembrane domain that is essential for retaining the protein in the Golgi compartment, and the S protein, which facilitates the entry of the coronavirus into host cells. The ORF1ab, ORF1a, ORF8, and S proteins of SARS-CoV-2 could be immunogenic and capable of evoking an immune response, which means that these four proteins could be considered a potential vaccine source. Overall, SARS-CoV-2 is most closely related to SARS-CoV. ORF8 could be considered a potential therapeutic target for stopping viral transmission, and the ORF1ab, ORF1a, ORF8, and S proteins of SARS-CoV-2 could be utilized as a potential vaccine source.


Introduction
The Sustainable Development Goals (SDGs), established in 2015, aim to address the systemic barriers to social, economic, and environmentally sustainable development with a universal application under the premise of an interconnected, growing world [1]. Since the adoption of the SDGs, numerous governments, UN agencies, and regional and international organizations have taken great steps to implement this ambitious global framework [1,2]. However, the emergence of the coronavirus disease 2019 (COVID-19) pandemic has posed a significant challenge to achieving the SDGs, which are intended to be achieved by 2030 [2].
Human coronavirus was first identified in 1960, in both adults and children with respiratory infections [6]. High scientific interest in CoV studies only arose when the severe acute respiratory syndrome coronavirus (SARS-CoV) first appeared in 2002 [7,8]. The global spread of SARS-CoV resulted in approximately 8000 confirmed human cases and 774 deaths (a mortality rate of approximately 9.5 percent) [9,10]. In 2012, the Middle East respiratory syndrome CoV (MERS-CoV) outbreak in Saudi Arabia heightened this interest, owing to its higher mortality rate (approximately 35%) compared to SARS-CoV [11].
SARS-CoV-2, a novel betacoronavirus detected in the city of Wuhan, in the Chinese province of Hubei, has recently been linked to severe respiratory infections in humans. The global spread of SARS-CoV-2, with a high risk of human-to-human transmission, prompted the World Health Organization (WHO) to declare a public health emergency of international concern on 30 January 2020. After that, the virus spread rapidly beyond China, and the WHO declared the coronavirus disease (COVID-19) a pandemic on 11 March 2020 [12]. More than 655 million confirmed COVID-19 cases, with over 6.5 million deaths worldwide, had been reported by 15 December 2022 [13].
Coronavirus genomes are the largest among RNA viruses, ranging from 26 to 32 kilobases in size. These genomes have four major structural proteins: the spike (S), membrane (M), envelope (E), and nucleocapsid (N). The S protein mediates the virus's attachment to host cell surface receptors, resulting in fusion and subsequent viral entry. The M protein defines the shape of the viral envelope and is the most abundant protein [14]. The E protein is the smallest of the major structural proteins and participates in viral assembly and budding. The N protein is the only one that binds to the RNA genome and is involved in viral assembly and budding [15,16]. Coronaviruses have a number of nonstructural and accessory proteins, including Orf1ab, Orf3a, Orf6, Orf7a, Orf10, and Orf8 [17,18]. If their structures are characterized and their mechanisms of action and roles in viral replication are recognized, this will result in an increase in the number of suitable therapeutic targets [15]. Among nonstructural proteins, researchers have paid more attention to the Orf8 protein because it enhances viral replication and affects DNA synthesis and degradation of E proteins [17].
Coronaviridae members implicated in human infection show several similarities regarding genome structure [19]. Therefore, the aim of this study was to understand the behavior of SARS-CoV-2 through comparative genomic analysis with the closest one among the coronavirus species that infect humans [20]. Both the preceding and the following steps were used to compare the sequences and to determine their similarities, differences, and evolutionary distances. The evolutionary trees were constructed using the neighbor-joining, UPGMA, minimum evolution, maximum likelihood, and maximum parsimony methods in MEGA7 (molecular evolutionary genetics analysis) software version 7.0 for larger datasets (https://www.megasoftware.net/, accessed on 25 December 2022) [21]. The bootstrap method was applied with each tree construction method to show the confidence level of each branch in the evolutionary trees. Bootstrap values reflect how many times out of 100 the same branch appeared when the phylogenetic analysis was replicated [22].
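The idea behind a bootstrap value can be illustrated with a minimal Python sketch. It is not the MEGA7 procedure: instead of rebuilding a full tree for every replicate, it resamples alignment columns with replacement and records which sequence is closest to a query under a simple p-distance. The function names, the toy alignment, and the nearest-neighbor simplification are all assumptions made for illustration.

```python
from __future__ import annotations

import random


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap ('-') are ignored.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def bootstrap_nearest(alignment: dict[str, str], query: str,
                      replicates: int = 100, seed: int = 0) -> dict[str, float]:
    """Percentage of replicates in which each taxon is the closest to `query`.

    Hypothetical helper: each replicate resamples alignment columns with
    replacement, which is the core of the bootstrap, but no tree is built.
    """
    rng = random.Random(seed)
    length = len(alignment[query])
    others = [name for name in alignment if name != query]
    wins = {name: 0 for name in others}
    for _ in range(replicates):
        # Draw `length` column indices with replacement.
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(seq[i] for i in cols)
                     for name, seq in alignment.items()}
        closest = min(others,
                      key=lambda n: p_distance(resampled[query], resampled[n]))
        wins[closest] += 1
    return {name: 100.0 * count / replicates for name, count in wins.items()}


if __name__ == "__main__":
    # Toy aligned sequences, not real coronavirus genomes.
    toy = {
        "query": "ACGTACGTACGTACGTACGT",
        "near":  "ACGTACGTACGTACGTACGA",
        "far":   "TGCATGCATGCATGCATGCA",
    }
    print(bootstrap_nearest(toy, "query"))
```

A value close to 100 for one taxon plays the same role as a high bootstrap percentage on a branch: the grouping is recovered in nearly every resampled data set.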

Comparative Genomic Analysis of SARS-CoV-2 with the Most Relevant One
These steps were used to compare SARS-CoV-2 with the closest species (based on phylogenetic analysis). First, GeneCo software was used to analyze multiple genome structures, using GenBank-format files as input (https://bigdata.dongguk.edu/geneCo/#/index/main, accessed on 15 January 2023). Then, nucleotide sequence statistics (general sequence information, counts of atoms, nucleotide frequencies, and comparison elements) were generated using CLC Genomics Workbench 20. Pairwise alignment between the two genomes was performed to explore conservation of synteny in the context of the entire sequences and their annotation, using ACT, the Artemis Comparison Tool (http://sanger-pathogens.github.io/Artemis/ACT/, accessed on 15 January 2023) [23], and BLAST, the Basic Local Alignment Search Tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi, accessed on 15 January 2023).
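As a rough illustration of the kind of nucleotide sequence statistics mentioned above, the following Python sketch computes length, base counts, base frequencies, and GC content for a single sequence. It is not the CLC Genomics Workbench implementation; the function name and the returned field names are hypothetical.

```python
from collections import Counter


def nucleotide_stats(seq: str) -> dict:
    """Basic statistics for a nucleotide sequence (hypothetical helper).

    Frequencies are relative to the full sequence length, so ambiguous
    symbols such as 'N' lower the A/C/G/T frequencies rather than being
    silently dropped.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    length = len(seq)
    base_counts = {base: counts.get(base, 0) for base in "ACGT"}
    frequencies = {base: n / length for base, n in base_counts.items()}
    return {
        "length": length,
        "counts": base_counts,
        "frequencies": frequencies,
        "gc_content": frequencies["G"] + frequencies["C"],
    }


if __name__ == "__main__":
    # Toy sequence, not a real genome.
    print(nucleotide_stats("ATGCGCATTA"))
```

Running the same function on two genomes gives directly comparable numbers, which is the purpose of the statistics step in the comparison.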

Low Similarity Region Analysis
There were three regions of low similarity. Regions 1 and 2 contain similar genes in both genomes, whereas region 3 contains genes that are specific to each genome. Analysis of these regions was therefore divided into two parts: the first for regions 1 and 2, and the second for region 3.
For the similar genes/proteins within the two compared genomes, identity, difference, number of gaps, and evolutionary distance were calculated using CLC Genomics Workbench 20.0 and MEGA version 7. PROFphd software (PredictProtein server) was used to predict secondary protein structures from the primary sequences (https://predictprotein.org/, accessed on 15 January 2023) [24]. Homology modeling of the tertiary structure of the spike proteins was performed using the SWISS-MODEL server [25]. Building a homology model comprises four main steps: (i) identification of structural template(s), (ii) alignment of target sequence and template structure(s), (iii) model-building, and (iv) model quality evaluation. Each model is evaluated with three methods: the quaternary structure quality estimate, QSQE (a score between 0 and 1, where larger is better); the global model quality estimation, GMQE (a score between 0 and 1, where larger numbers indicate higher reliability); and the qualitative model energy analysis (QMEAN), which is a composite estimator based on different geometrical properties and provides both global (for the entire structure) and local (per residue) absolute quality estimates on the basis of one single model. For models with more than 100 residues, the QMEAN score must be greater than −5. The SWISS-MODEL server is available at https://swissmodel.expasy.org/ (accessed on 15 January 2023). After that, the TM-align algorithm (https://zhanglab.ccmb.med.umich.edu/TM-align/, accessed on 15 January 2023), which compares protein structures of unknown equivalence, was used to compare the spike protein structures of SARS-CoV and SARS-CoV-2 [26]. An optimal superposition of the two structures, built on the detected alignment, was returned, together with the TM-score value, which scales the structural similarity. The TM-score has a value between 0 and 1, where 1 indicates a perfect match between two structures.
TM-scores below 0.2 correspond to randomly chosen unrelated proteins, while scores higher than 0.5 generally indicate the same fold in SCOP and CATH, the two most prominent protein structure classification schemes [27].
Furthermore, the antigenicity of all proteins in regions 1 and 2 was predicted for two reasons: first, as a comparative factor, and second, to predict the protective antigens. Antigenicity was predicted using two tools: the commercial CLC Genomics Workbench 20, which displays the results as a plot, and the publicly available VaxiJen version 2.0 (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html, accessed on 20 January 2023), which reports the findings as an overall prediction score. VaxiJen has a threshold for each model (virus, bacteria, parasite, fungal, or tumor); a protein scoring below the threshold is predicted to be a nonantigen, and a protein scoring above it is predicted to be an antigen.
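The TM-score interpretation given above can be made concrete with a small Python sketch. It evaluates the standard TM-score sum for a fixed set of aligned residue distances; it does not perform the superposition search that TM-align carries out, and the function names and the labels returned by `interpret_tm_score` are hypothetical. The 0.2 and 0.5 cut-offs are the ones stated in the text.

```python
def tm_d0(l_target: int) -> float:
    """Length-dependent distance scale d0 used in the TM-score.

    Short chains need special handling that this sketch omits, so it
    simply refuses them.
    """
    if l_target <= 21:
        raise ValueError("sketch only covers chains longer than 21 residues")
    return 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, l_target: int) -> float:
    """TM-score for given distances (in angstroms) between aligned residue pairs.

    `distances` holds one value per aligned pair after superposition;
    unaligned residues contribute nothing, which is why the sum is
    normalized by the target length rather than the alignment length.
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


def interpret_tm_score(score: float) -> str:
    """Apply the thresholds quoted in the text (labels are illustrative)."""
    if score < 0.2:
        return "random-like similarity"
    if score > 0.5:
        return "generally the same fold"
    return "intermediate"


if __name__ == "__main__":
    # Toy distances, not taken from the spike protein models.
    example = [0.5] * 80 + [6.0] * 20
    s = tm_score(example, l_target=100)
    print(round(s, 3), interpret_tm_score(s))
```

Because every term lies between 0 and 1, a perfect superposition of all residues gives exactly 1, and poorly superposed or unaligned residues pull the score toward 0.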
Regarding region 3, which contains genes that are specific to each genome, there is no scope for comparison. As these proteins are hypothetical, they were first compared with the proteins in the Universal Protein Resource (UniProt: https://www.uniprot.org/, accessed on 20 January 2023) using the BLASTp algorithm.
Due to the lack of data within the main databases (NCBI and UniProt), other tools were used to predict the properties, functions, and structures of these proteins. The PredictProtein server was used to predict the proteins' secondary structures. Protein structural features and annotations were predicted using the PSIPRED server (http://bioinf.cs.ucl.ac.uk/psipred/, accessed on 20 January 2023) [28]. Furthermore, the MEMSAT-SVM tool (available within the PSIPRED server) was used to predict transmembrane protein topology [29]; this method is capable of differentiating signal peptides from transmembrane helices. Several algorithms and databases were then used to predict more information about the proteins' functions, including Pfam (http://pfam.xfam.org/, accessed on 20 January 2023) and InterPro.
For the prediction of protein structures in the third region, the SWISS-MODEL server was used because it relies on homology modeling, which is the most accurate method when the target and template have similar sequences. Due to the lack of structural data for these proteins, additional servers based on different methods were used: DMPfold (http://bioinf.cs.ucl.ac.uk/psipred/, accessed on 20 January 2023), I-TASSER (https://zhanglab.ccmb.med.umich.edu/I-TASSER/, accessed on 20 January 2023), and Robetta (https://robetta.bakerlab.org/, accessed on 20 January 2023). Finally, PROSESS (protein structure evaluation suite and server) was used to evaluate and validate the protein structures. PROSESS integrates a variety of previously developed, well-known, and thoroughly tested methods to evaluate both global and residue-specific quality: (i) covalent and geometric quality; (ii) nonbonded/packing quality; (iii) torsion angle quality; (iv) chemical shift quality; and (v) NOE quality. The server is available at http://www.prosess.ca/index.php (accessed on 20 January 2023).
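To give an intuition for transmembrane prediction, the sketch below scans a protein sequence with a sliding-window hydropathy average. This is a classical Kyte-Doolittle scan, swapped in here for illustration only; it is not the SVM-based method used by MEMSAT-SVM and it cannot separate signal peptides from transmembrane helices. The window of 19 residues and the cut-off of 1.6 are the conventional choices for this scale, and the function names and toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list:
    """Mean hydropathy of every window of `window` consecutive residues.

    Returns an empty list when the sequence is shorter than the window.
    Nonstandard residue codes raise KeyError.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def candidate_tm_windows(seq: str, window: int = 19,
                         threshold: float = 1.6) -> list:
    """Start indices (0-based) of windows whose mean hydropathy exceeds the cut-off."""
    return [i for i, avg in enumerate(hydropathy_profile(seq, window))
            if avg > threshold]


if __name__ == "__main__":
    # Toy sequence: charged flanks around a hydrophobic stretch.
    # It is not a real Orf8, Orf8a, or Orf8b sequence.
    toy = "DEKR" * 5 + "L" * 19 + "DEKR" * 5
    print(candidate_tm_windows(toy))
```

Consecutive window starts that pass the cut-off mark a candidate membrane-spanning segment; dedicated topology predictors refine this kind of signal with trained models.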

Whole Genome Analysis
In this study, we endeavored to provide a deep understanding of SARS-CoV-2 through a general genomic comparison with seven coronavirus species infecting humans and a more detailed comparison with the closest one. The analysis was performed at the level of genomes, genes, and proteins. Pairwise alignment and evolutionary distance of the eight species showed that SARS-CoV has the highest identity to, and the lowest distance from, SARS-CoV-2 (Table 1). Genomic evolutionary trees were constructed using five different methods, with 100 bootstrap replicates to provide accurate and confident branching, as shown in Figure 1. All methods showed that SARS-CoV-2 is most similar to SARS-CoV. Our findings support those of Ahmed SF [30] and of Petrosillo N and colleagues [31]. Comparative genomic analysis of SARS-CoV-2 with SARS-CoV revealed that the two genomes have a high similarity; the core genes were shared by both genomes, but some genes were found in only one of them (three low-match regions) (Figures 2 and 3).

Low Similarity Region Analysis
Some differences existed regarding gene location, sequence, and consequently gene structure, for example in the Orf1ab and spike (S) genes. The genes in the third low-match region were Orf8 in SARS-CoV-2 and Sars8a and Sars8b in SARS-CoV (Table 2).
Figure 1. Phylogenetic trees of eight whole genomes of coronavirus species, constructed using MEGA7. The evolutionary history for each of the five bootstrap consensus trees (A-E) was inferred using the neighbor-joining, UPGMA, minimum evolution, maximum likelihood, and maximum parsimony methods, respectively. Each bootstrap consensus tree was inferred from 100 replicates, and the percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (100 replicates) is shown next to the branches.
The variation of genes is partly consistent with the research performed by Shereen MA et al. [32], who reported the presence of the Orf3 protein and the absence of the Orf10 protein in SARS-CoV-2. Most of the nucleotide sequence statistics presented in Table 3 and Figure 4 (length,
Analysis of the less similar genomic regions between the two genomes of interest revealed that regions 1 and 2 showed gene identity between 72 and 80 percent. The identity of their protein products ranges from 76 to 86 percent (Table 4).
Computational proteomics analysis of the nonstructural proteins Orf1ab and Orf1a and of the structural S proteins demonstrated great similarity between the corresponding proteins at the primary, secondary, and tertiary structural levels (Tables 5 and 6, and Figures 5-7). These findings reinforce the hypothesis of similarity between these species and agree with the findings of Ceraolo C and Giorgi FM [33].
Table 5. Physicochemical parameters of homologous proteins in the first and second low-match regions.
Table 6. General information on the nonhomologous genes/proteins found within the third low-match region.
From an immunogenic point of view, the Orf1ab, Orf1a, and S proteins of SARS-CoV-2 could be antigenic and capable of eliciting an immune response, which means that these three proteins could be considered potential vaccine sources. The highest score (0.4787) was for Orf1a. The results of the antigenicity test are shown in Figure 7.

The third region contains genes that are found in only one of the two species, which excludes the possibility of comparison. The genes located in this region are Orf8a (Sars8a) and Orf8b (Sars8b) in SARS-CoV, and Orf8 in SARS-CoV-2 (Table 6). To obtain additional information on the protein products of these genes, they were compared with a universal database of proteins (UniProt) (Table 7). Due to the lack of information in the UniProt database, many additional tools were used. The secondary structure of these proteins was predicted (Figure 8), and the physicochemical parameters were calculated as shown in Table 8.
Annotation of the three proteins predicted that they consist of extracellular, membrane interaction, cytoplasmic, and transmembrane elements (Figures 9-11).
These findings are consistent with the analysis carried out by Park MD [34]. The predicted functions of these proteins, which are set out in Table 9, are consistent with two studies: the first was conducted by Lau SKP et al., who indicated that Orf8 could be essential for interspecies transmission [35], and the second by Keng CT and Tan YJ, who indicated that Orf8a and Orf8b contribute significantly to viral replication and/or in vivo pathogenesis [36,37]. The subcellular locations of these proteins support their predicted roles.
Table 9. Prediction of proteins' function, antigenicity, and subcellular location using various resources.

Databases/Server: Orf8a, Orf8b, Orf8
Function, Pfam database: Nonstructural proteins (8a, 8b, and 8, respectively). This family of proteins is functionally uncharacterized. This protein is found in coronaviruses. Proteins in this family are typically between 39 and 121 amino acids in length. This protein has two conserved sequence motifs: EDPCP and INCQ.
InterPro database: These proteins have two conserved sequence motifs: EDPCP and INCQ. They may modulate viral pathogenicity or replication in favor of human adaptation. ORF8 was suggested as one of the relevant genes in the study of human adaptation to the virus.

Figures 12 and 13 present the interactions of two of the target proteins, and both agree on the following: (i) interaction between Orf8a and Orf8b, (ii) interaction with proteins that have a role in replication, such as Orf1ab [32], and (iii) interaction with proteins that play a role in antiviral signaling and in suppressing innate immunity (Orf9b) [38].
The Orf8b protein also interacts with the Orf7b protein (ns7b), which contains transmembrane domains that are essential for retaining the protein in the Golgi compartment [39], and with the S protein (spike), which facilitates the entry of the coronavirus into host cells [40]. Likewise, Orf8 shows molecular interactions with more than 80 genes, as presented in Figure 13.
These molecular interactions are consistent with the previously predicted functions of the proteins. Finally, the protein models predicted by the Robetta server (Figure 14) showed the highest quality score and full-length coverage, as shown in Table 10.

Conclusions
We conclude that SARS-CoV-2 is the most similar to SARS-CoV among all coronavirus species infecting humans. The core genes were shared by the two genomes, but some genes were found in only one of them, such as ORF8, which is found in SARS-CoV-2 but not in SARS-CoV. The ORF8 protein of SARS-CoV-2 could be considered a good therapeutic target for stopping viral transmission, as it is predicted to be a transmembrane protein, which is responsible for interspecies transmission. The ORF1ab, ORF1a, ORF8, and S proteins of SARS-CoV-2 could be immunogenic and capable of eliciting an immune response, which means that these proteins could be considered potential sources of a vaccine.
The findings of the present study will contribute to the containment of SARS-CoV-2 and may assist other researchers in getting an in-depth understanding and analysis of SARS-CoV-2.