Bioinformatics Prediction of SARS-CoV-2 Epitopes as Vaccine Candidates for the Colombian Population

The coronavirus disease (COVID-19) pandemic, caused by the coronavirus SARS-CoV-2, represents an enormous challenge to global public health, with thousands of infections and deaths in over 200 countries worldwide. The purpose of this study was to identify SARS-CoV-2 epitopes with the potential to interact in silico with the alleles of the human leukocyte antigen class I (HLA I) and class II (HLA II) commonly found in the Colombian population, in order to promote both CD4 and CD8 immune responses against this virus. The generation and evaluation of the peptides in terms of HLA I and HLA II binding, immune response, toxicity and allergenicity were performed by using computer-aided tools, such as NetMHCpan 4.1, NetMHCIIpan 4.0, VaxiJen, ToxinPred and AllerTOP. Furthermore, the interaction between the predicted epitopes and the HLA I and HLA II proteins frequently found in the Colombian population was studied through molecular docking simulations in AutoDock Vina and interaction analysis in LigPlot+. One of the promising peptides proposed in this study is the HLA I epitope YQPYRVVVL, which displayed an estimated coverage of over 82% and 96% for the Colombian and worldwide population, respectively. These findings could be useful for the design of new epitope-based vaccines that include Colombia among their target populations.


Introduction
Coronavirus disease (COVID-19) was declared a global pandemic by the World Health Organization on 11 March 2020. This infection has affected more than 200 countries [1], with over 183 million cases and 3,971,687 deaths worldwide by 6 July 2021 [2]. At this time, the Region of the Americas continues to account for around 50% of all deaths and 40% of all cases worldwide [2], with Colombia being the country with the 11th highest number of cumulative cases and deaths [3]. This disease, caused by the coronavirus SARS-CoV-2, exhibits a wide range of manifestations, from asymptomatic and mild illness (mainly associated with cough, fever, fatigue, sore throat, headache and muscle pain) to pneumonia and acute respiratory distress syndrome [4]. The syndrome is characterized by lung collapse and the requirement of ventilatory assistance and oxygen support, and has been related to multiorgan collapse and hyperinflammatory states in extremely severe cases [5,6]. These hyperinflammatory states are mediated by a cytokine storm, which could be induced by the nucleocapsid protein (N), and to a lesser extent by the spike protein (S) of SARS-CoV-2 [7].
The long-term immunity conferred by the vaccines and their effectiveness against SARS-CoV-2 reinfections are still uncertain [8]. Some authors have suggested that the virus is likely to remain present in the population [9]. Therefore, improving vaccine development and production capacities in several countries and continents is essential, especially in Latin America, which is one of the areas most affected by the COVID-19 pandemic.
SARS-CoV-2 belongs to the beta genus of the Coronaviridae family, a group of single-stranded, positive-sense RNA viruses able to infect humans and animals [10]. The name of

Literature Search of HLAs Frequencies
In order to identify the HLAs (HLA I and HLA II) commonly found in the Colombian population, a literature search was performed on PubMed (https://pubmed.ncbi.nlm.nih.gov/, accessed on 24 February 2021), Web of Science (http://www.webofknowledge.com/, accessed on 24 February 2021), and Science Direct (https://www.sciencedirect.com/, accessed on 24 February 2021). A single query was used to search for articles reporting HLA I frequencies in the Colombian population (Table 1). In contrast, the HLA II genes reported in the IPD-IMGT/HLA database (http://www.ebi.ac.uk/ipd/imgt/hla/, accessed on 7 February 2021) were considered for the identification of HLA II allelic frequencies in Colombia. The name of each gene was used to perform a preliminary search on PubMed with the generic query: "Name of the HLA Class II gene" AND "Colombia" (e.g., "DRA" AND "Colombia"). The genes DRA, DQA2, DPA2, DPB2, DMA, DMB, DOA, DOB, DRB2, DRB6, DRB7, DRB8 and DRB9 did not return any results about their frequency in the Colombian population. Therefore, these were not included in the final queries used to select articles reporting the frequencies of HLA II alleles. The literature search for this type of allele was divided into three queries, as the maximum number of Boolean connectors (AND/OR) allowed in Science Direct was eight (Table 1).
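The connector limit mentioned above can be checked mechanically. The following is a minimal sketch, not part of the original workflow, that counts the AND/OR operators outside quoted terms in the four queries of Table 1; the function name `count_connectors` and the dictionary labels are illustrative assumptions.

```python
import re

# Limit stated in the text for Science Direct queries.
MAX_CONNECTORS = 8


def count_connectors(query: str) -> int:
    """Count AND/OR operators that appear outside quoted search terms."""
    unquoted = re.sub(r'"[^"]*"', "", query)  # drop the quoted terms first
    return len(re.findall(r"\b(?:AND|OR)\b", unquoted))


# Queries as listed in Table 1 (labels are illustrative).
queries = {
    "HLA I": '("MHC Class I" OR "MHC I" OR "HLA Class I" OR "HLA I" OR '
             '"HLA-A" OR "HLA-B" OR "HLA-C") AND "Colombia"',
    "HLA II #1": '("MHC Class II" OR "MHC II" OR "HLA Class II" OR "HLA II") '
                 'AND "Colombia"',
    "HLA II #2": '("DRB1" OR "DRB3" OR "DRB4" OR "DRB5" OR "DQB1" OR "DQA1") '
                 'AND "HLA" AND "Colombia"',
    "HLA II #3": '("DPA1" OR "DPB1") AND "HLA" AND "Colombia"',
}

connector_counts = {name: count_connectors(q) for name, q in queries.items()}
for name, n in connector_counts.items():
    status = "within limit" if n <= MAX_CONNECTORS else "too many connectors"
    print(f"{name}: {n} connectors ({status})")
```

Merging the three HLA II queries into one would need more than eight connectors, which is consistent with the decision to split them.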

Search | Query
HLA Class I | ("MHC Class I" OR "MHC I" OR "HLA Class I" OR "HLA I" OR "HLA-A" OR "HLA-B" OR "HLA-C") AND "Colombia"
HLA Class II | Query #1: ("MHC Class II" OR "MHC II" OR "HLA Class II" OR "HLA II") AND "Colombia"
HLA Class II | Query #2: ("DRB1" OR "DRB3" OR "DRB4" OR "DRB5" OR "DQB1" OR "DQA1") AND "HLA" AND "Colombia"
HLA Class II | Query #3: ("DPA1" OR "DPB1") AND "HLA" AND "Colombia"

Research and review articles, as well as short/brief communications in English and Spanish, were considered. The outcomes for HLA I and HLA II alleles were processed separately. The results obtained in PubMed, Web of Science and Science Direct for the corresponding queries were downloaded in the reference formats BibTeX (Web of Science and Science Direct) and nbib (PubMed). Subsequently, two folders (HLA I and HLA II) were created in the reference manager Mendeley Desktop v 1.19.4, and the reference files were uploaded accordingly. The "Check for duplicates" tool of this software was used to identify and delete redundant articles.
The information contained in the title and abstract was used to perform a first screening and to remove the articles that did not present data about the prevalence of HLA I and HLA II alleles in the Colombian population. In addition, reports of allelic frequencies in groups with associated diseases (e.g., lupus, arthritis, diabetes, hepatitis, autoimmune diseases, multiple sclerosis) were not considered for further analysis.
Additional datasets were retrieved from the Allele Frequency Net Database (AFND) [28] and included in this study. The HLA search option used in this database was "HLA classical allele freq search" with the following parameters: country: Colombia; population standard: gold and silver; sort by: allele (highest to lowest frequency); and level of resolution: two field. The options population, source of the dataset, ethnic origin, region, type of the study, sample year, sample size, and frequencies were configured to show all possible results.
Vaccines 2021, 9, 797
The selected articles from PubMed, Web of Science, Science Direct and AFND were downloaded in PDF format, carefully reviewed, and used to generate an Excel file with the HLA I and HLA II allelic frequencies reported in Colombia. The cumulative frequency of each dataset was calculated through a customized Python script (Python_Script_CumFreq.py, available in the Supplementary Materials), and only articles reporting allelic frequencies whose sum was 100 ± 2% were kept, except for DRB3-DRB5, as these genes are not expected to be present in all individuals. Furthermore, alleles with low resolution (less than two-field resolution) were removed. The number of results obtained from all the searches was last updated on 24 February 2021, and the process was documented through a PRISMA flow diagram. The HLA allelic frequencies obtained through the systematic search were sorted in descending order.
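The two filters described above (a cumulative frequency of 100 ± 2% and at least two-field resolution) can be sketched as follows. This is an illustrative reconstruction, not the authors' Python_Script_CumFreq.py; the function names, the data layout and all frequencies below are made-up placeholders, and the exemption for genes not present in all individuals is left out for brevity.

```python
import math


def cumulative_frequency(freqs):
    """Sum the allele frequencies (in percent) reported by one dataset."""
    return math.fsum(freqs.values())


def passes_sum_filter(freqs, target=100.0, tolerance=2.0):
    """Keep a dataset only if its frequencies add up to target +/- tolerance."""
    return abs(cumulative_frequency(freqs) - target) <= tolerance


def has_two_field_resolution(allele):
    """True for names such as 'A*24:02'; False for low-resolution 'A*24'."""
    _, _, fields = allele.partition("*")
    return len([f for f in fields.split(":") if f]) >= 2


# Hypothetical datasets: allele name -> frequency in percent (invented values).
datasets = {
    "dataset_1": {"A*XX:01": 40.0, "A*XX:02": 35.5, "A*XX:03": 24.0},
    "dataset_2": {"A*XX:01": 60.0, "A*XX:02": 30.0},
}

kept = {name: freqs for name, freqs in datasets.items() if passes_sum_filter(freqs)}
print(sorted(kept))
```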

HLAs Selection for Epitope Prediction
The dataset of HLA I alleles used for epitope prediction included: (1) the HLA-A, HLA-B and HLA-C alleles that presented the top ten highest frequencies in the Colombian general population (Dataset: Colombia-Bogotá); and (2) the HLA-A, HLA-B and HLA-C alleles that exhibited the highest frequency in each of the eleven Colombian Amerindian groups with HLA I frequencies reported in the AFND [28], which coincided with the results of the literature search [29][30][31]. These Amerindian groups were: Arhuaco, Embera, Inga, Kogi, Chimila Norte, Wiwa Norte, Waunana, Wayyu, Zenú, Ticuna Arara, and Ticuna Tarapaca. Weighted allele frequencies (WAFs) were not calculated for HLA I alleles in the Colombian population, as each dataset came from a single article [32]. Unfortunately, no reports about the frequency of HLA I alleles in African Colombians were found.
Due to the extensive amount of data, a Python script was developed to obtain the WAFs by calculating the weighted average of the HLA II allele frequencies for the Colombian population grouped by ethnicity (Mestizos, African Colombians and Colombian Amerindians), with each of the reported alleles expressed in two-field format (Python_Script_WAF.py, available in the Supplementary Materials). These scripts were used to generate an Excel table containing the WAFs and the number of individuals with a specific HLA II allele in each of the three ethnic groups considered, as well as the reported frequencies for HLA I alleles. All the HLA II alleles exhibiting a WAF of more than 5% in at least one of the studied ethnic groups were selected for epitope prediction and in silico evaluation. These were also used to draw a Venn diagram (http://bioinformatics.psb.ugent.be, accessed on 16 July 2021) in order to distinguish the HLA II alleles with high frequencies in several ethnic groups.
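As an illustration of the weighted average described above, the sketch below weights each reported frequency by the sample size of its dataset. The weighting by sample size is an assumption (the text does not state the weights used in Python_Script_WAF.py), and the allele names and numbers are invented placeholders.

```python
def weighted_allele_frequency(reports):
    """Weighted average of one allele's frequency across datasets.

    `reports` is a list of (sample_size, frequency) pairs; weighting by
    sample size is an assumption made for this sketch.
    """
    total_n = sum(n for n, _ in reports)
    if total_n == 0:
        raise ValueError("at least one individual is required")
    return sum(n * f for n, f in reports) / total_n


def select_frequent(wafs, threshold=0.05):
    """Alleles whose WAF exceeds the threshold in at least one ethnic group."""
    return sorted(
        allele
        for allele, by_group in wafs.items()
        if any(value > threshold for value in by_group.values())
    )


# Two hypothetical datasets for the same allele and ethnic group.
example_waf = weighted_allele_frequency([(100, 0.10), (300, 0.02)])

# Hypothetical WAFs per ethnic group (placeholder allele names).
wafs = {
    "DRB1*XX:01": {"Mestizo": 0.08, "African Colombian": 0.03, "Amerindian": 0.12},
    "DRB1*XX:02": {"Mestizo": 0.02, "African Colombian": 0.04, "Amerindian": 0.01},
}
print(example_waf, select_frequent(wafs))
```

With these placeholder numbers, the larger dataset dominates the average, which is the intended effect of weighting.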

T-Cell Epitope Prediction
The prediction of CD4 and CD8 T-cell epitopes was conducted by using the NCBI reference sequences of the structural proteins of SARS-CoV-2 (Table 2), together with the selected HLA I and HLA II alleles commonly found in the Colombian population. HLA I epitope predictions were performed on NetMHCpan 4.1 [33] with the following parameters: peptide length of 8-12 amino acids, threshold for strong binders of 0.5% rank, threshold for weak binders of 2% rank, and inclusion of the theoretical binding affinity (predicted IC50 values). In this server, short peptides (8-12 amino acids) were generated and evaluated to identify candidate epitopes with high affinity for the HLA-A, HLA-B and HLA-C alleles commonly found in the Colombian population. HLA II epitope predictions were based on the DRB1 alleles highly frequent in the Colombian population and were performed on the NetMHCIIpan 4.0 server [33]. The parameters employed were a peptide length of 15 amino acids, a threshold for strong binders of 1% rank, a threshold for weak binders of 5% rank, and inclusion of the binding affinity predictions.
Data analysis was performed in Python 3 through customized scripts. The function of these scripts was to group the epitopes predicted by the NetMHCpan 4.1 [33] and NetMHCIIpan 4.0 [33] servers as strong or weak binders, and to retrieve the names and number of the interacting alleles per peptide. Both NetMHCpan 4.1 [33] and NetMHCIIpan 4.0 [33] report whether each peptide is a strong binder (SB) or a weak binder (WB) for each of the HLAs included in the analysis [33]. These data were retrieved from a column called "bind level" in the result table generated by these servers. The scripts developed in this research used the resulting tables as input files and applied the "groupby()" function to group the data by both "peptide" and "bind level" at the same time. Subsequently, the scripts applied the "count()" function to each group [HLAs with the same peptide and bind level (SB or WB)] to retrieve the number and names of the HLAs interacting with each peptide as SB or WB. The top ten peptides with the highest number of interacting alleles with strong affinity were kept for further analysis (Analysis_HLA_I.py, Analysis_HLA_II.py, and Interactions_Summary.py). These scripts are available in the Supplementary Materials.
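The grouping and counting step can be sketched with the standard library alone. This is a stand-in for the "groupby()" and "count()" calls mentioned in the text, not the authors' scripts. The rank thresholds default to the HLA I values given above; treating the thresholds as inclusive is an assumption, and the peptide and allele names are placeholders.

```python
from collections import defaultdict


def bind_level(rank, strong=0.5, weak=2.0):
    """Classify a %rank value as 'SB', 'WB' or None (non-binder).

    Whether the thresholds are inclusive is an assumption of this sketch.
    """
    if rank <= strong:
        return "SB"
    if rank <= weak:
        return "WB"
    return None


def group_binders(rows, strong=0.5, weak=2.0):
    """Group alleles by (peptide, bind level); rows are (peptide, allele, rank)."""
    groups = defaultdict(list)
    for peptide, allele, rank in rows:
        level = bind_level(rank, strong, weak)
        if level is not None:
            groups[(peptide, level)].append(allele)
    return dict(groups)


def top_strong_binders(groups, n=10):
    """Peptides ranked by the number of alleles they bind strongly."""
    counts = {
        peptide: len(alleles)
        for (peptide, level), alleles in groups.items()
        if level == "SB"
    }
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Placeholder prediction rows (not real NetMHCpan output).
rows = [
    ("PEPTIDEA", "HLA-X1", 0.1),
    ("PEPTIDEA", "HLA-X2", 0.4),
    ("PEPTIDEA", "HLA-X3", 1.5),
    ("PEPTIDEB", "HLA-X1", 0.3),
    ("PEPTIDEB", "HLA-X2", 3.0),
]
groups = group_binders(rows)
print(top_strong_binders(groups))
```

For HLA II predictions, the same functions could be called with `strong=1.0` and `weak=5.0` to match the thresholds stated for NetMHCIIpan 4.0.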
In addition, the coverage of the promising epitopes for the worldwide population was predicted by using the Population Coverage Calculation Tool of IEDB (http://tools.iedb.org/population/, accessed on 31 March 2021), with the following parameters: Class I and Class II combined, and area: world. The information on the MHC-restricted epitopes was completed with the HLAs predicted to interact with these peptides (as strong or weak binders) by NetMHCpan 4.1 [33] and NetMHCIIpan 4.0 [33].
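To give an intuition for what a population coverage estimate represents, the sketch below uses a common textbook approximation: under Hardy-Weinberg proportions, the fraction of individuals carrying at least one binding allele at a locus is 1 - (1 - p)^2, where p is the summed frequency of the binding alleles, and loci are combined assuming independence. This is not the IEDB implementation, whose internals are not described in the text, and all numbers are illustrative.

```python
def locus_coverage(allele_freqs):
    """Fraction of diploid individuals carrying at least one listed allele.

    Assumes Hardy-Weinberg proportions; frequencies are fractions (0-1).
    """
    p = min(sum(allele_freqs), 1.0)
    return 1.0 - (1.0 - p) ** 2


def combined_coverage(per_locus):
    """Combine loci assuming they are independent of each other."""
    uncovered = 1.0
    for freqs in per_locus.values():
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Illustrative frequencies of binding alleles at two hypothetical loci.
example = {"locus_1": [0.5], "locus_2": [0.25, 0.25]}
print(locus_coverage(example["locus_1"]), combined_coverage(example))
```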
Each promising peptide was predicted to interact with several HLAs. Therefore, a Coverage Score (CS) was defined to calculate a single value representing the coverage in the Colombian general population in the case of HLA I alleles, and the coverage per ethnic group in the case of HLA II. Information regarding allelic frequencies for HLA I was very scarce (each dataset came from a single article); therefore, WAFs were not calculated, and the estimated coverage was defined for HLA-A, HLA-B and HLA-C as the cumulative frequency (sum of the frequencies) of the alleles interacting as strong or weak binders with each peptide in the Colombian general population. In contrast, the estimated coverage for HLA II (DRB1) was calculated from the WAFs of the interacting alleles per peptide. The calculated coverage scores were used to select the candidate peptides with the highest estimated coverages. However, the low number of articles reporting HLA frequencies for the Colombian population is a limitation and may affect the accuracy of the estimates.
The customized scripts developed to calculate the estimated coverages used the "groupby()" function to group the data according to the peptide sequence, bind level and type of HLA (HLA-A, HLA-B, HLA-C and HLA-DRB1; other HLA II alleles were not considered as they are not expected to be present in all individuals), and then summed the frequencies of the interacting alleles for HLA I and the WAFs for HLA II.
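A minimal sketch of the coverage score described above follows: for each peptide and HLA locus, the frequencies (or WAFs) of the interacting alleles are summed. It is an illustration rather than the authors' script; strong and weak binders are pooled here for simplicity, and the allele names and frequencies are invented placeholders.

```python
from collections import defaultdict


def locus_of(allele):
    """Return the locus part of an allele name, e.g. 'HLA-A*24:02' -> 'HLA-A'."""
    return allele.split("*", 1)[0]


def coverage_scores(interactions, frequencies):
    """Sum allele frequencies per (peptide, locus).

    `interactions` holds (peptide, allele) pairs predicted as strong or weak
    binders; `frequencies` maps allele names to frequencies or WAFs.
    """
    seen = set()
    scores = defaultdict(float)
    for peptide, allele in interactions:
        if (peptide, allele) in seen:  # count each allele once per peptide
            continue
        seen.add((peptide, allele))
        scores[(peptide, locus_of(allele))] += frequencies.get(allele, 0.0)
    return dict(scores)


# Placeholder frequencies and interactions (not taken from the study).
frequencies = {"HLA-A*XX:01": 0.20, "HLA-A*XX:02": 0.10, "HLA-C*XX:01": 0.30}
interactions = [
    ("PEPTIDEA", "HLA-A*XX:01"),
    ("PEPTIDEA", "HLA-A*XX:02"),
    ("PEPTIDEA", "HLA-A*XX:01"),  # duplicate row, ignored
    ("PEPTIDEA", "HLA-C*XX:01"),
]
scores = coverage_scores(interactions, frequencies)
print(scores)
```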
Promising epitopes were submitted to VaxiJen v. 2.0 [34] to assess their immunogenicity in silico, by selecting viruses as the target organism and the default threshold. In addition, AllerTOP 2.0 [35] was used to predict allergenicity, and ToxinPred [36] was used to evaluate the theoretical toxicity with default parameters: the SVM (Swiss-Prot)-based method, an E-value cut-off for the motif-based method of 10, an SVM threshold of 0, and calculation of the following physicochemical properties: hydrophobicity, charge and molecular weight. In addition, the potential of the promising epitopes to induce the release of IFN gamma was evaluated in silico by using the IFNepitope server (http://crdd.osdd.net/raghava/ifnepitope/, accessed on 31 March 2021).

Peptide-Protein Docking Studies
Theoretical binding affinities were calculated by molecular docking simulations to determine the possible interaction between the promising peptides and the HLAs commonly found in the Colombian population. To do so, a blind docking strategy was used in AutoDock Vina. This software calculates in silico binding affinities and retrieves information regarding the predicted pose and binding pocket of the peptides with the highest (absolute value) affinity scores. The structures of the promising epitopes were previously generated by modelling on the PEP-FOLD 3.0 server [37]. The HLA I and HLA II proteins found to be highly frequent in the Colombian population and with three-dimensional structures available in the Protein Data Bank (PDB) [38] were downloaded in pdb format. The names of the proteins and their PDB identifiers (PDB ID) are available in Table S1. Subsequently, all ions, water molecules and other substructures were removed, and the protein structures were prepared by using the biopolymer structure preparation tool of Sybyl X-2.0 (Tripos, St. Louis, MO, USA) with default settings. The resulting coordinates were optimized in the same software with the following parameters: Powell method, Kollman United and Kollman All Atoms force fields, AMBER charges, dielectric constant of 1.0, nonbonded (NB) cutoff of 8.0, maximum iterations of 100 and termination gradient of 0.001 kcal/mol. Finally, the size and the coordinates of the center of the grid containing the whole protein structure were determined by using a spacing of 0.375 Å, and the resulting structures were saved as pdbqt files in AutoDock Tools (MGL Tools) [39]. These parameters and files were used as input for docking in AutoDock Vina [40], along with the following settings: number of modes of 20, energy range of 1.5, and exhaustiveness of 25. The predicted docking affinity scores were ranked and used to identify the peptide-protein complexes with the highest (absolute value) affinity scores.
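For a blind docking run, the search box has to enclose the whole receptor. The sketch below computes the center and size of such a box from atomic coordinates and writes them, together with the settings stated above, in the key = value layout of an AutoDock Vina configuration file. The helper names, file names and coordinates are hypothetical, and the authors determined their grid in AutoDock Tools rather than with a script like this.

```python
def bounding_box(coords, padding=0.0):
    """Center and size (in angstroms) of a box enclosing all coordinates."""
    xs, ys, zs = zip(*coords)
    center = tuple((max(axis) + min(axis)) / 2 for axis in (xs, ys, zs))
    size = tuple((max(axis) - min(axis)) + 2 * padding for axis in (xs, ys, zs))
    return center, size


def vina_config(receptor, ligand, center, size,
                num_modes=20, energy_range=1.5, exhaustiveness=25):
    """Build the text of a Vina configuration file."""
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    lines += [f"center_{axis} = {value:.3f}" for axis, value in zip("xyz", center)]
    lines += [f"size_{axis} = {value:.3f}" for axis, value in zip("xyz", size)]
    lines += [
        f"num_modes = {num_modes}",
        f"energy_range = {energy_range}",
        f"exhaustiveness = {exhaustiveness}",
    ]
    return "\n".join(lines)


# Made-up receptor coordinates; real ones would be read from the pdbqt file.
coords = [(0.0, 0.0, 0.0), (10.0, 20.0, 30.0), (5.0, 5.0, 5.0)]
center, size = bounding_box(coords)
config_text = vina_config("receptor.pdbqt", "peptide.pdbqt", center, size)
print(config_text)
```

The resulting text could be saved to a file and passed to Vina through its `--config` option.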
In order to better visualize these results, a heatmap with clustering trees was generated with the heatmap.2 function of the statistical program R, version 3.6.3 [41,42].

Interactions Analysis and Molecular Dynamics
The epitopes with the highest (absolute value) affinity scores predicted by AutoDock Vina were submitted to interaction analysis using LigPlot+ [43]. This program was used with default parameters. In addition, a short molecular dynamics (MD) simulation was performed to further study the interaction of the HLA-peptide complex containing the promising epitope obtained from the receptor-binding domain of the S protein of SARS-CoV-2 that presented the highest (absolute value) affinity score in silico. The MD simulation was carried out in Gromacs (version 2020.2) [44], by using the Chemistry at Harvard Macromolecular Mechanics (CHARMM) force field [45]. The peptide-protein complex was solvated by placing it at the center of a cubic box filled with water, 1.0 nm from the boundaries of the complex. After that, ions were added to neutralize the system, followed by an equilibration in the constant particle number, volume, and temperature (NVT) ensemble for 1 ns with a time step of 2 fs and a reference temperature of 300 K. A second equilibration step was carried out for 1 ns by using a constant particle number, pressure, and temperature (NPT) ensemble. The production step of the MD simulation was run for 10 ns under isothermal-isobaric conditions, with a time step of 2 fs, reference temperature of 300 K, pressure of 1 bar, van der Waals cutoff of 1.2 nm, and grid spacing of 0.16 nm, using the leap-frog integrator and the Verlet cutoff scheme. The atomic coordinates were recorded every 10 ps to obtain 1000 different molecular conformations. The same procedure was carried out with the peptide-free protein (HLA) for comparative purposes [46], and the root-mean-square deviations (RMSD) of both systems were measured. In addition, the root-mean-square fluctuations (RMSF) of the residues of HLA-B*08:01 (backbone) were computed using the trajectories of the MD simulation.
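The two trajectory measures named above can be written out explicitly. The sketch below is a plain-Python illustration of the RMSD of a frame against a reference structure and of the per-atom RMSF over a set of frames. It assumes the frames have already been superimposed on the reference (the analysis tools shipped with the MD package normally handle the fitting), and the coordinates are toy values.

```python
import math


def rmsd(frame, reference):
    """Root-mean-square deviation between two conformations of equal length."""
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(frame, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(frame))


def rmsf(frames):
    """Per-atom root-mean-square fluctuation around the mean position."""
    n_frames = len(frames)
    fluctuations = []
    for i in range(len(frames[0])):
        mean = [sum(frame[i][k] for frame in frames) / n_frames for k in range(3)]
        msf = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3)) for frame in frames
        ) / n_frames
        fluctuations.append(math.sqrt(msf))
    return fluctuations


# Number of stored conformations: 10 ns of production sampled every 10 ps.
n_conformations = (10 * 1000) // 10

# Toy trajectory with two atoms and two frames (arbitrary units).
frame_0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frame_1 = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
print(n_conformations, rmsd(frame_1, frame_0), rmsf([frame_0, frame_1]))
```

In this toy example only the first atom moves, so it is the only one with a non-zero RMSF.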

Literature Search of HLAs Frequencies
The literature search for HLA I frequencies in the Colombian population retrieved 486 articles. These were obtained from PubMed, Web of Science and Science Direct by using the following query: ("MHC Class I" OR "MHC I" OR "HLA Class I" OR "HLA I" OR "HLA-A" OR "HLA-B" OR "HLA-C") AND "Colombia" (Table 3). The systematic search for HLA II frequencies in the Colombian population retrieved 1057 articles (Table 4). This search was carried out through the combination of three different queries in PubMed, Web of Science and Science Direct. Furthermore, a total of 12 and 39 datasets reporting HLA I and HLA II frequencies in Colombia, respectively, were retrieved from AFND [28].

HLA I | Results
PubMed | 78
Web of Science | 116
Science Direct | 292

After duplicate removal and manual screening of the titles and abstracts, 15 and 24 articles were assessed for eligibility in the groups of HLA I and HLA II alleles, respectively. Only articles reporting HLA allelic frequencies with two-field resolution and cumulative frequencies of 100 ± 2% were retained. A PRISMA flow diagram showing the data collection process is presented in Figure 1.

HLA I
The total set of HLA I allelic frequencies reported for the Colombian population, including Colombian Amerindian groups, according to the systematic search is presented in Table S2. The HLA I allelic frequencies in the general Colombian population (Dataset: Colombia-Bogotá) obtained from AFND [28] were used to identify the 10 most frequent HLA-A, HLA-B and HLA-C alleles (Table 5). Similarly, the data corresponding to the Native American groups Arhuaco, Embera, Inga, Kogi, Chimila Norte, Wiwa Norte, Waunana, Wayyu, Zenú, Ticuna Arara and Ticuna Tarapaca were used to identify the most frequent HLA-A, HLA-B and HLA-C alleles in each of these populations (Table 6). The sum of the top-10 frequencies for the HLA-A, HLA-B and HLA-C alleles in the general Colombian population (Dataset: Colombia-Bogotá) was 0.736, 0.523 and 0.778, respectively. All Colombian Native American groups showed HLA-A*24:02 as the most frequent HLA-A allele, which is also the most common in the general population (Dataset: Colombia-Bogotá). In contrast, the most frequent HLA-B and HLA-C alleles reported for the Colombian Amerindian groups showed greater variability.

HLA II
The complete set of HLA II allelic frequencies reported for the Colombian population, including Colombian Amerindian groups and African Colombians, is presented in Table S3. Furthermore, the WAFs of HLA II in the Colombian population were calculated as the weighted average of the frequencies obtained from the literature search and AFND [28]. This information, grouped by ethnicity and with the specific alleles in two-field format, is presented in Table S4. Alleles with WAFs > 5% in each ethnic group (Mestizo, African Colombian and Colombian Amerindian) were selected for further analysis (Table 7).

T-Cell Epitope Prediction
T-cell epitopes were generated based on 34 HLA I and 19 HLA II alleles commonly found in the Colombian population and available in the servers used for epitope prediction (Table 8). These were the ten most frequent HLA-A, HLA-B and HLA-C alleles found in the Colombian general population (Dataset: Colombia-Bogotá), the most common HLA-A, HLA-B and HLA-C alleles found in each of the Colombian Amerindian groups with reports of HLA I, and the HLA II alleles reported to be present with high frequency (WAFs > 5%) in the Colombian population.
HLA I epitopes were generated based on the SARS-CoV-2 proteins S (Table S5), N (Table S6), E (Table S7) and M (Table S8). Similarly, HLA II epitopes were generated from each of these viral structural proteins (Tables S9-S12). The promising peptides (those with the highest number of strong interactions with the HLA I and HLA II alleles commonly found in the Colombian population) that exhibited predicted immunogenicity, non-toxicity and non-allergenicity are shown in Table 9.
Table 8. HLA I and HLA II alleles used in this study for T-cell epitope prediction.
According to the immunoinformatics analysis of these peptides (Table 9), only four promising epitopes were predicted to induce the release of IFN gamma by the IFNepitope server (Table 10). The coverage calculation performed on IEDB for these epitopes showed that they are predicted to cover up to 96.62% of the worldwide population. These four promising epitopes are located in the S and N proteins of SARS-CoV-2. The peptides YQPYRVVVL and RAAEIRASANLAATK are located in the receptor-binding domain (RBD) and the central helix (CH) of the S protein, respectively. SPDDQIGYY and QFAPSASAF are located in the N-terminal domain (NTD) and the C-terminal domain (CTD) of the N protein of SARS-CoV-2, respectively.

Peptide-Protein Docking Studies
Confirmatory peptide-protein docking studies were carried out with AutoDock Vina [40,47]. The predicted binding affinity scores (kcal/mol) of the HLAs interacting with the promising peptides that showed immunogenicity, non-toxicity and non-allergenicity in silico are presented in Table S13, and represented as a heatmap with dendrograms in Figure 3. All the studied peptides exhibited high (absolute value) affinity scores with at least one of the evaluated alleles. In addition, three of the promising peptides that were predicted to induce the release of IFN gamma (YQPYRVVVL, QFAPSASAF and SPDDQIGYY) showed a multi-target behavior, interacting with most of the HLAs used for the docking studies. These were used for further analysis along with the other peptide that was predicted to induce the release of IFN gamma in silico (RAAEIRASANLAATK).

Interactions Analysis and Molecular Dynamics
The interaction analysis between the promising peptides predicted to induce the release of IFN gamma and four of the most common HLAs in the Colombian population was carried out by using LigPlot+ [43]. The three-dimensional views of the complexes and the interactions of these peptides with HLA-A*24:02, HLA-B*51:04, HLA-C*04:01, and HLA-DQB1*06:02 are presented in Figures S1-S4, respectively. All the promising epitopes were predicted to interact with the peptide-binding cleft of these HLAs. In addition, the three-dimensional view of the complex formed by the promising epitope with the greatest estimated coverage (YQPYRVVVL) and the protein (HLA-B*08:01) that exhibited the highest (absolute value) affinity score with this peptide (−10.3 kcal/mol) is presented in Figure 4.
The MD simulation (Figure 5) confirmed the peptide-induced conformational change that has been reported for the binding of epitopes to HLA I [48] and HLA II [46] proteins. The average RMSD of the atomic positions between the dynamic and static models was 4.27 Å for the protein-peptide complex and 2.67 Å for the peptide-free protein.
The RMSF analysis (Figure 6) revealed the flexibility of HLA-B*08:01. The binding of the epitope YQPYRVVVL resulted in a similar fluctuation pattern, with notable differences in the RMSF values near the residues ASP30, GLU58-ALA90, GLY104-ARG181, PRO193-GLU198 and ALA211-PRO276, which indicates that the binding of this epitope may influence conformational changes around these amino acids.

Discussion
Immunoinformatics has been used for the prediction of epitopes of SARS-CoV-2 [19,49,50], as T-cells may be crucial to combat this virus, which causes COVID-19 [19]. The state of the art regarding HLA frequencies in Latin America is very limited [21], which is concerning as this is one of the areas most affected by the pandemic. In this article, we performed a systematic review to find the HLA (HLA I and HLA II) allelic frequencies reported for the Colombian population. This expanded the number of organized datasets reporting HLA allelic frequencies for the Colombian population from seven [21] to twelve for HLA I and seventy-one for HLA II.
The design of novel vaccines or treatments against COVID-19 is needed to cover the worldwide demand, especially in developing countries such as Colombia. A computational approach was used to predict SARS-CoV-2 epitopes, as this approach has been shown to speed up the screening of peptide libraries [47]. Here, we report four promising epitopes that presented immunogenicity, non-toxicity, non-allergenicity and the potential to induce the release of IFN gamma in silico. These are YQPYRVVVL and RAAEIRASANLAATK, which are based on the S protein of SARS-CoV-2, as well as QFAPSASAF and SPDDQIGYY, which are based on the N protein of this virus. Both structural proteins, N and S, have been reported to present immunogenic activity.
The promising epitopes based on the S protein of SARS-CoV-2 (YQPYRVVVL and RAAEIRASANLAATK) proposed herein are conserved in the current variants of concern: Alpha (United Kingdom), Beta (South Africa), Gamma (Brazil) and Delta (India); as well as in all variants of interest: Eta (Multiple countries), Iota (United States of America), Kappa (India) and Lambda (Peru) [51]. The prioritization of epitopes like these that are conserved across variants of concern and interest of SARS-CoV-2 is crucial to prevent immune evasion due to viral genomic diversity [52].
The promising epitope YQPYRVVVL has been reported to exhibit high antigenicity against the Beta variant from South Africa (GISAID ID: EPI_ISL_1706561) and another variant from India (GISAID ID: EPI_ISL_1708422) [53]. Furthermore, this peptide exhibited high binding affinity for several HLAs in silico and has been proposed as a candidate epitope for vaccine design [54]. The promising HLA II epitope RAAEIRASANLAATK has been reported to exhibit good coverage in other Latin American countries, including Argentina, Bolivia, Brazil, Chile, Ecuador, Paraguay, Peru and Venezuela [21]. Furthermore, the candidate epitopes based on the N protein of SARS-CoV-2, QFAPSASAF and SPDDQIGYY, have been described as promising epitopes for the development of multi-epitope vaccines [55]. Therefore, the promising peptides described herein are not restricted to the Colombian population, but can also be useful for the development of peptide-based vaccines for several countries.
According to the IEDB population coverage tool, the promising epitopes proposed in this article can exhibit up to 96.62% coverage of the worldwide population. In addition, the estimated coverage of the peptide YQPYRVVVL calculated for the Colombian population based on allelic frequencies indicated that it could cover up to 82.07% of the population through its binding to HLA-C proteins, and present a coverage of 50.85% and 34.39% associated with its interaction with HLA-A and HLA-B, respectively.
The structural analysis carried out with AutoDock Vina [40] and LigPlot+ [43] showed that the promising peptides interacted with the expected binding site of the studied HLAs, the peptide-binding groove [56]. Most of the interactions were hydrophobic, with the presence of some hydrogen bonds. In addition, the contact residues predicted for YQPYRVVVL with HLA-B*08:01 revealed the interaction of this promising epitope with two residues, ASP156 and TYR116, that have been reported as crucial for peptide recognition by HLAs [56].
The MD simulation suggested a conformational change induced by peptide binding in the complex formed by the epitope with the highest coverage and the protein that presented the highest (absolute value) affinity score for it (YQPYRVVVL/HLA-B*08:01). This is in agreement with previous reports for similar systems of HLA proteins [46,48]. In addition, the RMSF pattern obtained for the binding of the promising SARS-CoV-2 epitope YQPYRVVVL to HLA-B*08:01 is similar to that reported for the binding of an Epstein-Barr virus peptide, which presented the same pattern and comparable values [57]. According to the RMSF analysis, the binding of the promising peptide increases the flexibility of the two alpha helices of HLA-B*08:01 (GLU58-ALA90 and ASP137-ARG181), as well as of the regions between the beta strands 2-3 (ASP30), 5-8 (GLY104-ALA136) and 9-10 (PRO193-GLU198), and of a large portion of the α3-domain (ALA211-PRO276).
Accordingly, the promising epitopes presented in this study, which also showed good calculated coverage worldwide, may have an impact on the development of new peptide-based vaccines and diagnostic tests intended to cover the Colombian and Latin American populations. However, further analysis is required, and these peptides are proposed as candidates for in vitro and in vivo tests.

Conclusions
This in silico study presents promising T-cell epitopes based on structural proteins of SARS-CoV-2 and HLAs highly frequent in the Colombian population, some of them with an estimated coverage greater than 80%. These peptides were predicted to be immunogenic, non-allergenic and non-toxic. Therefore, they may be useful in epitope-based vaccine design and diagnostic test development, and are suggested as molecules to be prioritized for further in vitro and in vivo analysis.