Ensemble Machine Learning Model to Predict SARS-CoV-2 T-Cell Epitopes as Potential Vaccine Targets
Abstract
1. Introduction
1.1. Related Work
1.2. Motivations
- There are numerous drawbacks to using whole-organism vaccines, particularly in immunocompromised patients [45,46]. Epitope-based peptide vaccines can overcome the issues associated with multicomponent and heterogeneous vaccines. They can act as powerful alternatives to conventional vaccines because they are inexpensive to produce and tend to be less reactogenic and allergenic.
- The majority of the existing methods described in Section 1.1 use ANNs [21], and a few use only SVMs. However, ANNs are hardware dependent: depending on their structure, they demand parallel processing power [47]. Moreover, instead of relying on the predictions of a single classifier, we can combine the predictions of several strong classifiers using an ensemble approach. Ensemble models generally achieve higher accuracy than their individual members and are considered more robust [48,49].
- Furthermore, the majority of the methods described in Section 1.1 estimate peptide binding capacity. They therefore do not directly predict whether a particular peptide is a SARS-CoV-2 epitope. One method, CTLpred [42], predicts epitopes directly, but it is limited to 9-mer peptide sequences. A direct approach to T-cell epitope prediction is therefore proposed here, which resolves the first problem. The proposed ensemble model can also predict epitopes of variable length (length > 9-mers), which resolves the second problem associated with the existing methods.
- Because SARS-CoV-2 is circulating widely in the community, it has more opportunities to mutate further. The recently discovered delta variant (B.1.617.2) is causing widespread problems [50]. Delta appears to be approximately 60% more transmissible than alpha (B.1.1.7) [6,7,9]. Existing vaccines may prove somewhat less effective against new variants. To protect against these variants, either the composition of existing vaccines has to be modified or a new vaccine has to be developed [10]. Because time is the critical factor, an epitope-based peptide vaccine can be a strong alternative, given its low cost, shorter production time, safety, and potential for increased immunogenicity and cross-reactivity.
1.3. Contributions
- To develop an ensemble machine learning (ML) model for SARS-CoV-2 T-cell epitope prediction. The predicted epitopes of SARS-CoV-2 would act as potential vaccine candidates against this pathogen.
- The main focus is on accuracy, which is considered an essential criterion for epitope prediction. Moreover, other metrics such as AUC, sensitivity, precision, Gini, specificity, and F-score have been used for model evaluation.
- To carry out a comparative analysis of the proposed ensemble model against existing prediction models, namely support vector machine, random forest, neural network, decision tree, and AdaBoost.
- To compare the proposed ensemble model with existing benchmark techniques using a blind dataset.
- To assess the effectiveness of the proposed ensemble classification model using repeated 5-fold cross-validation.
- To our knowledge, this is the first study to propose an ensemble ML model to predict T-cell epitopes of SARS-CoV-2 virus as potential vaccine targets for designing an epitope-based peptide vaccine.
2. Materials and Methods
2.1. Retrieval of SARS-CoV-2 Peptide Sequences
2.2. Proposed Methodology
2.2.1. Data Cleansing and Feature Extraction
2.2.2. Feature Selection
2.2.3. Class Imbalance Handling
2.2.4. Preparing Blind Dataset for Comparative Analysis
2.2.5. Model Building Using Voting Ensemble
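A voting ensemble combines the class labels produced by several classifiers into one prediction per peptide. The following is a minimal sketch of hard (majority) voting, not the authors' implementation: the function name `majority_vote`, the tie-breaking rule, and the example label lists are all hypothetical, and the source does not state here whether hard or soft voting was used.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists by hard (majority) voting.

    predictions[i][j] is the label (0 or 1) that classifier i assigns to
    peptide j. Ties go to the smallest label; that rule is an assumption.
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must label the same samples")

    combined = []
    for votes in zip(*predictions):  # votes for one peptide across classifiers
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical outputs of three base classifiers on four peptides.
clf_a = [1, 0, 1, 1]
clf_b = [1, 0, 0, 1]
clf_c = [0, 0, 1, 1]
print(majority_vote([clf_a, clf_b, clf_c]))  # [1, 0, 1, 1]
```

With an odd number of binary classifiers no ties occur, so the tie-breaking rule only matters for an even number of voters.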
2.2.6. Random Forest as Base Classifier
2.2.7. Predictions by the Proposed Ensemble Model
3. Model Evaluation
- True positive (TP): The actual class is positive and the model correctly classifies it as positive.
- True negative (TN): The actual class is negative and the model correctly classifies it as negative.
- False positive (FP): The actual class is negative and the model incorrectly classifies it as positive.
- False negative (FN): The actual class is positive and the model incorrectly classifies it as negative.
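The four counts above are sufficient to derive accuracy, sensitivity, specificity, precision, and F-score. The sketch below uses the standard textbook definitions of these metrics; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study's results.

```python
from typing import Dict


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Derive common binary-classification metrics from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall / true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "f_score": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and sensitivity
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=45, tn=40, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```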
4. Results
4.1. Result Analysis of Comparing the Proposed Model with Existing Prediction Models
4.2. Result Analysis of Repeated K-Fold Cross Validation
4.3. Result Analysis of Comparing the Proposed Model with Two Benchmark Techniques Using Blind Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Huang, C.; Wang, Y.; Li, X.; Ren, L.; Zhao, J.; Hu, Y.; Zhang, L.; Fan, G.; Xu, J.; Gu, X.; et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020, 395, 497–506.
- Chakraborty, C.; Lee, S.-S.; Sharma, A.R.; Bhattacharya, M.; Sharma, G. The 2019 novel coronavirus disease (COVID-19) pandemic: A zoonotic prospective. Asian Pac. J. Trop. Med. 2020, 13, 242.
- Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: Classifying 2019-nCoV and naming it SARS-CoV-2. Nat. Microbiol. 2020, 5, 536–544.
- COVID Live Update: 225,488,491 Cases and 4,644,376 Deaths from the Coronavirus—Worldometer. Available online: https://www.worldometers.info/coronavirus/ (accessed on 13 September 2021).
- Cov-Lineages. Available online: https://cov-lineages.org/global_report_B.1.617.2.html (accessed on 7 August 2021).
- Callaway, E. Delta coronavirus variant: Scientists brace for impact. Nature 2021, 595, 17–18.
- CDC. Coronavirus Disease 2019 (COVID-19); Department of Health and Human Services, CDC: Atlanta, GA, USA, 2020. Available online: https://www.cdc.gov/coronavirus/2019-ncov/index.html (accessed on 9 September 2021).
- CDC. SARS-CoV-2 Variant Classifications and Definitions. 2021. Available online: https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html (accessed on 26 June 2021).
- Li, B.; Deng, A.; Li, K.; Hu, Y.; Li, Z.; Xiong, Q.; Liu, Z.; Guo, Q.; Zou, L.; Zhang, H.; et al. Viral infection and transmission in a large, well-traced outbreak caused by the SARS-CoV-2 Delta variant. MedRxiv 2021.
- The Effects of Virus Variants on COVID-19 Vaccines. Available online: https://www.who.int/news-room/feature-stories/detail/the-effects-of-virus-variants-on-covid-19-vaccines (accessed on 7 August 2021).
- Su, S.; Wong, G.; Shi, W.; Liu, J.; Lai, A.C.; Zhou, J.; Liu, W.; Bi, Y.; Gao, G.F. Epidemiology, Genetic Recombination, and Pathogenesis of Coronaviruses. Trends Microbiol. 2016, 24, 490–502.
- Khailany, R.A.; Safdar, M.; Ozaslan, M. Genomic characterization of a novel SARS-CoV-2. Gene Rep. 2020, 19, 100682.
- Lineburg, K.E.; Grant, E.J.; Swaminathan, S.; Chatzileontiadou, D.S.; Szeto, C.; Sloane, H.; Panikkar, A.; Raju, J.; Crooks, P.; Rehan, S.; et al. CD8+ T cells specific for an immunodominant SARS-CoV-2 nucleocapsid epitope cross-react with selective seasonal coronaviruses. Immunity 2021, 54, 1055–1065.e5.
- Zhang, X.; Tan, Y.; Ling, Y.; Lu, G.; Liu, F.; Yi, Z.; Jia, X.; Wu, M.; Shi, B.; Xu, S.; et al. Viral and host factors related to the clinical outcome of COVID-19. Nature 2020, 583, 437–440.
- Schmidt, M.E.; Varga, S.M. The CD8 T Cell Response to Respiratory Virus Infections. Front. Immunol. 2018, 9, 678.
- Ng, O.-W.; Chia, A.; Tan, A.T.; Jadi, R.S.; Leong, H.N.; Bertoletti, A.; Tan, Y.-J. Memory T cell responses targeting the SARS coronavirus persist up to 11 years post-infection. Vaccine 2016, 34, 2008–2014.
- Channappanavar, R.; Perlman, S. Pathogenic human coronavirus infections: Causes and consequences of cytokine storm and immunopathology. Semin. Immunopathol. 2017, 39, 529–539.
- Huber, S.E.; Beek, J.E.; de Jonge, J.; Eluytjes, W.; Baarle, D.E. T Cell Responses to Viral Infections – Opportunities for Peptide Vaccination. Front. Immunol. 2014, 5, 171.
- Seder, R.A.; Darrah, P.A.; Roederer, M. T-cell quality in memory and protection: Implications for vaccine design. Nat. Rev. Immunol. 2008, 8, 247–258.
- Le, T.T.; Cramer, J.P.; Chen, R.; Mayhew, S. Evolution of the COVID-19 vaccine development landscape. Nat. Rev. Drug Discov. 2020, 19, 667–668.
- Sohail, M.S.; Ahmed, S.F.; Quadeer, A.A.; McKay, M.R. In silico T cell epitope identification for SARS-CoV-2: Progress and perspectives. Adv. Drug Deliv. Rev. 2021, 171, 29–47.
- Naz, A.; Shahid, F.; Butt, T.T.; Awan, F.M.; Ali, A.; Malik, A. Designing Multi-Epitope Vaccines to Combat Emerging Coronavirus Disease 2019 (COVID-19) by Employing Immuno-Informatics Approach. Front. Immunol. 2020, 11, 1663.
- Grifoni, A.; Sidney, J.; Zhang, Y.; Scheuermann, R.H.; Peters, B.; Sette, A. A Sequence Homology and Bioinformatic Approach Can Predict Candidate Targets for Immune Responses to SARS-CoV-2. Cell Host Microbe 2020, 27, 671–680.e2.
- Vita, R.; Mahajan, S.; A Overton, J.; Dhanda, S.K.; Martini, S.; Cantrell, J.R.; Wheeler, D.K.; Sette, A.; Peters, B. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2018, 47, D339–D343.
- Baruah, V.; Bose, S. Immunoinformatics-aided identification of T cell and B cell epitopes in the surface glycoprotein of 2019-nCoV. J. Med. Virol. 2020, 92, 495–500.
- Crooke, S.; Ovsyannikova, I.G.; Kennedy, R.B.; Poland, G.A. Immunoinformatic identification of B cell and T cell epitopes in the SARS-CoV-2 proteome. Sci. Rep. 2020, 10, 1–15.
- Dong, R.; Chu, Z.; Yu, F.; Zha, Y. Contriving Multi-Epitope Subunit of Vaccine for COVID-19: Immunoinformatics Approaches. Front. Immunol. 2020, 11, 1784.
- Nielsen, M.; Lundegaard, C.; Worning, P.; Lauemøller, S.L.; Lamberth, K.; Buus, S.; Brunak, S.; Lund, O. Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci. 2003, 12, 1007–1017.
- Hoof, I.; Peters, B.; Sidney, J.; Pedersen, L.E.; Sette, A.; Lund, O.; Buus, S.; Nielsen, M. NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics 2008, 61, 1–13.
- Nielsen, M.; Lundegaard, C.; Blicher, T.; Lamberth, K.; Harndahl, M.; Justesen, S.; Røder, G.; Peters, B.; Sette, A.; Lund, O.; et al. NetMHCpan, a Method for Quantitative Predictions of Peptide Binding to Any HLA-A and -B Locus Protein of Known Sequence. PLoS ONE 2007, 2, e796.
- Stranzl, T.; Larsen, M.V.; Lundegaard, C.; Nielsen, M. NetCTLpan: Pan-specific MHC class I pathway epitope predictions. Immunogenetics 2010, 62, 357–368.
- Abelin, J.; Keskin, D.B.; Sarkizova, S.; Hartigan, C.R.; Zhang, W.; Sidney, J.; Stevens, J.; Lane, W.; Zhang, G.L.; Eisenhaure, T.M.; et al. Mass Spectrometry Profiling of HLA-Associated Peptidomes in Mono-allelic Cells Enables More Accurate Epitope Prediction. Immunity 2017, 46, 315–326.
- O’Donnell, T.J.; Rubinsteyn, A.; Bonsack, M.; Riemer, A.B.; Laserson, U.; Hammerbacher, J. MHCflurry: Open-Source Class I MHC Binding Affinity Prediction. Cell Syst. 2018, 7, 129–132.e4.
- Jensen, K.K.; Andreatta, M.; Marcatili, P.; Buus, S.; Greenbaum, J.A.; Yan, Z.; Sette, A.; Peters, B.; Nielsen, M. Improved methods for predicting peptide binding affinity to MHC class II molecules. Immunology 2018, 154, 394–406.
- Karosiene, E.; Rasmussen, M.; Blicher, T.; Lund, O.; Buus, S.; Nielsen, M. NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ. Immunogenetics 2013, 65, 711–724.
- Reynisson, B.; Alvarez, B.; Paul, S.; Peters, B.; Nielsen, M. NetMHCpan-4.1 and NetMHCIIpan-4.0: Improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020, 48, W449–W454.
- Abelin, J.; Harjanto, D.; Malloy, M.; Suri, P.; Colson, T.; Goulding, S.P.; Creech, A.L.; Serrano, L.R.; Nasir, G.; Nasrullah, Y.; et al. Defining HLA-II Ligand Processing and Binding Rules with Mass Spectrometry Enhances Cancer Epitope Prediction. Immunity 2019, 51, 766–779.e17.
- Chen, B.; Khodadoust, M.S.; Olsson, N.; Wagar, L.; Fast, E.; Liu, C.L.; Muftuoglu, Y.; Sworder, B.; Diehn, M.; Levy, R.; et al. Predicting HLA class II antigen presentation through integrated deep learning. Nat. Biotechnol. 2019, 37, 1332–1343.
- Larsen, M.V.; Lundegaard, C.; Lamberth, K.; Buus, S.; Lund, O.; Nielsen, M. Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinform. 2007, 8, 424.
- Nielsen, M.; Lundegaard, C.; Lund, O.; Keşmir, C. The role of the proteasome in generating cytotoxic T-cell epitopes: Insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 2005, 57, 33–41.
- Dönnes, P.; Elofsson, A. Prediction of MHC class I binding peptides, using SVMHC. BMC Bioinform. 2002, 3, 25.
- Bhasin, M.; Raghava, G.P.S. Prediction of CTL epitopes using QM, SVM and ANN techniques. Vaccine 2004, 22, 3195–3204.
- Meyers, L.M.; Gutiérrez, A.H.; Boyle, C.M.; Terry, F.; McGonnigal, B.G.; Salazar, A.; Princiotta, M.F.; Martin, W.D.; De Groot, A.S.; Moise, L. Highly conserved, non-human-like, and cross-reactive SARS-CoV-2 T cell epitopes for COVID-19 vaccine design and validation. NPJ Vaccines 2021, 6, 71.
- Nathan, A.; Rossin, E.J.; Kaseke, C.; Park, R.J.; Khatri, A.; Koundakjian, D.; Urbach, J.M.; Singh, N.K.; Bashirova, A.; Tano-Menka, R.; et al. Structure-guided T cell vaccine design for SARS-CoV-2 variants and sarbecoviruses. Cell 2021, 184, 4401–4413.e10.
- Roper, R.L.; E Rehm, K. SARS vaccines: Where are we? Expert Rev. Vaccines 2009, 8, 887–898.
- Shang, W.; Yang, Y.; Rao, Y.; Rao, X. The outbreak of SARS-CoV-2 pneumonia calls for viral vaccines. NPJ Vaccines 2020, 5, 1–3.
- Artificial Neural Networks Advantages and Disadvantages. Available online: https://www.linkedin.com/pulse/artificial-neural-networks-advantages-disadvantages-maad-m-mijwel (accessed on 10 July 2021).
- Ensemble Learning to Improve Machine Learning Results|by Vadim Smolyakov|Cube Dev. Available online: https://blog.statsbot.co/ensemble-learning-d1dcd548e936 (accessed on 10 July 2021).
- Why Use Ensemble Learning? Available online: https://machinelearningmastery.com/why-use-ensemble-learning/ (accessed on 10 July 2021).
- Mahase, E. Delta variant: What is happening with transmission, hospital admissions, and restrictions? BMJ 2021, 373, n1513.
- Osorio, D.; Rondón-Villarreal, P.; Torres, R.T.R. Peptides: A Package for Data Mining of Antimicrobial Peptides. R J. 2015, 7, 4–14.
- Hofmann, H.; Hare, E.; GGobi Foundation. Peptider: Evaluation of Diversity in Nucleotide Libraries; R Package Version 0.2.2. 2015. Available online: https://CRAN.R-project.org/package=peptider (accessed on 27 August 2021).
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 27 August 2021).
- Kursa, M.; Rudnicki, W. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13.
- Chawla, N.V.; Bowyer, K.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Jiang, J.; Wang, N.; Chen, P.; Zhang, J.; Wang, B. DrugECs: An Ensemble System with Feature Subspaces for Accurate Drug-Target Interaction Prediction. BioMed. Res. Int. 2017, 2017, 6340316.
- Bukhari, S.N.H.; Jain, A.; Haq, E.; Khder, M.A.; Neware, R.; Bhola, J.; Najafi, M.L. Machine Learning-Based Ensemble Model for Zika Virus T-Cell Epitope Prediction. J. Health Eng. 2021, 2021, 9591670.
- Han, J.; Kamber, M.; Pei, J. Data Mining Concepts and Techniques; Morgan Kaufmann Elsevier: Waltham, MA, USA, 2012.
- Tan, P.N.; Kumar, V.; Steinbach, M. Introduction to Data Mining; Pearson Education: Manesar, India, 2016.
- Hooda, N.; Bawa, S.; Rana, P.S. B2FSE framework for high dimensional imbalanced data: A case study for drug toxicity prediction. Neurocomputing 2018, 276, 31–41.
- Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995; Volume 14, pp. 1137–1145.
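Sr. No | Method Name | Usage |
---|---|---|
01 | NetMHC [28] | To predict HLA class I or CD8+ T-cell epitopes |
02 | NetMHCpan [29,30] | To predict HLA class I or CD8+ T-cell epitopes |
03 | NetCTLpan_1.1 [31] | To predict HLA class I or CD8+ T-cell epitopes |
04 | NetCTLpan_4.0 [28] | To predict HLA class I or CD8+ T-cell epitopes |
05 | HLAthena [32] | To predict HLA class I or CD8+ T-cell epitopes |
06 | MHCflurry [33] | To predict HLA class I or CD8+ T-cell epitopes |
07 | NetMHCII_2.3 [34] | To predict HLA class II or CD4+ T-cell epitopes |
08 | NetMHCIIpan_3.0 [35] | To predict HLA class II or CD4+ T-cell epitopes |
09 | NetMHCIIpan_4.0 [36] | To predict HLA class II or CD4+ T-cell epitopes |
10 | NeonMHC2 [37] | To predict HLA class II or CD4+ T-cell epitopes |
11 | MARIA [38] | To predict HLA class II or CD4+ T-cell epitopes |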
Feature Category | Physicochemical Property | Category Count | Notations Used |
---|---|---|---|
F1 | Aliphatic Index | 1 | F1 |
F2 | Boman Index | 1 | F2 |
F3 | Instability Index (instaIndex) | 1 | F3 |
F4 | Probability of detection | 1 | F4 |
F5 | Cross-covariance index | 1 | F5 |
F6 | Hmoment Index | 2 | F6_1, F6_2 |
F7 | Molecular Weight | 2 | F7_1, F7_2 |
F8 | Peptide Charge for 45 scales | 45 | F8_1 to F8_45 |
F9 | Hydrophobicity at 44 scales | 44 | F9_1 to F9_44 |
F10 | Isoelectric Point for 9 pK scale | 9 | F10_1 to F10_9 |
F11 | Kidera Factors | 10 | F11_1 to F11_10 |
F12 | aaComp | 18 | F12_1 to F12_18 |
F13 | FASGAI vectors | 6 | F13_1 to F13_6 |
F14 | blosumIndices | 10 | F14_1 to F14_10 |
F15 | protFP descriptors | 8 | F15_1 to F15_8 |
F16 | Cruciani properties | 3 | F16_1 to F16_3 |
Peptide Sequence | F1 | F2 | ----- | F16_2 | F16_3 | Class |
---|---|---|---|---|---|---|
AFFGMSRIGMEVTPSGTW | 43.33 | 0.3938 | ----- | −0.302 | 0.082 | 1 |
HLMGWDYPK | 43.33 | 0.9477 | ----- | −0.091 | −0.022 | 1 |
TGTLIVNSVLLFLAF | 175.33 | 1.7473 | ----- | −0.284 | −0.03 | 0 |
SVLLFLAFVVFLLVT | 214 | −3.036 | ----- | −0.156 | −0.15 | 0 |
Rank | Feature | Rank | Feature |
---|---|---|---|
1 | F1 | 11 | F9_38 |
2 | F2 | 12 | F10_2 |
3 | F4 | 13 | F10_7 |
4 | F6_2 | 14 | F11_5 |
5 | F8_5 | 15 | F12_5 |
6 | F8_19 | 16 | F12_7 |
7 | F8_34 | 17 | F13_4 |
8 | F9_4 | 18 | F14_9 |
9 | F9_6 | 19 | F15_3 |
10 | F9_29 | 20 | F15_4 |
Model Name | Tuned Parameters | Method Name | Package Name |
---|---|---|---|
Neural network (NN) | size:10 | nnet | nnet |
Decision tree (DT) | maxsurrogate:0 and usesurrogate:0 | rpart | rpart |
Support vector machine (SVM) | type:svc and kernel: rbfdot | ksvm | kernlab |
Random forest (RF) | ntree:500 and mtry:2 | randomForest | randomForest |
adaBoost (ada) | type: “discrete”, iter:50 and nu:0.5 | ada | ada |
Model | Accuracy (%) | AUC | Gini | Sensitivity | Specificity | F-Score | Precision |
---|---|---|---|---|---|---|---|
Neural Network | 95.66 | 0.981 | 0.980 | 0.959 | 0.971 | 0.910 | 0.929 |
Decision Tree | 94.81 | 0.978 | 0.929 | 0.979 | 0.959 | 0.979 | 0.939 |
SVM | 96.32 | 0.982 | 0.932 | 0.981 | 0.946 | 0.957 | 0.948 |
RandomForest | 97.11 | 0.963 | 0.910 | 0.961 | 0.941 | 0.964 | 0.971 |
adaBoost | 95.87 | 0.989 | 0.978 | 0.961 | 0.957 | 0.959 | 0.976 |
Proposed Model | 98.20 | 0.991 | 0.994 | 0.982 | 0.971 | 0.990 | 0.981 |
Fold | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Mean of (A) (overall accuracy) |
---|---|---|---|---|---|---|
1 | 98.24 | 98.21 | 97.88 | 98.19 | 97.89 | |
2 | 97.91 | 97.87 | 97.65 | 98.01 | 97.89 | |
3 | 98.11 | 98.15 | 98.76 | 97.34 | 97.90 | |
4 | 98.03 | 98.32 | 97.65 | 98.21 | 98.02 | |
5 | 97.71 | 98.30 | 97.96 | 97.89 | 97.62 | |
Mean Acc./iteration (A) | 98.00 | 98.17 | 97.98 | 97.93 | 97.86 | 97.99 |
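The summary row of the repeated 5-fold cross-validation table can be reproduced from the fold-level accuracies. The sketch below transcribes the table values and recomputes the mean accuracy per iteration (A) and the overall accuracy as the mean of those five means; the variable names are illustrative, not the authors'.

```python
from statistics import mean

# Fold-level accuracies (%) transcribed from the table: one list per iteration,
# ordered fold 1 to fold 5.
accuracy_by_iteration = {
    1: [98.24, 97.91, 98.11, 98.03, 97.71],
    2: [98.21, 97.87, 98.15, 98.32, 98.30],
    3: [97.88, 97.65, 98.76, 97.65, 97.96],
    4: [98.19, 98.01, 97.34, 98.21, 97.89],
    5: [97.89, 97.89, 97.90, 98.02, 97.62],
}

# Mean accuracy per iteration, row (A) of the table.
iteration_means = {
    iteration: mean(folds) for iteration, folds in accuracy_by_iteration.items()
}

# Overall accuracy: mean of the five per-iteration means.
overall_accuracy = mean(list(iteration_means.values()))

for iteration, value in iteration_means.items():
    print(f"Iteration {iteration}: {value:.2f}")
print(f"Overall: {overall_accuracy:.2f}")
```

Rounded to two decimals, the recomputed values agree with the table: 98.00, 98.17, 97.98, 97.93, and 97.86 per iteration, and 97.99 overall.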
SARS-CoV-2 Peptide Sequences | Actual Class | Binding Capacity by NetMHC | Predictions by CTLpred | Predictions by the Proposed Model |
---|---|---|---|---|
APAICHD | 1 | 37 | 1 | 1 |
TAPAICHD | 1 | 58 | 1 | 1 |
QLNRALTGIAVEQDK | 1 | 6.2 | - | 1 |
NFSQILPDPSKPSKR | 1 | 3.1 | - | 1 |
DILSRLD | 1 | 65 | 1 | 1 |
TGSNVFQTR | 1 | 45 | 1 | 1 |
HSSGVTREL | 1 | 23 | 1 | 1 |
YICGFIQQK | 1 | 4.2 | 1 | 1 |
VVCTEIDPK | 1 | 8.2 | 1 | 1 |
TIWFLLLSV | 1 | 76 | 1 | 1 |
TIADYNYKL | 1 | 9.8 | 1 | 1 |
SYYSLLMPI | 1 | 65 | 1 | 1 |
SVKGLQPSV | 1 | 12 | 1 | 1 |
SQDLSVVSKT | 1 | 19 | - | 1 |
QLEMELTPV | 1 | 42 | 1 | 1 |
QLEMELTPV | 1 | 7.3 | 1 | 1 |
NYNYRYRLF | 1 | 1.9 | 1 | 1 |
NIADYNYKL | 1 | 44 | 1 | 1 |
LLIIMRTFK | 1 | 71 | 1 | 1 |
KLDGFMGRI | 1 | 6.0 | 1 | 1 |
HTITVEELK | 0 | 4.6 | 0 | 0 |
SVKHVYQL | 0 | 52 | 0 | 0 |
EYHLMSFPQSAPHGV | 0 | 79 | - | 0 |
DIKNLSKSL | 0 | 80 | 0 | 0 |
VWNLDY | 0 | 40 | 0 | 0 |
VTLAILTAL | 0 | 32 | 0 | 0 |
YLNTLTLAV | 0 | 41.2 | 0 | 0 |
EPVLKGVKL | 0 | 5.6 | 0 | 0 |
AAGLEAPFL | 0 | 9.3 | 0 | 0 |
WTAGAAAYY | 0 | 4.4 | 0 | 0 |
YLDGADVTK | 0 | 83 | 0 | 0 |
SQLGGLHLL | 0 | 65 | 0 | 0 |
LVKPSFYVY | 0 | 12 | 0 | 0 |
LPYPDPSRI | 0 | 15.7 | 0 | 0 |
AEWFLAYIL | 0 | 4.4 | 0 | 0 |
VLLSVLQQL | 0 | 11 | 0 | 0 |
SLPSYAAFATA | 0 | 89 | - | 0 |
TLMNVLTLV | 0 | 37 | 0 | 0 |
IPLTTAAKL | 0 | 61 | 0 | 0 |
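In the blind-dataset table, the rows where CTLpred gives no prediction ("-") are the peptides longer than nine residues. The sketch below checks that observation on a subset of rows transcribed from the table; the threshold constant and variable names are illustrative assumptions, and the check describes the table only, not CTLpred's internal behaviour.

```python
# (peptide, CTLpred output) pairs transcribed from a subset of the table;
# "-" means CTLpred returned no prediction for that peptide.
rows = [
    ("APAICHD", "1"),
    ("QLNRALTGIAVEQDK", "-"),
    ("NFSQILPDPSKPSKR", "-"),
    ("TGSNVFQTR", "1"),
    ("SQDLSVVSKT", "-"),
    ("HTITVEELK", "0"),
    ("SVKHVYQL", "0"),
    ("EYHLMSFPQSAPHGV", "-"),
    ("VWNLDY", "0"),
    ("SLPSYAAFATA", "-"),
]

MAX_LENGTH_WITH_PREDICTION = 9  # assumption read off the table, in residues

# Peptides CTLpred left unpredicted, in table order.
unpredicted = [peptide for peptide, output in rows if output == "-"]

# Peptides longer than the assumed length limit, in table order.
too_long = [peptide for peptide, _ in rows if len(peptide) > MAX_LENGTH_WITH_PREDICTION]

print(unpredicted == too_long)
for peptide in too_long:
    print(peptide, len(peptide))
```

The proposed model column has a prediction for every row, including these longer peptides.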
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bukhari, S.N.H.; Jain, A.; Haq, E.; Mehbodniya, A.; Webber, J. Ensemble Machine Learning Model to Predict SARS-CoV-2 T-Cell Epitopes as Potential Vaccine Targets. Diagnostics 2021, 11, 1990. https://doi.org/10.3390/diagnostics11111990