Accelerating the Screening of Small Peptide Ligands by Combining Peptide-Protein Docking and Machine Learning
Abstract
1. Introduction
2. Results and Discussion
2.1. Molecular Docking Results
2.2. Machine Learning Algorithm Selection
2.3. Evaluation of LightGBM Models
2.3.1. Group Selection and Hyperparameter Tuning
2.3.2. Performance of the Models
2.3.3. Importance of Choosing the Right Features
2.4. LightGBM Versus Molecular Docking Results
2.5. Case Study
3. Materials and Methods
3.1. Molecular Docking
3.2. Case Study
3.3. Datasets and Feature Extraction
- The datasets containing the tetrapeptide sequences and the molecular docking scores were combined with the peptides’ properties.
- A binary target variable (0 or 1) was added to distinguish the ‘better performers’ group from the ‘worse performers’ group. The size of these groups varied depending on the stage of the process: ‘better performers’ sizes from 1% to 40%, with the corresponding ‘worse performers’ sizes from 60% to 99%, were evaluated.
- The datasets were divided into train and test sets. Train set sizes from 1% to 10% of the data were evaluated.
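The labelling and splitting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the scores are made-up values, and a lower docking score is assumed to mean a better binder, which this section does not state.

```python
import random


def label_better_performers(scores, better_fraction):
    """Return a 0/1 label per score; 1 marks the best-scoring fraction.

    Assumption: a lower (more negative) docking score means better binding.
    """
    n_better = max(1, round(len(scores) * better_fraction))
    # Indices sorted from best (lowest) to worst (highest) score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    better = set(ranked[:n_better])
    return [1 if i in better else 0 for i in range(len(scores))]


def split_train_test(n_rows, train_fraction, seed=0):
    """Randomly split row indices into a small train set and a large test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = max(1, round(n_rows * train_fraction))
    return indices[:n_train], indices[n_train:]


# Hypothetical docking scores for ten peptides (not data from the study).
scores = [-9.1, -4.2, -7.5, -3.3, -8.8, -5.0, -6.1, -2.9, -7.9, -4.7]
labels = label_better_performers(scores, better_fraction=0.2)
train_idx, test_idx = split_train_test(len(scores), train_fraction=0.1)
print(labels)
print(len(train_idx), len(test_idx))
```

With a 20% ‘better performers’ group, the two lowest scores receive label 1. The 10% train fraction leaves one row for training and nine for testing in this toy example.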
3.4. Algorithm Selection
3.5. Light Gradient Boosting Machine
3.6. Hyperparameter Tuning
- num_leaves: integer values from 8 to 31
- max_depth: integer values from 1 to 10
- learning_rate: continuous values from 0.001 to 0.9
- scale_pos_weight: integer values from 1 to 50
- min_data_in_leaf: integer values from 5 to 90
- feature_fraction: continuous values from 0.1 to 1
- bagging_freq: continuous values from 0.1 to 1
- pos_bagging_fraction: continuous values from 0.1 to 0.9
- neg_bagging_fraction: continuous values from 0.1 to 0.9
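The ranges listed above can be written down as a search space and sampled. The sketch below uses plain random search as a simple stand-in. It is not claimed to be the tuning procedure used in the study, and the names `SEARCH_SPACE` and `sample_candidate` are hypothetical.

```python
import random

# Search space as listed in the text: (type, lower bound, upper bound).
SEARCH_SPACE = {
    "num_leaves": ("int", 8, 31),
    "max_depth": ("int", 1, 10),
    "learning_rate": ("float", 0.001, 0.9),
    "scale_pos_weight": ("int", 1, 50),
    "min_data_in_leaf": ("int", 5, 90),
    "feature_fraction": ("float", 0.1, 1.0),
    # Listed as continuous in the text; LightGBM documents bagging_freq as an
    # integer, so this entry is copied as reported, not verified.
    "bagging_freq": ("float", 0.1, 1.0),
    "pos_bagging_fraction": ("float", 0.1, 0.9),
    "neg_bagging_fraction": ("float", 0.1, 0.9),
}


def sample_candidate(space, rng):
    """Draw one candidate uniformly from the box (plain random search)."""
    candidate = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            candidate[name] = rng.randint(low, high)  # bounds are inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(42)
candidates = [sample_candidate(SEARCH_SPACE, rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```

Each sampled dictionary could then be passed to a model-training routine and scored on a validation metric such as the F1-score. That step is omitted here because it depends on the data and the library version.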
3.7. Metric Calculation
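- Accuracy: $\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
- Sensitivity (TPR): $\mathrm{TPR} = \dfrac{TP}{TP + FN}$
- Specificity: $\mathrm{Specificity} = \dfrac{TN}{TN + FP}$
- Precision (PPV): $\mathrm{PPV} = \dfrac{TP}{TP + FP}$

Here $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively.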
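The four metrics listed above, together with the F1-score reported in the result tables, follow from the confusion-matrix counts by their standard definitions. A minimal sketch, with a hypothetical function name and made-up counts, assuming every denominator is non-zero:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    Assumes each denominator is non-zero (at least one positive, one negative,
    and one predicted positive).
    """
    sensitivity = tp / (tp + fn)   # true positive rate (TPR)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value (PPV)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts (not results from the study).
metrics = classification_metrics(tp=60, fp=40, tn=260, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced groups, accuracy can stay high while sensitivity and F1 remain modest, which is why more than one metric is worth reporting.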
3.8. Data Analysis and Availability
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest

| Method | Time (min) | F1-Score | Accuracy | 
|---|---|---|---|
| LightGBM | 0.057 | 0.52 | 0.85 | 
| Naive Bayes | 0.874 | 0.56 | 0.77 | 
| RPART | 2.31 | 0.54 | 0.84 | 
| GBM | 20.8 | 0.56 | 0.86 | 
| NNET | 27.7 | 0.52 | 0.85 | 
| KNN | 311 | 0.55 | 0.84 | 
| RF | 326 | 0.52 | 0.83 | 
| SVM | 1690 | 0.53 | 0.86 |

| “Better Performers” Size (%) | Training Size (%) | F1-Score |
|---|---|---|
| 1% | 1% | 0.10 | 
| 1% | 5% | 0.13 | 
| 1% | 10% | 0.16 | 
| 10% | 1% | 0.46 | 
| 10% | 5% | 0.43 | 
| 10% | 10% | 0.44 | 
| 20% | 1% | 0.58 | 
| 20% | 5% | 0.58 | 
| 20% | 10% | 0.61 | 
| 30% | 1% | 0.67 | 
| 30% | 5% | 0.68 | 
| 30% | 10% | 0.67 | 
| 40% | 1% | 0.74 | 
| 40% | 5% | 0.75 | 
| 40% | 10% | 0.74 |

| Metric | CHIKV mean | CHIKV σ | DENV mean | DENV σ | WNV mean | WNV σ | ZIKV mean | ZIKV σ |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.85 | 0.01 | 0.83 | 0.01 | 0.82 | 0.01 | 0.85 | 0.01 |
| Sensitivity | 0.76 | 0.02 | 0.66 | 0.02 | 0.67 | 0.03 | 0.73 | 0.08 |
| Specificity | 0.87 | 0.01 | 0.87 | 0.01 | 0.86 | 0.02 | 0.88 | 0.02 |
| F1-score | 0.67 | 0.01 | 0.61 | 0.004 | 0.61 | 0.003 | 0.66 | 0.07 |

| Metric | CHIKV (AD) mean | CHIKV (AD) σ | DENV (AD) mean | DENV (AD) σ | WNV (AD) mean | WNV (AD) σ | ZIKV (AD) mean | ZIKV (AD) σ |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.81 | 0.01 | 0.82 | 0.01 | 0.83 | 0.01 | 0.84 | 0.01 |
| Sensitivity | 0.64 | 0.03 | 0.64 | 0.03 | 0.66 | 0.03 | 0.72 | 0.03 |
| Specificity | 0.86 | 0.02 | 0.86 | 0.02 | 0.87 | 0.01 | 0.87 | 0.01 |
| F1-score | 0.58 | 0.004 | 0.58 | 0.004 | 0.60 | 0.004 | 0.65 | 0.004 |

| Dataset | OpenEye Feature | OpenEye Gain | AutoDockFR Feature | AutoDockFR Gain |
|---|---|---|---|---|
| CHIKV | Molecular Weight | 27% | ProtFP2 | 29% |
| | VHSE | 26% | T-scales | 28% |
| | ProtFP2 | 10% | Molecular Weight | 5% |
| | Kidera Factors | 10% | Z-scales | 5% |
| | T-scales | 2% | Hydrophobicity (Wolfenden) | 5% |
| DENV | ProtFP2 | 28% | T-scales | 53% |
| | Cruciani (3) | 14% | Cruciani (1) | 8% |
| | Molecular Weight | 10% | VHSE | 6% |
| | ProtFP3 | 6% | Fasgai Vectors (6) | 4% |
| | Cruciani (1) | 4% | Kidera Factors | 3% |
| WNV | Molecular Weight | 45% | T-scales | 36% |
| | PP3 | 10% | VHSE | 13% |
| | Z-scales | 8% | ProtFP2 | 10% |
| | Fasgai Vectors | 7% | Fasgai Vectors (5) | 9% |
| | Kidera Factors | 6% | Cruciani (1) | 5% |
| ZIKV | Molecular Weight | 60% | T-scales | 28% |
| | Cruciani (3) | 7% | Fasgai Vectors (6) | 19% |
| | T-scales | 6% | ProtFP2 | 7% |
| | VHSE | 4% | Fasgai Vectors (5) | 6% |
| | Kidera Factors | 2% | Charge (EMBOSS) | 5% |

| Peptides Selected by ML | Concurrence (OpenEye) | Concurrence (AutoDockFR) | Time Reduction Factor |
|---|---|---|---|
| 50,000 | 100% | 100% | ×3.2 |
| 32,000 | 99% | 98% | ×5 |
| 16,000 | 95% | 90% | ×10 |
| 8000 | 85% | 81% | ×20 |
| 4000 | 69% | 67% | ×40 |
| 2000 | 50% | 51% | ×80 |
| 1000 | 33% | 38% | ×160 |
| 500 | 19% | 27% | ×320 |
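The time reduction factors in the table above are consistent with a simple ratio: the size of the full peptide library divided by the number of peptides passed on to docking. The sketch below assumes the full library is all 20^4 = 160,000 tetrapeptides built from the 20 proteinogenic amino acids. That library size is an assumption not stated in this excerpt, and the function name is hypothetical.

```python
# Assumption: the full library is every tetrapeptide over 20 amino acids.
LIBRARY_SIZE = 20 ** 4  # 160,000 sequences


def time_reduction_factor(n_selected, library_size=LIBRARY_SIZE):
    """Ratio of docking runs for the full library to runs for the ML-selected subset."""
    return library_size / n_selected


selected_counts = (50_000, 32_000, 16_000, 8_000, 4_000, 2_000, 1_000, 500)
factors = {n: time_reduction_factor(n) for n in selected_counts}
for n, factor in factors.items():
    print(f"{n:>6} peptides docked -> x{factor:g}")
```

Under this assumption the ratio reproduces every factor in the table, from ×3.2 for 50,000 peptides to ×320 for 500 peptides. It ignores the time spent on docking the training set and on fitting the model.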
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Codina, J.-R.; Mascini, M.; Dikici, E.; Deo, S.K.; Daunert, S. Accelerating the Screening of Small Peptide Ligands by Combining Peptide-Protein Docking and Machine Learning. Int. J. Mol. Sci. 2023, 24, 12144. https://doi.org/10.3390/ijms241512144