Role of Machine and Deep Learning in Predicting Protein Modification Sites: Review and Future Directions
Abstract
1. Introduction
2. Datasets and Data Pre-Processing
2.1. Dataset
2.1.1. UniProt
2.1.2. dbPTM
2.1.3. CPLM 4.0
| Name | Website | PTM Type | Statistics |
|---|---|---|---|
| UniProt [12,13] | https://www.uniprot.org/ | Multiple | 570,420 reviewed proteins, 251,131,639 unreviewed proteins |
| dbPTM [14,15] | https://biomics.lab.nycu.edu.tw/dbPTM/ | Multiple | 2,235,664 sites, 70+ PTM types, 40+ integrated databases, 30+ benchmark datasets |
| PhosphoSitePlus [20] | https://www.phosphosite.org/homeAction | Multiple | 59,469 PTM sites, 13 PTM types |
| CPLM 4.0 [16] | http://cplm.biocuckoo.cn/ | Multiple | 463,156 unique sites of 105,673 proteins for up to 29 PLM types across 219 species |
| qPTM [21] | http://qptm.omicsbio.info/ | Multiple | 11,482,553 quantification events for 660,030 sites on 40,728 proteins under 2596 conditions |
| PupDB [22] | https://cwtung.kmu.edu.tw/pupdb/ | Pupylation | 268 pupylation proteins with 311 known pupylation sites and 1123 candidate pupylation proteins |
| DEPOD [23] | https://depod.bioss.uni-freiburg.de/ | Phosphorylation | 194 phosphatases with substrate data |
| O-GlcNAcAtlas [24] | https://oglcnac.org/atlas/ | O-GlcNAcylation | 16,877 unambiguous sites, 10,058 ambiguous sites |
| Phospho.ELM [25] | http://phospho.elm.eu.org/ | Phosphorylation | 42,914 instances, 11,224 sequences |
| CarbonylDB [26] | https://carbonyldb.missouri.edu/CarbonylDB/index.php/ | Carbonylation | 1495 proteins, 3781 PTM sites, 21 species |
| Scop3P [27] | https://iomics.ugent.be/scop3p/index | Phosphorylation | 108,130 modifications, 20,394 proteins |
| O-GlycBase [28] | https://services.healthtech.dtu.dk/datasets/OglycBase/ | O-Glycosylation | 242 proteins |
| dbSNO [29] | http://140.138.144.145/~dbSNO/index.php | S-nitrosylation | 174 experimentally verified S-nitrosylation sites on 94 S-nitrosylated proteins |
| UbiNet 2.0 [30] | https://awi.cuhk.edu.cn/~ubinet/index.php | Ubiquitination | 3332 experimentally verified ESIs |
| UbiBrowser 2.0 [31] | http://ubibrowser.bio-it.cn/ubibrowser_v3/ | Ubiquitination | 1,884,676 predicted high-confidence ESIs, 8,341,262 potential E3 recognizing motifs, 4068 known ESIs from literature |
| PhosPhAt [32] | https://phosphat.uni-hohenheim.de/ | Phosphorylation | 10,898 phosphoproteins, 64,128 serine sites, 13,102 threonine sites, 2672 tyrosine sites |
2.2. Data Pre-Processing
2.2.1. Sequence Slice
2.2.2. Sequence Redundancy
2.2.3. Selecting Reliable Negative Sequences
2.2.4. Balanced Dataset
- (1) Data-based methods
- (2) Algorithm-based methods
- (3) Hybrid methods
2.2.5. Data Splitting
3. Feature Engineering
3.1. Feature Extraction
3.1.1. Sequence-Based Feature
3.1.2. Physicochemical-Based Feature
3.1.3. Annotation-Based Feature
3.1.4. Deep Learning-Based Feature
3.2. Feature Reduction
4. Classifiers
4.1. Machine Learning
4.2. Deep Learning
5. Measurement
6. Summary of Predictors
7. Challenges and Future Directions
7.1. Data Limitations
7.2. Interpretability
7.3. Multi-PTM and PTM Crosstalk Prediction
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| PTM | Post-translational modification |
| DL | Deep learning |
| ML | Machine learning |
References
- Shrestha, P.; Kandel, J.; Tayara, H.; Chong, K.T. DL-SPhos: Prediction of serine phosphorylation sites using transformer language model. Comput. Biol. Med. 2024, 169, 107925.
- Liang, J.-Z.; Li, D.-H.; Xiao, Y.-C.; Shi, F.-J.; Zhong, T.; Liao, Q.-Y.; Wang, Y.; He, Q.-Y. LAFEM: A Scoring Model to Evaluate Functional Landscape of Lysine Acetylome. Mol. Cell. Proteom. MCP 2024, 23, 100700.
- Chang, K.W.; Gao, D.; Yan, J.D.; Lin, L.Y.; Cui, T.T.; Lu, S.M. Critical Roles of Protein Arginine Methylation in the Central Nervous System. Mol. Neurobiol. 2023, 60, 6060–6091.
- Dai, X.F.; Zhang, T.X.; Hua, D. Ubiquitination and SUMOylation: Protein homeostasis control over cancer. Epigenomics 2022, 14, 43–58.
- Masbuchin, A.N.; Rohman, M.S.; Liu, P.Y. Role of Glycosylation in Vascular Calcification. Int. J. Mol. Sci. 2021, 22, 9829.
- Wohlschlager, T.; Scheffler, K.; Forstenlehner, I.C.; Skala, W.; Senn, S.; Damoc, E.; Holzmann, J.; Huber, C.G. Native mass spectrometry combined with enzymatic dissection unravels glycoform heterogeneity of biopharmaceuticals. Nat. Commun. 2018, 9, 1713.
- Park, H.; Song, W.Y.; Cha, H.; Kim, T.Y. Development of an optimized sample preparation method for quantification of free fatty acids in food using liquid chromatography-mass spectrometry. Sci. Rep. 2021, 11, 5947.
- Slade, D.J.; Subramanian, V.; Fuhrmann, J.; Thompson, P.R. Chemical and Biological Methods to Detect Post-Translational Modifications of Arginine. Biopolymers 2014, 101, 133–143.
- Li, F.Y.; Dong, S.Y.; Leier, A.; Han, M.; Guo, X.D.; Xu, J.; Wang, X.Y.; Pan, S.R.; Jia, C.Z.; Zhang, Y.; et al. Positive-unlabeled learning in bioinformatics and computational biology: A brief review. Brief. Bioinform. 2022, 23, bbab461.
- Qiao, Y.H.; Zhu, X.L.; Gong, H.P. BERT-Kcr: Prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics 2022, 38, 648–654.
- Li, Y.Y.; Liu, Z.; Liu, X.; Zhu, Y.H.; Fang, C.H.; Arif, M.; Qiu, W.R. A Systematic Review of Computational Methods for Protein Post-Translational Modification Site Prediction. Arch. Comput. Methods Eng. 2025, 1–21.
- Lussi, Y.C.; Magrane, M.; Martin, M.J.; Orchard, S. Searching and Navigating UniProt Databases. Curr. Protoc. 2023, 3, e700.
- Bairoch, A.; Bougueleret, L.; Altairac, S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2008, 36, D190–D195.
- Li, Z.Y.; Li, S.F.; Luo, M.Q.; Jhong, J.-H.; Li, W.S.; Yao, L.T.; Pang, Y.X.; Wang, Z.; Wang, R.L.; Ma, R.F.; et al. dbPTM in 2022: An updated database for exploring regulatory networks and functional associations of protein post-translational modifications. Nucleic Acids Res. 2022, 50, D471–D479.
- Lee, T.Y.; Huang, H.D.; Hung, J.H.; Huang, H.Y.; Yang, Y.S.O.; Wang, T.H. dbPTM: An information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34, D622–D627.
- Zhang, W.Z.; Tan, X.D.; Lin, S.F.; Gou, Y.J.; Han, C.; Zhang, C.; Ning, W.S.; Wang, C.W.; Xue, Y. CPLM 4.0: An updated database with rich annotations for protein lysine modifications. Nucleic Acids Res. 2022, 50, D451–D459.
- Liu, Z.X.; Cao, J.; Gao, X.J.; Zhou, Y.H.; Wen, L.P.; Yang, X.J.; Yao, X.B.; Ren, J.A.; Xue, Y. CPLA 1.0: An integrated database of protein lysine acetylation. Nucleic Acids Res. 2011, 39, D1029–D1034.
- Liu, Z.X.; Wang, Y.B.; Gao, T.S.; Pan, Z.C.; Cheng, H.; Yang, Q.; Cheng, Z.Y.; Guo, A.Y.; Ren, J.; Xue, Y. CPLM: A database of protein lysine modifications. Nucleic Acids Res. 2014, 42, D531–D536.
- Xu, H.D.; Zhou, J.Q.; Lin, S.F.; Deng, W.K.; Zhang, Y.; Xue, Y. PLMD: An updated data resource of protein lysine modifications. J. Genet. Genom. 2017, 44, 243–250.
- Hornbeck, P.V.; Kornhauser, J.M.; Tkachev, S.; Zhang, B.; Skrzypek, E.; Murray, B.; Latham, V.; Sullivan, M. PhosphoSitePlus: A comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012, 40, D261–D270.
- Yu, K.; Wang, Y.; Zheng, Y.Q.; Liu, Z.K.; Zhang, Q.F.; Wang, S.Y.; Zhao, Q.; Zhang, X.L.; Li, X.X.; Xu, R.H. qPTM: An updated database for PTM dynamics in human, mouse, rat and yeast. Nucleic Acids Res. 2023, 51, D479–D487.
- Tung, C.W. PupDB: A database of pupylated proteins. BMC Bioinform. 2012, 13, 40.
- Duan, G.Y.; Li, X.; Köhn, M. The human DEPhOsphorylation database DEPOD: A 2015 update. Nucleic Acids Res. 2015, 43, D531–D535.
- Ma, J.F.; Li, Y.X.; Hou, C.Y.; Wu, C. O-GlcNAcAtlas: A database of experimentally identified O-GlcNAc sites and proteins. Glycobiology 2021, 31, 719–723.
- Dinkel, H.; Chica, C.; Via, A.; Gould, C.M.; Jensen, L.J.; Gibson, T.J.; Diella, F. Phospho.ELM: A database of phosphorylation sites-update 2011. Nucleic Acids Res. 2011, 39, D261–D267.
- Rao, R.S.P.; Zhang, N.; Xu, D.; Moller, I.M. CarbonylDB: A curated data-resource of protein carbonylation sites. Bioinformatics 2018, 34, 2518–2520.
- Ramasamy, P.; Turan, D.; Tichshenko, N.; Hulstaert, N.; Vandermarliere, E.; Vranken, W.; Martens, L. Scop3P: A Comprehensive Resource of Human Phosphosites within Their Full Context. J. Proteome Res. 2020, 19, 3478–3486.
- Hansen, J.E.; Lund, O.; Rapacki, K.; Brunak, S. O-GLYCBASE version 2.0: A revised database of O-glycosylated proteins. Nucleic Acids Res. 1997, 25, 278–282.
- Lee, T.Y.; Chen, Y.J.; Lu, C.T.; Ching, W.C.; Teng, Y.C.; Huang, H.D.; Chen, Y.J. dbSNO: A database of cysteine S-nitrosylation. Bioinformatics 2012, 28, 2293–2295.
- Li, Z.Y.; Chen, S.Y.; Jhong, J.H.; Pang, Y.X.; Huang, K.Y.; Li, S.F.; Lee, T.Y. UbiNet 2.0: A verified, classified, annotated and updated database of E3 ubiquitin ligase–substrate interactions. Database J. Biol. Databases Curation 2021, 2021, baab010.
- Wang, X.; Li, Y.; He, M.Q.; Kong, X.R.; Jiang, P.; Liu, X.; Diao, L.H.; Zhang, X.L.; Li, H.L.; Ling, X.P.; et al. UbiBrowser 2.0: A comprehensive resource for proteome-wide known and predicted ubiquitin ligase/deubiquitinase-substrate interactions in eukaryotic species. Nucleic Acids Res. 2022, 50, D719–D728.
- Durek, P.; Schmidt, R.; Heazlewood, J.L.; Jones, A.; MacLean, D.; Nagel, A.; Kersten, B.; Schulze, W.X. PhosPhAt: The Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Res. 2010, 38, D828–D834.
- Lai, F.L.; Gao, F. Auto-Kla: A novel web server to discriminate lysine lactylation sites using automated machine learning. Brief. Bioinform. 2023, 24, bbad070.
- Wei, L.Y.; Xing, P.W.; Shi, G.T.; Ji, Z.L.; Zou, Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE-ACM Trans. Comput. Biol. Bioinform. 2019, 16, 1264–1273.
- Li, Z.T.; Fang, J.Y.; Wang, S.N.; Zhang, L.Y.; Chen, Y.Y.; Pian, C. Adapt-Kcr: A novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture. Brief. Bioinform. 2022, 23, bbac037.
- Sua, J.N.; Lim, S.Y.; Yulius, M.H.; Su, X.T.; Yapp, E.K.Y.; Le, N.Q.K.; Yeh, H.Y.; Chua, M.C.H. Incorporating convolutional neural networks and sequence graph transform for identifying multilabel protein Lysine PTM sites. Chemom. Intell. Lab. Syst. 2020, 206, 104171.
- Lyu, X.R.; Li, S.H.; Jiang, C.Y.; He, N.N.; Chen, Z.; Zou, Y.; Li, L. DeepCSO: A Deep-Learning Network Approach to Predicting Cysteine S-Sulphenylation Sites. Front. Cell Dev. Biol. 2020, 8, 594587.
- Auliah, F.N.; Nilamyani, A.N.; Shoombuatong, W.; Alam, M.A.; Hasan, M.M.; Kurata, H. PUP-Fuse: Prediction of Protein Pupylation Sites by Integrating Multiple Sequence Representations. Int. J. Mol. Sci. 2021, 22, 2120.
- Bao, W.Z.; Yuan, C.A.; Zhang, Y.H.; Han, K.; Nandi, A.K.; Honig, B.; Huang, D.S. Mutli-Features Prediction of Protein Translational Modification Sites. IEEE-ACM Trans. Comput. Biol. Bioinform. 2018, 15, 1453–1460.
- Khalili, E.; Ramazi, S.; Ghanati, F.; Kouchaki, S. Predicting protein phosphorylation sites in soybean using interpretable deep tabular learning network. Brief. Bioinform. 2022, 23, bbac015.
- Li, W.Z.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659.
- Yu, B.; Yu, Z.M.; Chen, C.; Ma, A.J.; Liu, B.Q.; Tian, B.G.; Ma, Q. DNNAce: Prediction of prokaryote lysine acetylation sites through deep neural networks with multi-information fusion. Chemom. Intell. Lab. Syst. 2020, 200, 103999.
- Arafat, M.E.; Ahmad, M.W.; Shovan, S.M.; Dehzangi, A.; Dipta, S.R.; Hasan, M.A.; Taherzadeh, G.; Shatabda, S.; Sharma, A. Accurately Predicting Glutarylation Sites Using Sequential Bi-Peptide-Based Evolutionary Features. Genes 2020, 11, 1023.
- Jamal, S.; Ali, W.; Nagpal, P.; Grover, A.; Grover, S. Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins. J. Transl. Med. 2021, 19, 218.
- Gao, Y.; Hao, W.L.; Gu, J.; Liu, D.W.; Fan, C.; Chen, Z.G.; Deng, L. PredPhos: An ensemble framework for structure-based prediction of phosphorylation sites. J. Biol. Res.-Thessalon. 2016, 23, S12.
- Chen, Z.; Pang, M.; Zhao, Z.X.; Li, S.N.; Miao, R.; Zhang, Y.F.; Feng, X.Y.; Feng, X.; Zhang, Y.X.; Duan, M.Y.; et al. Feature selection may improve deep neural networks for the bioinformatics problems. Bioinformatics 2020, 36, 1542–1552.
- Ning, Q.; Ma, Z.Q.; Zhao, X.W.; Yin, M.H. SSKM_Succ: A Novel Succinylation Sites Prediction Method Incorporating K-Means Clustering With a New Semi-Supervised Learning Algorithm. IEEE-ACM Trans. Comput. Biol. Bioinform. 2022, 19, 643–652.
- Jiang, M.; Cao, J.Z. Positive-Unlabeled Learning for Pupylation Sites Prediction. Biomed Res. Int. 2016, 2016, 4525786.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- He, H.B.; Bai, Y.; Garcia, E.A.; Li, S.T. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
- Lu, Y.; Cheung, Y.M.; Tang, Y.Y. Hybrid Sampling with Bagging for Class Imbalance Learning. In Proceedings of the 20th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Auckland, New Zealand, 19–22 April 2016; pp. 14–26.
- Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J. Hybrid sampling for imbalanced data. Integr. Comput.-Aided Eng. 2009, 16, 193–210.
- Li, D.; Chen, Z.; Wang, B.; Wang, Z.; Yang, H.; Du, W. Entropy-based hybrid sampling ensemble learning for imbalanced data. Int. J. Intell. Syst. 2021, 36, 3039–3067.
- Wang, M.H.; Cui, X.W.; Yu, B.; Chen, C.; Ma, Q.; Zhou, H.Y. SulSite-GTB: Identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput. Appl. 2020, 32, 13843–13862.
- Wang, M.H.; Song, L.L.; Zhang, Y.Q.; Gao, H.L.; Yan, L.; Yu, B. Malsite-Deep: Prediction of protein malonylation sites through deep learning and multi-information fusion based on NearMiss-2 strategy. Knowl.-Based Syst. 2022, 240, 108191.
- Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421.
- Ijaz, M.F.; Attique, M.; Son, Y. Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-Sampling Methods. Sensors 2020, 20, 2809.
- Mbunge, E.; Millham, R.C.; Sibiya, M.N.; Chemhaka, G.; Takavarasha, S.; Muchemwa, B.; Dzinamarira, T. Implementation of ensemble machine learning classifiers to predict diarrhoea with SMOTEENN, SMOTE, and SMOTETomek class imbalance approaches. In Proceedings of the Conference on Information-Communications-Technology-and-Society (ICTAS), Durban, South Africa, 8–9 March 2023; pp. 90–95.
- Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-Sensitive Learning of Deep Feature Representations From Imbalanced Data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3573–3587.
- Yuan, Z.W.; Zhao, P. An Improved Ensemble Learning for Imbalanced Data Classification. In Proceedings of the IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 24–26 May 2019; pp. 408–411.
- Hu, X.S.; Zhang, R.J. Clustering-based Subset Ensemble Learning Method for Imbalanced Data. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC), Tianjin, China, 14–17 July 2013; pp. 35–39.
- Hayashi, T.; Fujita, H. One-class ensemble classifier for data imbalance problems. Appl. Intell. 2022, 52, 17073–17089.
- Dou, L.J.; Yang, F.L.; Xu, L.; Zou, Q. A comprehensive review of the imbalance classification of protein post-translational modifications. Brief. Bioinform. 2021, 22, bbab089.
- Branco, P.; Torgo, L.; Ribeiro, R.P. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 2016, 49, 31.
- Kaur, H.; Pannu, H.S.; Malhi, A.K. A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions. ACM Comput. Surv. 2019, 52, 79.
- Wang, M.; Yang, J.; Liu, G.P.; Xu, Z.J.; Chou, K.C. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng. Des. Sel. 2004, 17, 509–516.
- Lin, C.-F.; Wang, S.-D. Fuzzy support vector machines. IEEE Trans. Neural Netw. 2002, 13, 464–471.
- Ju, Z.; Wang, S.Y. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou’s 5-steps rule and general pseudo components. Genomics 2020, 112, 859–866.
- Zhou, Z.H.; Liu, X.Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 2006, 18, 63–77.
- Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197.
- Jia, C.Z.; Zuo, Y.; Zou, Q. O-GlcNAcPRED-II: An integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique. Bioinformatics 2018, 34, 2029–2036.
- Wu, X.Y.; Srihari, R.; Zheng, Z.H. Document representation for one-class SVM. In Machine Learning: ECML 2004; Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3201, pp. 489–500.
- Islam, S.; Mugdha, S.B.; Dipta, S.R.; Arafat, M.E.; Shatabda, S.; Alinejad-Rokny, H.; Dehzangi, I. MethEvo: An accurate evolutionary information-based methylation site predictor. Neural Comput. Appl. 2022, 36, 201–212.
- Huang, K.Y.; Hung, F.Y.; Kao, H.J.; Lau, H.H.; Weng, S.L. iDPGK: Characterization and identification of lysine phosphoglycerylation sites based on sequence-based features. BMC Bioinform. 2020, 21, 568.
- Sahu, S.S.; Panda, G. A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction. Comput. Biol. Chem. 2010, 34, 320–327.
- Huang, K.Y.; Hsu, J.B.K.; Lee, T.Y. Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method. Sci. Rep. 2019, 9, 16175.
- Jiang, P.R.; Ning, W.S.; Shi, Y.S.; Liu, C.; Mo, S.J.; Zhou, H.R.; Liu, K.D.; Guo, Y.P. FSL-Kla: A few-shot learning-based multi-feature hybrid system for lactylation site prediction. Comput. Struct. Biotechnol. J. 2021, 19, 4497–4509.
- Suo, S.B.; Qiu, J.D.; Shi, S.P.; Sun, X.Y.; Huang, S.Y.; Chen, X.; Liang, R.P. Position-Specific Analysis and Prediction for Protein Lysine Acetylation Based on Multiple Features. PLoS ONE 2012, 7, e49108.
- Shen, H.B.; Yang, J.; Chou, K.C. Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition. J. Theor. Biol. 2006, 240, 9–13.
- Gao, J.J.; Thelen, J.J.; Dunker, A.K.; Xu, D. Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol. Cell. Proteom. 2010, 9, 2586–2600.
- Shen, J.W.; Zhang, J.; Luo, X.M.; Zhu, W.L.; Yu, K.Q.; Chen, K.X.; Li, Y.X.; Jiang, H.L. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA 2007, 104, 4337–4341.
- Saravanan, V.; Gautham, N. Harnessing Computational Biology for Exact Linear B-Cell Epitope Prediction: A Novel Amino Acid Composition-Based Feature Descriptor. OMICS J. Integr. Biol. 2015, 19, 648–658.
- Park, K.J.; Kanehisa, M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003, 19, 1656–1663.
- Keskin, O.; Bahar, I.; Badretdinov, A.Y.; Ptitsyn, O.B.; Jernigan, R.L. Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions. Protein Sci. 1998, 7, 2578–2586.
- Liang, S.D.; Grishin, N.V. Effective scoring function for protein sequence design. Proteins Struct. Funct. Bioinform. 2004, 54, 271–281.
- Chan, C.H.; Liang, H.K.; Hsiao, N.W.; Ko, M.T.; Lyu, P.C.; Hwang, J.K. Relationship between local structural entropy and protein thermostability. Proteins Struct. Funct. Bioinform. 2004, 57, 684–691.
- Tang, Y.R.; Chen, Y.Z.; Canchaya, C.A.; Zhang, Z.D. GANNPhos: A new phosphorylation site predictor based on a genetic algorithm integrated neural network. Protein Eng. Des. Sel. 2007, 20, 405–412.
- Xu, Y.; Wang, X.B.; Wang, Y.C.; Tian, Y.J.; Shao, X.J.; Wu, L.Y.; Deng, N.Y. Prediction of posttranslational modification sites from amino acid sequences with kernel methods. J. Theor. Biol. 2014, 344, 78–87.
- Lee, T.Y.; Lin, Z.Q.; Hsieh, S.J.; Bretaña, N.A.; Lu, C.T. Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics 2011, 27, 1780–1787.
- Kawashima, S.; Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 2000, 28, 374.
- Li, F.Y.; Li, C.; Wang, M.J.; Webb, G.I.; Zhang, Y.; Whisstock, J.C.; Song, J.N. GlycoMine: A machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics 2015, 31, 1411–1419.
- Gong, W.M.; Zhou, D.H.; Ren, Y.L.; Wang, Y.J.; Zuo, Z.X.; Shen, Y.P.; Xiao, F.F.; Zhu, Q.; Hong, A.L.; Zhou, X.; et al. PepCyber:P~PEP: A database of human protein-protein interactions mediated by phosphoprotein-binding domains. Nucleic Acids Res. 2008, 36, D679–D683.
- Wagner, M.; Adamczak, R.; Porollo, A.; Meller, J. Linear regression models for solvent accessibility prediction in proteins. J. Comput. Biol. 2005, 12, 355–369.
- Tomii, K.; Kanehisa, M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 1996, 9, 27–36.
- Dubchak, I.; Muchnik, I.; Holbrook, S.R.; Kim, S.H. Prediction of protein-folding class using global description of amino-acid-sequence. Proc. Natl. Acad. Sci. USA 1995, 92, 8700–8704.
- Faraggi, E.; Xue, B.; Zhou, Y.Q. Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins Struct. Funct. Bioinform. 2009, 74, 847–856.
- Kabsch, W.; Sander, C. Dictionary of protein secondary structure—Pattern-recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22, 2577–2637.
- López, Y.; Dehzangi, A.; Lal, S.P.; Taherzadeh, G.; Michaelson, J.; Sattar, A.; Tsunoda, T.; Sharma, A. SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids. Anal. Biochem. 2017, 527, 24–32.
- López, Y.; Sharma, A.; Dehzangi, A.; Lal, S.P.; Taherzadeh, G.; Sattar, A.; Tsunoda, T. Success: Evolutionary and structural properties of amino acids prove effective for succinylation site prediction. BMC Genom. 2018, 19, 923.
- Ward, J.J.; McGuffin, L.J.; Bryson, K.; Buxton, B.F.; Jones, D.T. The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004, 20, 2138–2139.
- Holland, R.C.G.; Down, T.A.; Pocock, M.; Prlic, A.; Huen, D.; James, K.; Foisy, S.; Draeger, A.; Yates, A.; Heuer, M.; et al. BioJava: An open-source framework for bioinformatics. Bioinformatics 2008, 24, 2096–2097.
- Obradovic, Z.; Peng, K.; Vucetic, S.; Radivojac, P.; Dunker, A.K. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins Struct. Funct. Bioinform. 2005, 61, 176–182.
- Heffernan, R.; Paliwal, K.; Lyons, J.; Dehzangi, A.; Sharma, A.; Wang, J.H.; Sattar, A.; Yang, Y.D.; Zhou, Y.Q. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci. Rep. 2015, 5, 11476.
- Islam, M.M.; Saha, S.; Rahman, M.M.; Shatabda, S.; Farid, D.M.; Dehzangi, A. iProtGly-SS: Identifying protein glycation sites using sequence and structure based features. Proteins Struct. Funct. Bioinform. 2018, 86, 777–789.
- Sharma, A.; Lyons, J.; Dehzangi, A.; Paliwal, K.K. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol. 2013, 320, 41–46.
- Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T.; et al. Gene Ontology: Tool for the unification of biology. Nat. Genet. 2000, 25, 25–29.
- Hunter, S.; Jones, P.; Mitchell, A.; Apweiler, R.; Attwood, T.K.; Bateman, A.; Bernard, T.; Binns, D.; Bork, P.; Burge, S.; et al. InterPro in 2011: New developments in the family and domain prediction database. Nucleic Acids Res. 2012, 40, D306–D312.
- Kanehisa, M.; Goto, S.; Sato, Y.; Furumichi, M.; Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40, D109–D114.
- Finn, R.D.; Tate, J.; Mistry, J.; Coggill, P.C.; Sammut, S.J.; Hotz, H.R.; Ceric, G.; Forslund, K.; Eddy, S.R.; Sonnhammer, E.L.L.; et al. The Pfam protein families database. Nucleic Acids Res. 2008, 36, D281–D288.
- Franceschini, A.; Szklarczyk, D.; Frankild, S.; Kuhn, M.; Simonovic, M.; Roth, A.; Lin, J.Y.; Minguez, P.; Bork, P.; von Mering, C.; et al. STRING v9.1: Protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41, D808–D815.
- Weng, S.L.; Huang, K.Y.; Kaunang, F.J.; Huang, C.H.; Kao, H.J.; Chang, T.H.; Wang, H.Y.; Lu, J.J.; Lee, T.Y. Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features. BMC Bioinform. 2017, 18, 66.
- Celniker, G.; Nimrod, G.; Ashkenazy, H.; Glaser, F.; Martz, E.; Mayrose, I.; Pupko, T.; Ben-Tal, N. ConSurf: Using Evolutionary Data to Raise Testable Hypotheses about Protein Function. Isr. J. Chem. 2013, 53, 199–206.
- Armon, A.; Graur, D.; Ben-Tal, N. ConSurf: An algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol. 2001, 307, 447–463.
- Shen, H.B.; Chou, K.C. Nuc-PLoc: A new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng. Des. Sel. 2007, 20, 561–567.
- Alkuhlani, A.; Gad, W.; Roushdy, M.; Voskoglou, M.G.; Salem, A.B.M. PTG-PLM: Predicting Post-Translational Glycosylation and Glycation Sites Using Protein Language Models and Deep Learning. Axioms 2022, 11, 469.
- Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv 2021, arXiv:2007.06225.
- Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.M.; Liu, J.S.; Guo, D.M.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118.
- Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, X.; Canny, J.; Abbeel, P.; Song, Y.S. Evaluating Protein Transfer Learning with TAPE. Adv. Neural Inf. Process. Syst. 2019, 32, 9689–9701.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2020, arXiv:1906.08237.
- Wang, H.F.; Wang, Z.; Li, Z.Y.; Lee, T.Y. Incorporating Deep Learning With Word Embedding to Identify Plant Ubiquitylation Sites. Front. Cell Dev. Biol. 2020, 8, 572195.
- Yu, K.; Zhang, Q.F.; Liu, Z.K.; Du, Y.M.; Gao, X.J.; Zhao, Q.; Cheng, H.; Li, X.X.; Liu, Z.X. Deep learning based prediction of reversible HAT/HDAC-specific lysine acetylation. Brief. Bioinform. 2020, 21, 1798–1805.
- Liu, M.; Zhu, F. LkaM-PTM: Predicting PTM sites through multimodal protein features from capturing cross-field information. Artif. Intell. Med. 2026, 171, 103297.
- Varga, J.K.; Ovchinnikov, S.; Schueler-Furman, O. actifpTM: A refined confidence metric of AlphaFold2 predictions involving flexible regions. Bioinformatics 2025, 41, btaf107.
- Li, S.H.; Zhang, J.; Zhao, Y.W.; Dad, F.Y.; Ding, H.; Chen, W.; Tang, H. iPhoPred: A Predictor for Identifying Phosphorylation Sites in Human Protein. IEEE Access 2019, 7, 177517–177528.
- Xu, Y.; Ding, Y.X.; Ding, J.; Wu, L.Y.; Xue, Y. Mal-Lys: Prediction of lysine malonylation sites in proteins integrated sequence-based features with mRMR feature selection. Sci. Rep. 2016, 6, 38318.
- Zhang, N.; Zhou, Y.; Huang, T.; Zhang, Y.C.; Li, B.Q.; Chen, L.; Cai, Y.D. Discriminating between Lysine Sumoylation and Lysine Acetylation Using mRMR Feature Selection and Analysis. PLoS ONE 2014, 9, e107464.
- Ma, X.; Guo, J.; Sun, X. Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection. Biomed Res. Int. 2015, 2015, 425810.
- Peker, M.; Sen, B.; Delen, D. Computer-Aided Diagnosis of Parkinson’s Disease Using Complex-Valued Neural Networks and mRMR Feature Selection Algorithm. J. Healthc. Eng. 2015, 6, 281–302.
- He, S.D.; Ye, X.C.; Sakurai, T.; Zou, Q. MRMD3.0: A Python Tool and Webserver for Dimensionality Reduction and Data Visualization via an Ensemble Strategy. J. Mol. Biol. 2023, 435, 168116.
- Yu, J.L.; Shi, S.P.; Zhang, F.; Chen, G.D.; Cao, M. PredGly: Predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics 2019, 35, 2749–2756.
- Chen, T.Q.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
- Xu, Y.; Li, L.; Ding, J.; Wu, L.Y.; Mai, G.Q.; Zhou, F.F. Gly-PseAAC: Identifying protein lysine glycation through sequences. Gene 2017, 602, 1–7. [Google Scholar] [CrossRef]
- Ning, Q.; Ma, Z.Q.; Zhao, X.W. dForml(KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou’s 5-step rule and pseudo components. J. Theor. Biol. 2019, 470, 43–49. [Google Scholar] [CrossRef]
- Dosset, P.; Rassam, P.; Fernandez, L.; Espenel, C.; Rubinstein, E.; Margeat, E.; Milhiet, P.E. Automatic detection of diffusion modes within biological membranes using back-propagation neural network. BMC Bioinform. 2016, 17, 197.
- Butt, A.H.; Khan, Y.D. Prediction of S-Sulfenylation Sites Using Statistical Moments Based Features via CHOU’S 5-Step Rule. Int. J. Pept. Res. Ther. 2020, 26, 1291–1301.
- Malebary, S.J.; Rehman, M.S.U.; Khan, Y.D. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou’s 5-step rule. PLoS ONE 2019, 14, e0223993.
- Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. J. Artif. Intell. Res. 1999, 11, 169–198.
- Rokach, L. Ensemble-based classifiers. Artif. Intell. Rev. 2010, 33, 1–39.
- Hasan, M.M.; Guo, D.J.; Kurata, H. Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information. Mol. Biosyst. 2017, 13, 2545–2550.
- Shi, M.H.; Lin, F.X.; Qian, Y.; Dou, L. Research of Imbalanced Classification Based on Cascade Forest. In Proceedings of the IEEE International Conference on Progress in Informatics and Computing (IEEE PIC), Shanghai, China, 17–19 December 2021; pp. 29–33.
- Chu, Y.Y.; Kaushik, A.C.; Wang, X.G.; Wang, W.; Zhang, Y.F.; Shan, X.Q.; Salahub, D.R.; Xiong, Y.; Wei, D.Q. DTI-CDF: A cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief. Bioinform. 2021, 22, 451–462.
- Qian, Y.; Ye, S.S.; Zhang, Y.; Zhang, J.M. SUMO-Forest: A Cascade Forest based method for the prediction of SUMOylation sites on imbalanced data. Gene 2020, 741, 144536.
- Rao, H.; Shi, X.Z.; Rodrigue, A.K.; Feng, J.J.; Xia, Y.C.; Elhoseny, M.; Yuan, X.H.; Gu, L.C. Feature selection based on artificial bee colony and gradient boosting decision tree. Appl. Soft Comput. 2019, 74, 634–642.
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
- He, F.; Wang, R.; Gao, Y.X.; Wang, D.L.; Yu, Y.; Xu, D.; Zhao, X.W. Protein Ubiquitylation and Sumoylation Site Prediction Based on Ensemble and Transfer Learning. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 117–123.
- Zhang, Y.J.; Xie, R.P.; Wang, J.W.; Leier, A.; Marquez-Lago, T.T.; Akutsu, T.; Webb, G.I.; Chou, K.C.; Song, J.N. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Brief. Bioinform. 2019, 20, 2185–2199.
- Wang, D.L.; Liu, D.P.; Yuchi, J.K.; He, F.; Jiang, Y.X.; Cai, S.T.; Li, J.Y.; Xu, D. MusiteDeep: A deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Res. 2020, 48, W140–W146.
- Zhao, Y.M.; He, N.N.; Chen, Z.; Li, L. Identification of Protein Lysine Crotonylation Sites by a Deep Learning Framework With Convolutional Neural Networks. IEEE Access 2020, 8, 14244–14252.
- Wei, X.L.; Sha, Y.T.; Zhao, Y.M.; He, N.N.; Li, L. DeepKcrot: A Deep-Learning Architecture for General and Species-Specific Lysine Crotonylation Site Prediction. IEEE Access 2021, 9, 49504–49513.
- Xiu, Q.X.; Li, D.C.; Li, H.L.; Wang, N.; Ding, C. Prediction Method for Lysine Acetylation Sites Based on LSTM Network. In Proceedings of the 7th IEEE International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 19–20 October 2019; pp. 179–182.
- Li, A.; Deng, Y.W.; Tan, Y.; Chen, M. A Transfer Learning-Based Approach for Lysine Propionylation Prediction. Front. Physiol. 2021, 12, 658633.
- Zhao, Q.; Ma, J.Q.; Wang, Y.; Xie, F.; Lv, Z.B.; Xu, Y.Q.; Shi, H.; Han, K. Mul-SNO: A Novel Prediction Tool for S-Nitrosylation Sites Based on Deep Learning Methods. IEEE J. Biomed. Health Inform. 2022, 26, 2379–2387.
- Liu, Y.; Ye, C.F.; Lin, C.; Wang, Q.; Zhou, J.X.; Zhu, M. Semi-ssPTM: A Web Server for Species-Specific Lysine Post-Translational Modification Site Prediction by Semi-Supervised Domain Adaptation. IEEE Trans. Instrum. Meas. 2024, 73, 2523410.
- Ning, W.S.; Xu, H.D.; Jiang, P.R.; Cheng, H.; Deng, W.K.; Guo, Y.P.; Xue, Y. HybridSucc: A Hybrid-learning Architecture for General and Species-specific Succinylation Site Prediction. Genom. Proteom. Bioinform. 2020, 18, 194–207.
- Chen, Z.; Zhao, P.; Li, F.Y.; Leier, A.; Marquez-Lago, T.T.; Webb, G.I.; Baggag, A.; Bensmail, H.; Song, J. PROSPECT: A web server for predicting protein histidine phosphorylation sites. J. Bioinform. Comput. Biol. 2020, 18, 2050018.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
- Meng, L.K.; Chen, X.J.; Cheng, K.; Chen, N.J.; Zheng, Z.T.; Wang, F.Z.; Sun, H.Y.; Wong, K.C. TransPTM: A transformer-based model for non-histone acetylation site prediction. Brief. Bioinform. 2024, 25, bbae219.
- Liang, Y.Y.; Li, M.W. A deep learning model for prediction of lysine crotonylation sites by fusing multi-features based on multi-head self-attention mechanism. Sci. Rep. 2025, 15, 18940.
- Xu, D.L.; Zhu, Y.F.; Xu, Q.; Liu, Y.H.; Chen, Y.; Zou, Y.; Li, L. DTL-NeddSite: A Deep-Transfer Learning Architecture for Prediction of Lysine Neddylation Sites. IEEE Access 2023, 11, 51798–51809.
- Soylu, N.N.; Sefer, E. DeepPTM: Protein Post-translational Modification Prediction from Protein Sequences by Combining Deep Protein Language Model with Vision Transformers. Curr. Bioinform. 2024, 19, 810–824.
- Lv, H.; Dao, F.Y.; Guan, Z.X.; Yang, H.; Li, Y.W.; Lin, H. Deep-Kcr: Accurate detection of lysine crotonylation sites using deep learning method. Brief. Bioinform. 2021, 22, bbaa255.
- Xu, Y.; Ding, J.; Wu, L.Y. iSulf-Cys: Prediction of S-sulfenylation Sites in Proteins with Physicochemical Properties of Amino Acids. PLoS ONE 2016, 11, e0154237.
- Liu, S.; Xue, C.; Fang, Y.; Chen, G.; Peng, X.J.; Zhou, Y.; Chen, C.; Liu, G.Q.; Gu, M.H.; Wang, K.; et al. Global Involvement of Lysine Crotonylation in Protein Modification and Transcription Regulation in Rice. Mol. Cell. Proteom. 2018, 17, 1922–1936.
- Sun, H.J.; Liu, X.W.; Li, F.F.; Li, W.; Zhang, J.; Xiao, Z.X.; Shen, L.L.; Li, Y.; Wang, F.L.; Yang, J.G. First comprehensive proteome analysis of lysine crotonylation in seedling leaves of Nicotiana tabacum. Sci. Rep. 2017, 7, 3013.
- Liu, K.D.; Yuan, C.C.; Li, H.L.; Chen, K.Y.; Lu, L.S.; Shen, C.J.; Zheng, X.L. A qualitative proteome-wide lysine crotonylation profiling of papaya (Carica papaya L.). Sci. Rep. 2018, 8, 8230.
- Li, S.H.; Yu, K.; Wu, G.D.; Zhang, Q.F.; Wang, P.Q.; Zheng, J.; Liu, Z.X.; Wang, J.C.; Gao, X.J.; Cheng, H. pCysMod: Prediction of Multiple Cysteine Modifications Based on Deep Learning Framework. Front. Cell Dev. Biol. 2021, 9, 617366.
- Al-barakati, H.J.; Saigo, H.; Newman, R.H.; KC, D.B. RF-GlutarySite: A random forest based predictor for glutarylation sites. Mol. Omics 2019, 15, 189–204.
- Dou, L.J.; Li, X.L.; Zhang, L.C.; Xiang, H.K.; Xu, L. iGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier. J. Proteome Res. 2021, 20, 191–201.
- Chung, C.R.; Chang, Y.P.; Hsu, Y.L.; Chen, S.Y.; Wu, L.C.; Horng, J.T.; Lee, T.Y. Incorporating hybrid models into lysine malonylation sites prediction on mammalian and plant proteins. Sci. Rep. 2020, 10, 10541.
- Liu, Y.; Li, A.; Zhao, X.M.; Wang, M.H. DeepTL-Ubi: A novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. Methods 2021, 192, 103–111.
- Long, H.X.; Sun, Z.; Li, M.Z.; Fu, H.Y.; Lin, M.C. Predicting Protein Phosphorylation Sites Based on Deep Learning. Curr. Bioinform. 2020, 15, 300–308.
- Zahiri, Z.; Mehrshad, N.; Mehrshad, M. DF-Phos: Prediction of Protein Phosphorylation Sites by Deep Forest. J. Biochem. 2023, 175, 447–456.
- Wang, R.L.; Wang, Z.; Wang, H.F.; Pang, Y.X.; Lee, T.Y. Characterization and identification of lysine crotonylation sites based on machine learning method on both plant and mammalian. Sci. Rep. 2020, 10, 20447.
- Lv, H.; Dao, F.Y.; Lin, H. DeepKla: An attention mechanism-based deep neural network for protein lysine lactylation site prediction. iMeta 2022, 1, e11.
- Guan, J.H.; Xie, P.L.; Dong, D.H.; Liu, Q.C.; Zhao, Z.H.; Guo, Y.L.; Zhang, Y.L.; Lee, T.Y.; Yao, L.T.; Chiang, Y.C. DeepKlapred: A deep learning framework for identifying protein lysine lactylation sites via multi-view feature fusion. Int. J. Biol. Macromol. 2024, 283, 137668.
- Wen, B.; Wang, C.W.; Li, K.; Han, P.; Holt, M.V.; Savage, S.R.; Lei, J.T.; Dou, Y.C.; Shi, Z.; Li, Y.; et al. DeepMVP: Deep learning models trained on high-quality data accurately predict PTM sites and variant-induced alterations. Nat. Methods 2025, 22, 1857–1867.
- Yan, Y.; Jiang, J.Y.; Fu, M.Z.; Wang, D.; Pelletier, A.R.; Sigdel, D.; Ng, D.C.M.; Wang, W.; Ping, P.P. MIND-S is a deep-learning prediction model for elucidating protein post-translational modifications in human diseases. Cell Rep. Methods 2023, 3, 100430.
- Dai, Y.H.; Deng, L.; Zhu, F. A model for predicting post-translational modification cross-talk based on the Multilayer Network. Expert Syst. Appl. 2024, 255, 124770.
- Zhu, F.; Deng, L.; Dai, Y.H.; Zhang, G.Y.; Meng, F.W.; Luo, C.; Hu, G.; Liang, Z.J. PPICT: An integrated deep neural network for predicting inter-protein PTM cross-talk. Brief. Bioinform. 2023, 24, bbad052.
- Deng, L.; Zhu, F.; He, Y.; Meng, F.W. Prediction of post-translational modification cross-talk and mutation within proteins via imbalanced learning. Expert Syst. Appl. 2023, 211, 118593.
- Simpson, C.M.; Zhang, B.; Hornbeck, P.; Gnad, F. Systematic analysis of the intersection of disease mutations with protein modifications. BMC Med. Genom. 2019, 12, 109.




| PTM | Tool | Dataset | Window Size | Feature Extraction | Classifier | Subset / Model | ACC | AUC | MCC | Website | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Crotonylation | BERT-Kcr | Used by Lv et al. [163] | 31 | BERT | BiLSTM | -- | 82.0% | 0.905 | 0.640 | http://zhulab.org.cn/BERT-Kcr_models/data | [10] |
| Lactylation | Auto-Kla | UniProt | 51 | Token embedding, position embedding, transformer encoder | AutoML, MLP | -- | 91.21% ± 1.58% | 0.92 ± 0.0062 | 0.554 ± 0.023 | https://github.com/tubic/Auto-Kla | [33] |
| S-sulfenylation | DeepCSO | UniProtKB | 35 | NUM, EAAC, BE, AAindex, CKSAAP, PSSM | LSTM, CNN, RF, SVM | Arabidopsis thaliana | 78.6% ± 0.7% | 0.852 ± 0.018 | 0.417 ± 0.032 | http://www.bioinfogo.org/DeepCSO | [37] |
| S-sulfenylation | DeepCSO | UniProtKB | 35 | NUM, EAAC, BE, AAindex, CKSAAP, PSSM | LSTM, CNN, RF, SVM | Homo sapiens | 77.7% ± 0.6% | 0.822 ± 0.011 | 0.367 ± 0.028 | http://www.bioinfogo.org/DeepCSO | [37] |
| Phosphorylation | -- | dbPTM | 21 | AAindex, binary encoding, ASA, secondary structure, disordered regions, BP, MF, CC, protein functional domain data from InterPro, KEGG pathway and functional annotation | RF, SVM | Serine | -- | 0.95 | 0.78 | -- | [44] |
| Phosphorylation | -- | dbPTM | 21 | AAindex, binary encoding, ASA, secondary structure, disordered regions, BP, MF, CC, protein functional domain data from InterPro, KEGG pathway and functional annotation | RF, SVM | Threonine | -- | 0.97 | 0.77 | -- | [44] |
| Phosphorylation | -- | dbPTM | 21 | AAindex, binary encoding, ASA, secondary structure, disordered regions, BP, MF, CC, protein functional domain data from InterPro, KEGG pathway and functional annotation | RF, SVM | Tyrosine | -- | 0.99 | 0.57 | -- | [44] |
| Succinylation | SSKM_Succ | Training data: PLMD and UniProt; test data: dbPTM | 21 | Information of proximal PTMs, grey pseudo amino acid composition, K-space, PSAAP | SVM, RF, NB | -- | 80.18% | -- | 0.546 | https://github.com/yangyq505/SSKM_Succ.git | [47] |
| S-sulfenylation | SulSite-GTB | Carroll Lab, RedoxDB and UniProtKB | 21 | AAC, DPC, EBGW, KNN, PSAAP, PsePSSM, PWAAC | GTB | -- | 88.53% | 0.94 | 0.77 | https://github.com/QUST-AIBBDRC/SulSite-GTB/ | [54] |
| Phosphoglycerylation | iDPGK | PLMD | 15 | AAC, PCAAC, AAPC, BLOSUM62, PSSM | DT, RF, SVM | -- | 74.9% | -- | 0.49 | http://mer.hc.mmh.org.tw/iDPGK/ | [74] |
| Succinylation | CNN-SuccSite | PLMD 3.0 | 31 | PspAAC, CKSAAP, PSSM | CNN | -- | 86.79% | -- | 0.489 | http://csb.cse.yzu.edu.tw/CNN-SuccSite/ | [76] |
| Glycosylation | PTG-PLM | UniProt | 31 | ProtBERT-BFD, ProtBERT, ProtALBERT, ProtXLNet, ESM-1b and TAPE | CNN, SVM, LR, RF, and XGBoost | Ngly site | 96.5% | 0.978 | 0.902 | https://github.com/Alhasanalkuhlani/PTG-PLM | [115] |
| Glycosylation | PTG-PLM | UniProt | 31 | ProtBERT-BFD, ProtBERT, ProtALBERT, ProtXLNet, ESM-1b and TAPE | CNN, SVM, LR, RF, and XGBoost | Kgly site | 64% | 0.64 | 0.28 | https://github.com/Alhasanalkuhlani/PTG-PLM | [115] |
| Formylation | LFPred | UniProt, PLMD and dbPTM | 41, information entropy | AAC, BPF, AAI | KNN | -- | 79.3% | -- | 0.55 | -- | [135] |
| S-sulfenylation | S-Sulfenylation | Compiled by Xu et al. [164] and Hasan et al. [141] | 21 | PseAAC, SVV, SM, PRIM, R-PRIM, FV, AAPIV, RAAPIV | BP-NN | -- | 96.89% | 0.931 | 0.862 | https://www.github.com/ahmad-umt/S-Sulfenylation | [137] |
| Sumoylation | SUMO-Forest | UniProt | 21 | PSAAP, PseAAC, SP, BK | Cascade Forest | Cascade Forest-based cost-matrix | 98.69% | 0.98 | 0.89 | https://github.com/sandyye666/SUMOForest | [144] |
| Sumoylation | SUMO-Forest | UniProt | 21 | PSAAP, PseAAC, SP, BK | Cascade Forest | Cascade Forest-based F-measure | 98.54% | 0.99 | 0.89 | https://github.com/sandyye666/SUMOForest | [144] |
|  |  |  |  |  |  |  | -- | 0.797 | 0.287 |  |  |
|  |  |  |  |  |  | Sumoylation | -- | 0.868 | 0.431 |  |  |
| Crotonylation | -- | Verified Kcr sites on non-histone proteins collected from papaya | 2 to 37 | BE, CKSAAP, AAC, EAAC, EGAAC | CNN | -- | 85.64% | 0.853 | 0.335 | http://www.bioinfogo.org/pkcr | [150] |
| Crotonylation | DeepKcrot | Collected from [165,166,167] | 29 | EGAAC, WE | LSTM, CNN, RF | RF (EGAAC) | 0.851 | 0.784 | 0.228 | http://www.bioinfogo.org/deepkcrot | [151] |
| Crotonylation | DeepKcrot | Collected from [165,166,167] | 29 | EGAAC, WE | LSTM, CNN, RF | LSTM (WE) | 0.860 | 0.839 | 0.306 | http://www.bioinfogo.org/deepkcrot | [151] |
| Crotonylation | DeepKcrot | Collected from [165,166,167] | 29 | EGAAC, WE | LSTM, CNN, RF | CNN (WE) | 0.869 | 0.861 | 0.338 | http://www.bioinfogo.org/deepkcrot | [151] |
| Propionylation | -- | PLMD and UniProt | 17 | RNN, LSTM | Transfer learning, SVM | -- | -- | 0.705 | 0.317 | http://47.113.117.61/ | [153] |
| Succinylation | HybridSucc | PLMD 3.0, PhosphoSitePlus and dbPTM | -- | PseAAC, CKSAAP, OBC, AAindex, ACF, GPS, PSSM, ASA, SS, and BTA | DNN, PLR | -- | -- | 0.885 | -- | http://hybridsucc.biocuckoo.org/ | [156] |
| Nitrosylation | Mul-SNO | Training set: Li et al. [168]; independent test set: DeepNitro | 31 | BiLSTM, BERT | RF, LightGBM, XGBoost | -- | 80% | 0.80 | 0.59 | http://lab.malab.cn/~mjq/Mul-SNO/ | [154] |
| Phosphorylation | PROSPECT | UniProt | 27 | One-of-K, EGAAC and CKSAAGP | CNN (one-of-K), CNN (EGAAC) and RF (CKSAAGP) | -- | -- | 0.821 | 0.37 | http://PROSPECT.erc.monash.edu/ | [157] |
| Crotonylation | DeepMM-Kcr | Same as used by Lv et al. [163] | 31 | Token embedding, positional embedding, one-hot, AAindex, PWAA | Transformer | -- | 85.56% | 0.9310 | 0.7119 | https://github.com/yunyunliang88/DeepMM-Kcr | [160] |
| Neddylation | DTL-NeddSite | From the literature | 41 | EAAC, one-hot, WE | Transfer learning | -- | -- | 0.818 | -- | https://github.com/XuDeli123/DTL-NeddSite | [161] |
| Multi-PTM | DeepPTM | CPLM | 21 | ProtBERT | ViT | -- | -- | 0.793 (succinylation) | -- | https://github.com/seferlab/deepptm | [162] |
| Glutarylation | iGlu_AdaBoost | Compiled by Al-barakati et al. [169] from PLMD, NCBI, and SWISS-PROT | 23 | 188D, CKSAAP, and EAAC | AdaBoost | -- | 72.07% | 0.63 | 0.36 | -- | [170] |
| Malonylation | Kmalo | PLMD and LEMP | 11 to 39 | AAC, one-hot encoding, PseAAC, AAindex, PSSM | Hybrid models combining multiple CNNs, random forests and SVM | Mammalian proteins | 86.6% | 0.943 | 0.480 | https://fdblab.csie.ncu.edu.tw/kmalo/home.html | [171] |
| Malonylation | Kmalo | PLMD and LEMP | 11 to 39 | AAC, one-hot encoding, PseAAC, AAindex, PSSM | Hybrid models combining multiple CNNs, random forests and SVM | Plant proteins | 69.1% | 0.772 | 0.195 | https://fdblab.csie.ncu.edu.tw/kmalo/home.html | [171] |
| Ubiquitination | DeepTL-Ubi | PhosphoSitePlus, mUbiSida and PLMD | 31 | One-hot | Deep transfer learning | M. musculus | 60.4% | -- | -- | https://github.com/USTC-HIlab/DeepTL-Ubi | [172] |
| Ubiquitination | DeepTL-Ubi | PhosphoSitePlus, mUbiSida and PLMD | 31 | One-hot | Deep transfer learning | A. nidulans | 67.9% | -- | -- | https://github.com/USTC-HIlab/DeepTL-Ubi | [172] |
| Ubiquitination | DeepTL-Ubi | PhosphoSitePlus, mUbiSida and PLMD | 31 | One-hot | Deep transfer learning | T. gondii | 55.6% | -- | -- | https://github.com/USTC-HIlab/DeepTL-Ubi | [172] |
| Phosphorylation | -- | iPhos-PseEn | 13 | BE | CNN, BLSTM | Phosphoserine (S) | 92.7% | 0.996 | 0.582 | -- | [173] |
| Phosphorylation | -- | iPhos-PseEn | 13 | BE | CNN, BLSTM | Phosphothreonine (T) | 91.4% | 0.994 | 0.501 | -- | [173] |
| Phosphorylation | -- | iPhos-PseEn | 13 | BE | CNN, BLSTM | Phosphotyrosine (Y) | 93.6% | 0.995 | 0.488 | -- | [173] |
| Phosphorylation | DF-Phos | dbPAF and Phospho.ELM | 33 | CTD, DDE, EAAC, EGAAC, a series of PseKRAAC, GrpDDE, kGAAC, LocalPoSpKaaF, QSOrder, SAAC, SOCNumber, ExpectedValueGKmerAA, ExpectedValueKmerAA, ExpectedValueGAA, ExpectedValueAA | Deep Forest | -- | 78% | -- | 0.51 | https://github.com/zahiriz/DF-Phos | [174] |
| Crotonylation | -- | UniProt and pkcr | 31 | AAC, AAPC, BE, CKSAAP, EAAC, EGAAC and PSSM | SVM, RF | -- | 90% | -- | 0.80 | -- | [175] |
| Lactylation | DeepKla | Previous research, Botrytis cinerea | 51 | Embedding | CNN, RNN | -- | 93.59% | -- | 0.8783 | http://lin-group.cn/server/DeepKla/ | [176] |
| Lactylation | DeepKlapred | Previous research [176] | 51 | Position embedding, QSOrder, CTD, DDE, DistancePair | Transformer | -- | 96.9% | -- | 0.938 | https://awi.cuhk.edu.cn/~biosequence/DeepKlapred | [177] |
| Acetylation | TransPTM | UniProt | 25 | One-hot, ProtT5 | Transformer | -- | 88% | 0.83 | 0.45 | https://www.github.com/TransPTM/TransPTM | [159] |
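
The window sizes and the ACC and MCC columns in the table above can be made concrete with a minimal Python sketch. It assumes a 0-based site index, an odd window length, and the padding character `X` for positions beyond the sequence ends; the function names and the toy sequence are illustrative only and are not taken from any of the listed tools.

```python
import math


def extract_window(sequence, center, window_size, pad="X"):
    """Return the peptide of length window_size centered on sequence[center].

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention; individual tools may pad differently).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site is centered")
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    # After left-padding by `half`, the residue at `center` sits at
    # index `center + half`, so the window starts at index `center`.
    return padded[center:center + window_size]


def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally reported as 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy sequence (hypothetical): collect a 7-residue window around every lysine.
seq = "MKTAYIAKQRQISFVKSHFSRQ"
lysine_windows = {
    i: extract_window(seq, i, 7) for i, aa in enumerate(seq) if aa == "K"
}
```

Because negative sites usually far outnumber positive sites, ACC can stay high while MCC is only moderate, which is one reason the table reports both.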
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gong, S.; Qu, K. Role of Machine and Deep Learning in Predicting Protein Modification Sites: Review and Future Directions. Information 2025, 16, 1023. https://doi.org/10.3390/info16121023

