Review

Role of Machine and Deep Learning in Predicting Protein Modification Sites: Review and Future Directions

School of Computer and Software, Nanyang Institute of Technology, Nanyang 473000, China
* Author to whom correspondence should be addressed.
Information 2025, 16(12), 1023; https://doi.org/10.3390/info16121023
Submission received: 27 October 2025 / Revised: 13 November 2025 / Accepted: 20 November 2025 / Published: 24 November 2025

Abstract

Post-translational modifications (PTMs) of proteins are essential for cellular function. Owing to the high cost and time demands of high-throughput sequencing, machine learning and deep learning methods are being rapidly developed for predicting PTM sites. This manuscript presents a comprehensive review of current research on the application of intelligent algorithms for predicting PTM sites. It outlines the key steps for identifying modified sites based on intelligent algorithms, including data pre-processing, feature extraction, dimension reduction, and classifier development. This review also discusses potential future research directions in this field, providing valuable insights for advancing the state of the art in PTM site prediction. Collectively, this review provides comprehensive knowledge on PTM identification and contributes to the development of advanced predictors in the future.


1. Introduction

Protein synthesis follows the central dogma of genetics and involves three primary processes: replication, transcription, and translation. Protein post-translational modification (PTM) occurs during or after translation. After synthesis, proteins undergo various modifications, such as phosphorylation [1], acetylation [2], methylation [3], ubiquitination [4], and glycosylation [5]. These modifications expand the functional diversity of proteins and significantly increase their complexity. PTMs play crucial roles in cellular and organismal functions, affecting processes such as cell differentiation, apoptosis, protein degradation, protein–protein interactions, and gene expression. Furthermore, PTMs are closely linked to human diseases, and current targeted therapies involve the use of regulatory enzymes associated with these modifications. Therefore, studying protein PTMs is essential for advancing our understanding of biological processes.
Experimental methods are adept at accurately identifying protein modification sites, with mass spectrometry [6] as the predominant approach, complemented by liquid chromatography [7] and radiochemical methods [8]. However, as sequencing technologies continue to advance, an increasing number of protein sequences have been discovered, rendering traditional experimental methods insufficient for managing large datasets. In this context, computational methods have emerged as viable alternatives for analyzing protein sequences and identifying corresponding modification sites. Machine learning and deep learning methods have been successfully used for the identification of modification sites. However, current prediction methods require further improvements [9,10]. In addition, the modification site information provided by computational predictions is speculative, and its biological validity must be experimentally verified.
Machine learning identification methods typically involve six key steps. First, data are gathered from established databases. The collected data are then preprocessed. Subsequently, sequence or structural features are extracted from the protein sequences, including sequence position information, amino acid physicochemical properties, and protein structure. Next, redundant or insignificant features are removed using feature-reduction methods. Finally, a suitable model is selected for training, and its performance is evaluated on a test set. Most machine learning identification methods adhere to this fundamental process, as illustrated in Figure 1. Although several articles have summarized methods for identifying PTM sites [11], this study not only follows the machine learning workflow to compare and summarize the methods used at each step, but also pays special attention to issues such as the limited availability of modification-site data, the interpretability of predictive models and features, multi-label prediction, and modification crosstalk. This study aims to provide an overview of the existing problems and future challenges in this field.

2. Datasets and Data Pre-Processing

2.1. Dataset

With continuous advancements in sequencing technology and proteomics, researchers have developed numerous PTM databases that can be used by other researchers in the field. The following section provides an overview of several popular databases, with additional information in Table 1.

2.1.1. UniProt

UniProt [12,13] is a comprehensive database that offers protein structure, sequence information, functional annotations, Gene Ontology (GO) annotations, subcellular location data, PTM information, and similar proteins. The UniProt database is an authoritative repository of protein sequences and functional information that systematically integrates PTM annotations. After manual curation, these annotations are stored in Swiss-Prot entries and centrally displayed in the dedicated “PTM/Processing” module, covering various modification types, such as phosphorylation and glycosylation, with the specific modified amino acid residues clearly annotated. The data sources include published experimental evidence and reliable computational predictions, which are often linked to the relevant literature. Moreover, PTM information does not exist in isolation but is deeply interconnected with modules such as “Function”, “Disease and Variants”, and “Sequence”, collectively elucidating the biological significance of PTMs in regulating protein activity, localization, interactions, and stability. Simultaneously, UniProt provides cross-references to specialized PTM databases, such as PhosphoSitePlus and GlyGen, serving as a comprehensive PTM information hub that guides users in their further exploration.

2.1.2. dbPTM

dbPTM [14,15] is a comprehensive resource for PTM studies. The database contains 2,235,664 experimental PTM sites covering over 70 PTM types integrated from more than 40 databases, and it provides more than 30 benchmark datasets. Additionally, dbPTM provides information on the association between modification sites and diseases, which can be valuable for disease research. Researchers can select specific modification types for data download by clicking the download bar. Instead of providing the entire protein sequence, the database offers protein fragments with details of the modification site; each fragment is 21 amino acids in length.

2.1.3. CPLM 4.0

The Compendium of Protein Lysine Modifications 4.0 (CPLM 4.0) [16] is a comprehensive data resource that builds on previous versions of the CPLA [17], CPLM [18], and PLMD [19]. CPLM 4.0 focuses on the modification of lysine residues in proteins. The database covers 463,156 unique modification sites on 105,673 proteins, spanning up to 29 types of protein lysine modification across 219 species [16].
Table 1. Commonly used datasets for PTM studies.
| Name | Website | PTM Type | Statistics |
| --- | --- | --- | --- |
| UniProt [12,13] | https://www.uniprot.org/ | Multiple | 570,420 reviewed proteins, 251,131,639 unreviewed proteins |
| dbPTM [14,15] | https://biomics.lab.nycu.edu.tw/dbPTM/ | Multiple | 2,235,664 sites, 70+ PTM types, 40+ integrated databases, 30+ benchmark datasets |
| PhosphoSitePlus [20] | https://www.phosphosite.org/homeAction | Multiple | 59,469 PTM sites, 13 PTM types |
| CPLM 4.0 [16] | http://cplm.biocuckoo.cn/ | Multiple | 463,156 unique sites of 105,673 proteins for up to 29 PLM types across 219 species |
| qPTM [21] | http://qptm.omicsbio.info/ | Multiple | 11,482,553 quantification events for 660,030 sites on 40,728 proteins under 2596 conditions |
| PupDB [22] | https://cwtung.kmu.edu.tw/pupdb/ | Pupylation | 268 pupylation proteins with 311 known pupylation sites and 1123 candidate pupylation proteins |
| DEPOD [23] | https://depod.bioss.uni-freiburg.de/ | Phosphorylation | 194 phosphatases with substrate data |
| O-GlcNAcAtlas [24] | https://oglcnac.org/atlas/ | O-GlcNAcylation | 16,877 unambiguous sites, 10,058 ambiguous sites |
| Phospho.ELM [25] | http://phospho.elm.eu.org/ | Phosphorylation | 42,914 instances, 11,224 sequences |
| CarbonylDB [26] | https://carbonyldb.missouri.edu/CarbonylDB/index.php/ | Carbonylation | 1495 proteins, 3781 PTM sites, 21 species |
| Scop3P [27] | https://iomics.ugent.be/scop3p/index | Phosphorylation | 108,130 modifications, 20,394 proteins |
| O-GlycBase [28] | https://services.healthtech.dtu.dk/datasets/OglycBase/ | O-Glycosylation | 242 proteins |
| dbSNO [29] | http://140.138.144.145/~dbSNO/index.php | S-nitrosylation | 174 experimentally verified S-nitrosylation sites on 94 S-nitrosylated proteins |
| UbiNet 2.0 [30] | https://awi.cuhk.edu.cn/~ubinet/index.php | Ubiquitination | 3332 experimentally verified ESIs |
| UbiBrowser 2.0 [31] | http://ubibrowser.bio-it.cn/ubibrowser_v3/ | Ubiquitination | 1,884,676 predicted high-confidence ESIs, 8,341,262 potential E3 recognizing motifs, 4068 known ESIs from the literature |
| PhosPhAt [32] | https://phosphat.uni-hohenheim.de/ | Phosphorylation | 10,898 phosphoproteins, 64,128 serine sites, 13,102 threonine sites, 2672 tyrosine sites |

2.2. Data Pre-Processing

Data pre-processing involves three primary steps. First, the protein sequence is segmented to generate fragments. Next, trustworthy negative data are collected to construct the datasets. Finally, dataset imbalance is addressed to mitigate potential adverse effects. The data pre-processing workflow is shown in Figure 2.

2.2.1. Sequence Slice

In current studies on PTM, most researchers have chosen the peptide representation method outlined in Equation (1).
S = P_{-\varepsilon} \cdots P_{-2} P_{-1} P_{0} P_{+1} P_{+2} \cdots P_{+\varepsilon} \quad (1)
P_0 signifies the central amino acid in PTM site recognition studies, which aim to determine whether this amino acid undergoes modification. For example, in methylation recognition studies, P_0 represents either lysine or arginine. P_{-\varepsilon} \cdots P_{-2} P_{-1} denote the \varepsilon upstream amino acids from the central amino acid P_0, whereas P_{+1} P_{+2} \cdots P_{+\varepsilon} denote the \varepsilon downstream amino acids. Therefore, the length of the peptide is 2\varepsilon + 1. It is customary to use ‘-’ or ‘X’ as placeholders when there are insufficient upstream or downstream amino acids.
The lengths of the peptides used in PTM site studies vary from one research project to another. Lai et al. [33] developed an auto-machine learning method to predict lysine lactylation sites using 51-residue peptides. Wei et al. [34] used 11 residues to predict methylation sites. Li et al. [35] used peptides of 31 amino acids. Sua et al. [36] trained on 27-residue sequence segments and tested 20 of them. Lyu et al. [37] segmented proteins into 35-residue segments with cysteine at the center. Auliah et al. [38] used a local sliding window of 57 residues to predict pupylation sites. Bao et al. [39] created 27-tuple peptides for K-PTMs. The peptide lengths available in dbPTM and PhosphoSitePlus are 21 and 15 residues, respectively. The choice of peptide length may affect the prediction results, leading researchers to explore peptides of various lengths. Khalili et al. [40] investigated window sizes ranging from 7 to 35 and found that a window size of 13 yielded the best performance in their models.
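As an illustration, the windowing scheme of Equation (1) can be sketched in Python; the sequence and window size below are arbitrary examples:

```python
def extract_peptide(sequence, center, epsilon, pad="-"):
    """Extract the (2*epsilon + 1)-residue peptide centered on `center`,
    padding with a placeholder when the window runs past either end."""
    window = []
    for i in range(center - epsilon, center + epsilon + 1):
        window.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(window)

# Windows around every lysine (K) in a toy sequence.
seq = "MKRLSKQALK"
windows = [extract_peptide(seq, i, 3) for i, aa in enumerate(seq) if aa == "K"]
```

For example, the lysine at position 1 yields the 7-residue peptide `--MKRLS`, with two leading placeholders because only one upstream residue exists.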

2.2.2. Sequence Redundancy

CD-HIT, proposed by Li et al. [41], employs a clustering approach to identify and remove similar protein sequences from a dataset and has been widely used in PTM-site studies to eliminate homologous protein sequences and redundant residue fragments. Some studies have aimed to remove redundant full-length protein sequences [35,42], whereas others have focused on eliminating redundant fragmented residue sequences [43,44].
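CD-HIT itself relies on fast short-word filtering rather than all-versus-all comparison; purely as a conceptual sketch, greedy redundancy removal by pairwise identity can be written as:

```python
def identity(a, b):
    """Fraction of identical residues between two aligned fragments."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def remove_redundant(fragments, threshold=0.4):
    """Greedy sketch of redundancy removal: keep a fragment only if its
    identity to every fragment kept so far is below the threshold.
    (CD-HIT's actual algorithm is much faster; this is illustrative.)"""
    kept = []
    for frag in fragments:
        if all(identity(frag, k) < threshold for k in kept):
            kept.append(frag)
    return kept

frags = ["AKRLSKQ", "AKRLSKQ", "AKRLTKQ", "GHWPCDE"]
nonredundant = remove_redundant(frags, threshold=0.8)
```

Here the exact duplicate and the near-duplicate (6/7 identity) are discarded, leaving the first fragment and the dissimilar one.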

2.2.3. Selected Reliable Negative Sequences

Although databases commonly provide information on modification sites, they do not provide information on non-modification sites. Peptides without annotation (known as unlabeled data) fall into two groups: sites that are genuinely not modification sites, and sites whose modification status still requires experimental verification. Thus, while the positive sequences are reliable, the negative data may be problematic. The modification site prediction problem can therefore be described as a positive-unlabeled (PU) learning problem, and two kinds of methods are used to address it.
The first treats segments without modification annotations as negative samples [33]. This approach is simple and manageable, and several studies have used it to develop models; however, it ignores the possibility that such sites are in fact modified. Gao et al. [45] established three criteria for identifying non-phosphorylated sites: (1) the fragment must not be labeled as a positive site, (2) the fragment should be within a sequence containing a positive site, and (3) the negative site must be solvent-inaccessible. In [40,46], the unannotated residues from proteins with a minimum of three confirmed positive sites were considered negative sites.
Another approach builds models from limited positive data and large amounts of unlabeled data. Ning et al. [47] used semi-supervised learning and a support vector machine (SVM) to select reliable negative data. Jiang et al. [48] introduced the PUL-PUP algorithm to acquire negative data. PUL-PUP first uses similarity to identify unlabeled data that are distant from the positive data, and then iteratively trains an SVM to expand the set of reliable negatives.
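The PU-learning idea of seeding with distant unlabeled samples and then iteratively expanding the negative set can be sketched as follows; a nearest-centroid rule stands in for the iteratively trained SVM of PUL-PUP, and the data are toy feature vectors:

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def select_reliable_negatives(positives, unlabeled, seed_frac=0.3, rounds=3):
    """PU-learning sketch: seed negatives with the unlabeled samples farthest
    from the positive centroid, then iteratively add unlabeled samples that
    fall closer to the negative centroid than to the positive one."""
    pc = centroid(positives)
    ranked = sorted(unlabeled, key=lambda u: dist(u, pc), reverse=True)
    negatives = ranked[: max(1, int(seed_frac * len(unlabeled)))]
    pool = [u for u in unlabeled if u not in negatives]
    for _ in range(rounds):
        nc = centroid(negatives)
        newly = [u for u in pool if dist(u, nc) < dist(u, pc)]
        if not newly:
            break
        negatives += newly
        pool = [u for u in pool if u not in newly]
    return negatives

positives = [(0.0, 0.0), (0.2, 0.1)]
unlabeled = [(0.1, 0.0), (1.0, 1.0), (0.9, 1.1), (0.5, 0.6)]
negatives = select_reliable_negatives(positives, unlabeled)
```

The unlabeled point adjacent to the positives is left in the pool, while the distant points are progressively absorbed as reliable negatives.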

2.2.4. Balanced Dataset

In PTM studies, the amount of positive data is typically much smaller than that of negative data, leading to class imbalance. Imbalanced datasets may have detrimental effects on model training. To mitigate this, data-based, algorithm-based, and hybrid methods have been proposed.
(1)
Data-based methods
Data-based methods can be divided into oversampling [49], undersampling [50], and hybrid sampling [51,52,53]. Random oversampling (ROS) randomly duplicates minority samples to create a balanced dataset. The synthetic minority oversampling technique (SMOTE) [49] generates new samples by interpolating between each minority-class sample and its k nearest minority neighbors and is widely used to address imbalanced datasets, for example in SulSite-GTB [54]. Adaptive synthetic sampling (ADASYN) [50] is an advanced method that generates more samples for minority instances that are harder to learn. Balanced datasets can also be obtained through undersampling, which reduces the number of majority samples in the dataset. Random undersampling (RUS) randomly removes majority-class samples and is widely used for PTM prediction. NearMiss [55] selects the subset of majority-class samples closest to minority-class samples as representatives. ENN (edited nearest neighbors) [56] is a KNN-based method that removes samples whose class disagrees with the majority of their nearest neighbors, cleaning noisy samples near the class boundary. Hybrid sampling methods, such as SMOTETomek [57,58] and SMOTEENN, combine oversampling and undersampling.
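A minimal SMOTE-style interpolation over toy 2D feature vectors might look like this (an illustrative sketch, not the reference implementation):

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: each synthetic sample lies on the segment
    between a random minority sample and one of its k nearest minority
    neighbors (squared Euclidean distance over toy feature vectors)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote(minority, 5)
```

Each synthetic point stays between its two parent samples, so the minority region is densified without simple duplication.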
(2)
Algorithm-based methods
Algorithm-based methods solve imbalanced datasets by shifting their focus toward minority class samples, including cost-sensitive learning [59], ensemble learning [60,61], and one-class classification [61,62]. Cost-sensitive learning uses cost functions to build classifiers by minimizing the misclassification cost and adjusting the cost of misclassified samples based on a cost matrix [63,64,65]. Commonly used methods include weighted SVM [66], fuzzy SVM [67,68], and cost-sensitive neural networks [69]. Ensemble learning can reduce the bias caused by a single learner and enhance the model efficiency. Bagging, boosting, stacking and hybrid models are the most representative ensemble learning models. RUSBOOST [70] trains weak classifiers by constructing multiple balanced datasets using a combination of random undersampling (RUS) and AdaBoost. Jia et al. [71] used an ensemble method to predict the O-GlcNAcylation sites. One-class learning, or novelty detection, is a useful approach for handling significant imbalances between positive and negative samples. This technique builds models for the minority class, for example, one-class SVM [72].
(3)
Hybrid-based methods
Building on the aforementioned strategies, researchers have proposed hybrid methods that combine different balancing techniques to further enhance model performance. For instance, Islam et al. [73] used undersampling and K-nearest neighbor (KNN) to balance the data. In [70], a combination of RUS and AdaBoost was used to balance the dataset. Reference [51] implemented hybrid sampling and a bagging classifier to address dataset imbalance.

2.2.5. Data Splitting

To train a model and objectively assess its generalization capability, the dataset must be divided into training and test sets according to specific guidelines. The holdout method, a widely used and straightforward approach, involves splitting data according to a predefined ratio, such as 70% for training and 30% for testing. Alternatively, a temporal partitioning strategy can be employed using data collected before 2022 for training and data from 2022 to 2025 for testing. Regardless of the chosen method, it is crucial to ensure that the training and test sets have similar data distributions and that no test data are included in the training set.
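A stratified holdout split that preserves the class ratio in both subsets can be sketched as follows; the 70/30 ratio follows the example above:

```python
import random

def stratified_holdout(samples, labels, test_ratio=0.3, seed=42):
    """Holdout split that preserves the class ratio in both subsets,
    so training and test sets share a similar label distribution."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(test_ratio * len(idx)))
        test += [(samples[i], cls) for i in idx[:cut]]
        train += [(samples[i], cls) for i in idx[cut:]]
    return train, test

# 20 positive and 80 negative toy samples, split 70/30 per class.
samples = list(range(100))
labels = [1] * 20 + [0] * 80
train, test = stratified_holdout(samples, labels)
```

Shuffling within each class before cutting keeps the two subsets disjoint while retaining the original 1:4 positive-to-negative ratio in both.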

3. Feature Engineering

Feature engineering mainly includes two parts: feature extraction and feature dimensionality reduction. This section separately introduces the application of these two components in PTM prediction.

3.1. Feature Extraction

Machine learning models cannot directly process raw sequence data; researchers must therefore design algorithms for feature extraction. Feature extraction encodes sequences into numerical features suitable for machine learning models. Feature extraction methods fall into four categories: (1) sequence-based, (2) physicochemical-based, (3) annotation-based, and (4) deep learning-based.

3.1.1. Sequence-Based Feature

Sequence-based methods commonly use the composition, position, and other relevant information of amino acids to obtain a numerical representation of protein sequences. In these methods, the placeholder (‘-’ or ‘X’) is treated as an additional amino acid type, so sequence segments contain 21 possible residue symbols.
Amino acid composition (AAC) [74,75,76,77] is a common feature extraction method based on the frequency of amino acids in the sequence segments; it yields 21 features, covering the 20 amino acids and one placeholder. The composition of k-spaced amino acid pairs (CKSAAP) uses the frequency of amino acid pairs separated by k spaces to represent the protein sequence, where the value of k is set by the researcher; CKSAAP has 441 features (from AA and AC through XX) when k is 0. One-hot encoding [76] is commonly used in deep learning, where each amino acid is converted into a vector of length 21, so a protein fragment of length L is depicted as a two-dimensional (2D) matrix of size L × 21. Machine learning methods themselves can also be used to extract features; for example, the K-nearest neighbor (KNN) score [78,79,80] has been used to characterize fragments.
In addition to the methods discussed above, several other sequence-based feature extraction methods exist, for instance, the conjoint triad descriptor (CTriad) [77,81], dipeptide composition (DPC) [77,82], amino acid pair composition (AAPC) [74,76,83], pair potential [45,84], four-body statistical pseudo-potential [45,85], local structural entropy [45,86], information on proximal PTMs [47], position-specific amino acid propensity (PSAAP) [47,87,88], enhanced amino acid pair (EAAC), enhanced group amino acid pair (EGAAC) [76,89], and position weight amino acid composition (PWAAC) [54] descriptors.
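To make the most common encodings above concrete, AAC and one-hot encoding can be sketched as follows; the 21-symbol alphabet includes the placeholder ‘-’:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"  # 20 residues plus one placeholder

def aac(fragment):
    """Amino acid composition: frequency of each of the 21 symbols."""
    return [fragment.count(a) / len(fragment) for a in AMINO_ACIDS]

def one_hot(fragment):
    """One-hot encoding: an L x 21 binary matrix, one row per residue."""
    return [[1 if a == b else 0 for b in AMINO_ACIDS] for a in fragment]

features = aac("--MKRLS")    # 21 frequencies summing to 1
matrix = one_hot("--MKRLS")  # 7 x 21 matrix, one 1 per row
```

AAC discards positional information in favor of composition, whereas one-hot keeps the position of every residue; this trade-off is one reason the two are often combined.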

3.1.2. Physicochemical-Based Feature

The physicochemical properties of amino acids directly influence their interactions with enzymes, their behavior in three-dimensional protein structures, and their suitability as substrates for modification. Based on these properties, the 20 amino acids can be regarded as functional units, each exhibiting distinct characteristics. Each property is quantified for every amino acid using well-established scales. For physicochemical properties, placeholders (‘-’ or ‘X’) are either disregarded or assigned a default value of 0.5.
The AAindex database, comprising 566 amino acid properties [25,77,90,91,92], can be integrated with computational techniques such as gray models, principal component analysis, and clustering to enhance feature extraction efficiency. The secondary structure (SS) [77,93] is determined using SPIDER2, which converts fragments into a 63-dimensional vector containing probability scores for α-helix, β-strand, and coil for each amino acid (with placeholders). The composition, transition, and distribution (CTD) method [77,94,95] introduced by Dubchak et al. categorizes the 20 amino acids into three groups for each of eight properties, ultimately transforming the fragments into a 188-dimensional vector.
Some commonly used feature extraction methods include backbone torsion angles (BTAs) [77,91,96], accessible surface area (ASA) [45,77,97,98,99], physicochemical properties (PCPs), other binding sites for any chemical groups [91], positively charged amino acid composition (PCAAC), disordered regions predicted by DISOPRED2 [91,100], BioJava [91,101], disorder [45,102], gray pseudo amino acid composition [47], and encoding based on grouped weight (EBGW) [54].
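Mapping residues to a property scale can be sketched as follows; the hydropathy values listed are a small subset of the Kyte-Doolittle scale (treat the exact numbers as assumptions for this sketch), and the 0.5 placeholder default follows the convention noted above:

```python
# Illustrative hydropathy values (subset of the Kyte-Doolittle scale).
HYDROPATHY = {"A": 1.8, "K": -3.9, "L": 3.8, "R": -4.5, "S": -0.8}

def physico_features(fragment, scale=HYDROPATHY, default=0.5):
    """Map each residue to its property value; placeholders ('-'/'X')
    and residues missing from the scale fall back to a default."""
    return [scale.get(a, default) for a in fragment]

vec = physico_features("-AKLS")
```

Several such per-residue property tracks (hydropathy, charge, volume, and so on) are typically concatenated into one feature vector per fragment.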

3.1.3. Annotation-Based Feature

Protein annotation typically encompasses basic, structural, and functional details, as well as other pertinent information. These data aid in comprehending the structure, function, and significance of proteins within organisms and are frequently used to describe protein fragments.
There are numerous annotation-based methods, such as the position-specific scoring matrix (PSSM) [73,74,76], evolutionary-based profile bigrams [73,103,104,105], gene ontology (GO) [91,106], InterPro [91,107], KEGG [91,108], Pfam [91,109], STRING [91,110], functional domain [91], active site [91], natural variants [91], BLOSUM62 scoring matrix (B62) [74,111], evolutionary conservation score [45,112,113], and pseudo-position-specific scoring matrix (PsePSSM) [54,114].
PSSM is the most popular of these methods. A PSSM is a 20 × L matrix, where L represents the length of the fragment. Each column corresponds to a residue position in the protein sequence, and each row represents one of the 20 possible amino acids. Element (i, j) of the matrix represents the probability or score of the j-th position in the protein sequence being substituted by the i-th amino acid during evolution. This score typically reflects the degree of conservation and the preference for a specific amino acid at that position.
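Turning a PSSM into a fixed-length feature vector for a peptide window can be sketched as follows; here the profile is stored position-by-position (i.e., transposed to L × 20), and out-of-range positions are zero-padded:

```python
def pssm_window_features(pssm, center, epsilon):
    """Flatten the 20-value PSSM columns in a window of 2*epsilon + 1
    positions around `center` into one feature vector. Positions beyond
    the sequence ends contribute all-zero columns. `pssm` is a list of
    per-position score lists (L x 20)."""
    features = []
    for i in range(center - epsilon, center + epsilon + 1):
        if 0 <= i < len(pssm):
            features.extend(pssm[i])
        else:
            features.extend([0.0] * 20)
    return features

toy_pssm = [[float(i)] * 20 for i in range(5)]  # 5-residue toy profile
vec = pssm_window_features(toy_pssm, 0, 2)      # (2*2 + 1) * 20 = 100 values
```

The resulting vector has a fixed length of (2ε + 1) × 20 regardless of where the window falls, which is what downstream classifiers require.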

3.1.4. Deep Learning-Based Feature

Deep learning is extensively employed in PTM site recognition research, serving dual purposes: as a classifier for prediction and as a tool for extracting features from the network structures. This section focuses on its role as a feature extraction tool.
Convolutional neural networks (CNNs) extract features and reduce dimensionality via convolutional and pooling layers, whereas recurrent neural networks (RNNs) capture the sequence context. Some studies have integrated these approaches into hybrid feature extraction methods. Natural language processing (NLP) has developed rapidly in recent years, and protein sequences resemble natural language in several respects: both are sequential, and both carry contextual information. Several studies have therefore used language models to extract protein fragment features. Bidirectional encoder representations from transformers (BERT) models are typically used to predict PTM sites. Alkuhlani et al. [115] used six protein language models, ProtBERT-BFD [116], ProtBERT [116], ProtALBERT [116], ProtXLNet [116], ESM-1b [117], and TAPE [118], built on BERT [119], ALBERT [120], and XLNet [121], to identify PTM sites. Qiao et al. used BERT to build a novel predictor, BERT-Kcr, for protein Kcr site prediction [10]. Lyu et al. [37] used word embedding to encode protein fragments, whereas Wang et al. [122] predicted plant ubiquitination using the word2vec feature extraction method. PTM is intrinsically linked to enzymes, and enzyme-substrate relationships can be deduced from the physical and chemical properties of modification sites through feature engineering. Deep-PLA [123] has demonstrated the effective integration of enzyme-specific constraints into deep neural network architectures, offering a relevant case study for this topic.
Sequence-based features primarily leverage the relative positions and compositional information of amino acids in protein sequences. Although these features provide high interpretability, they often fail to capture the physicochemical properties of amino acids adequately. Physicochemical property features, such as the AAindex, are generally fixed values that are determined beforehand. These features do not reflect the actual state of an amino acid within the three-dimensional structure of a specific protein or cellular environment. In reality, amino acids interact, work together, and produce powerful synergistic effects, and fixed values cannot represent them because they are nonlinear, dynamic, and complex. Annotation-based features offer an alternative perspective for studying PTM. However, this method relies heavily on homologous sequences. Annotation-based features are unreliable for novel proteins without known homologous sequences. Additionally, regions of high evolutionary conservation do not always correspond to modification sites, whereas some modification sites may lack high conservation. Sequence features extracted using deep learning are typically derived from one-dimensional amino acid sequences. This generation process resembles a black box, limiting the biological interpretability of these features. Because these features are primarily based on sequence data, they fail to effectively incorporate the three-dimensional structural information of proteins in real space. Although some studies, such as the LkaM-PTM [124] model, have begun to integrate structural information into feature construction, enabling deep learning-based sequence features to accurately reflect spatial structures remains challenging. In addition to the methods mentioned above, prediction methods based on modern protein structures (such as AlphaFold2 [125]) also deserve further investigation.
Therefore, integrating multiple feature types has become the mainstream method for overcoming the limitations of single-feature methodologies. Studies typically perform dimensionality reduction and feature selection on high-dimensional feature sets to identify the most informative features from a vast feature space, thereby constructing a compact yet highly discriminative feature subset.

3.2. Feature Reduction

The previous section introduced four types of feature extraction methods. Several studies have used multiple feature extraction methods to obtain comprehensive feature sets; however, combining them often yields an excessively high-dimensional feature space. Redundant and nonessential features can reduce the efficiency of the predictor. Feature reduction retains important features and removes those with lower importance, so the final feature vector has lower dimensionality and higher relevance. Feature-reduction approaches can be divided into two types: feature selection, which only reduces the number of features, and feature transformation (dimensionality reduction), which decreases complexity by transforming existing features. Both are commonly used to optimize feature sets in modification-site prediction studies.
Auliah et al. [38] used the chi-squared test to perform feature selection. The chi-squared test, a widely used hypothesis-testing method, examines whether two variables are independent of each other. Maximal-relevance maximal-distance (MRMD) has been used to rank feature importance [34]. Li et al. [126] combined analysis of variance (ANOVA) with incremental feature selection (IFS) to find the most informative feature subset. Minimum redundancy maximum relevance (mRMR) [127,128,129,130] is often used to select optimal features from the entire feature set. He et al. [131] proposed a feature selection method called MRMD3.0 that comprises two steps: the first combines nine feature-ranking methods (tree importance, ANOVA, variance threshold, chi-squared, linear model method, mutual information, mRMR, MRMD, and recursive feature elimination) with four link analysis strategies (PageRank, TrustRank, LeaderRank, and HITS); the second uses IFS to select the best feature subset. Ensemble methods are also used for feature selection; Yu et al. [132] selected features using XGBoost [133]. Principal component analysis (PCA) is widely used to reduce data dimensionality by describing a high-dimensional feature set with fewer composite features. Singular value decomposition (SVD) is another standard method for feature dimensionality reduction in the literature.
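The ranking-plus-IFS procedure described above can be sketched with a simple Fisher-style feature score; the scoring function and the evaluation callback are illustrative assumptions, not the exact methods of the cited studies:

```python
import statistics

def f_score(pos_vals, neg_vals):
    """Fisher-style score for one feature: squared mean gap over the
    sum of within-class variances (illustrative ranking criterion)."""
    gap = statistics.mean(pos_vals) - statistics.mean(neg_vals)
    spread = statistics.pvariance(pos_vals) + statistics.pvariance(neg_vals)
    return gap ** 2 / (spread + 1e-9)

def incremental_feature_selection(X_pos, X_neg, evaluate):
    """IFS sketch: rank features by score, then grow the subset one
    feature at a time and keep the subset that `evaluate` rates best."""
    n = len(X_pos[0])
    scores = [
        f_score([x[j] for x in X_pos], [x[j] for x in X_neg]) for j in range(n)
    ]
    ranked = sorted(range(n), key=lambda j: scores[j], reverse=True)
    best_subset, best_val = [], float("-inf")
    for k in range(1, n + 1):
        val = evaluate(ranked[:k])  # e.g., cross-validated accuracy
        if val > best_val:
            best_subset, best_val = ranked[:k], val
    return best_subset
```

In practice `evaluate` would be a cross-validated classifier score; the key point is that candidate subsets are nested prefixes of the ranked feature list.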
Unlike in traditional machine learning, deep learning does not typically involve an explicit, standalone process for feature dimensionality reduction. However, certain intrinsic components of deep learning models, such as pooling layers and dropout layers, inherently perform functions similar to feature selection and dimensionality reduction while executing their primary roles, such as downsampling or mitigating overfitting.

4. Classifiers

This section separately discusses how machine learning and deep learning methods are applied to PTM site identification.

4.1. Machine Learning

Support vector machines (SVMs) separate protein fragments by creating an optimal hyperplane for classification, which is particularly effective for small sample data. Xu et al. [134] used SVM to identify protein lysine glycation from sequences. Bao et al. [39] used an SVM and multilayer neural networks to predict various PTM sites. Auliah et al. [38] used multiple classifiers to assess the recognition efficiency of PUP-fusion. Decision trees (DTs) use if-then rules for protein fragment classification. K-nearest neighbor (KNN) predicts unknown protein fragments based on the nearest samples. Ning et al. [135] used KNN as a classifier to identify formylation sites and explored the impact of different values of K on the experimental results. Artificial neural networks (ANNs) are popular classifiers in bioinformatics that mimic the structures and functions of biological neural networks [136]. Several studies have used ANNs to predict PTM sites [137,138].
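A minimal KNN classifier over fragment feature vectors, in the spirit of the predictors above, might look like this (toy data and distance choice are illustrative):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """KNN sketch: label a query feature vector by majority vote among
    the k training samples closest in squared Euclidean distance.
    `train` is a list of (feature_vector, label) pairs."""
    by_dist = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train_data = [
    ((0.0, 0.0), "neg"), ((0.0, 1.0), "neg"), ((1.0, 0.0), "neg"),
    ((5.0, 5.0), "pos"), ((5.0, 6.0), "pos"), ((6.0, 5.0), "pos"),
]
label = knn_predict(train_data, (5.2, 5.1))
```

The choice of k trades bias against variance, which is precisely what Ning et al. explored when varying K.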
Ensemble methods can enhance prediction accuracy and can be categorized into three types: bagging, boosting, and stacking. Bagging resembles a voting process: multiple base classifiers each predict a protein fragment's class, and the final class is determined by the most frequent prediction. Random forest (RF) [139,140] is a representative bagging algorithm commonly used for PTM site prediction; Hasan et al. [141] used RF to predict S-sulfenylation sites. The cascade forest [142,143] uses a layered approach comprising multiple forest structures, where the input of each layer is derived from the output feature information of the preceding layer. This methodology facilitates incremental feature extraction and allows adaptive adjustment of model complexity. Qian et al. [144] proposed a novel predictor, SUMO-forest, based on cascade forests. Boosting refines the base learner by adjusting the data sample weights and ultimately determines fragment classes through weighted voting. Gradient tree boosting (GTB) [145,146] is a popular boosting method that combines multiple DTs, offers excellent performance, and has been applied in multiple fields. Wang et al. [54] proposed SulSite-GTB to predict S-sulfenylation sites based on GTB. Stacking uses multiple base learners to classify protein fragments and then feeds their outputs as features into another learner that produces the final classification. He et al. [147] used a stacking ensemble in which the base learners were convolutional neural networks with different specifications. In addition to the aforementioned ensemble methods, other hybrid methods have been developed; Zhang et al. [148] integrated five classifiers, RF, SVM, GBDT, KNN, and logistic regression, to predict lysine malonylation sites.
In PTM site prediction, machine learning methods offer several advantages, including lightweight architectures, fast training, and the ability to produce effective models from limited data when supported by high-quality feature engineering. ML methods also retain strong interpretability in the decision-making process. However, these methods rely on researcher-crafted feature sets, such as physicochemical properties and evolutionary information derived from protein sequences. This manual feature extraction paradigm may fail to capture the complex latent patterns in sequences.

4.2. Deep Learning

CNNs consist of convolutional, pooling, and fully connected layers. The convolutional layer extracts features from the input data, the pooling layer selects among these features, and the fully connected layer classifies unknown protein fragments. Wang et al. [149] used a CNN to predict multiple PTMs. Zhao et al. [150] used a CNN to predict Kcr sites. CNN-SuccSite [76] was developed as a CNN model for predicting lysine succinylation sites and comprises an input layer, two convolutional layers, two max-pooling layers, two fully connected layers, and an output layer. Wei et al. [151] created a one-dimensional (1D) CNN to predict Kcr sites. In contrast, RNNs are well suited to sequential data because they handle sequences of varying lengths and capture temporal dependencies. RNNs provide rich contextual information and include two common variants: long short-term memory (LSTM) [152] and gated recurrent units (GRUs). Lyu et al. [37] constructed a five-layer LSTM model, comprising an input layer, word embedding layer, LSTM layer, dense layer, and output layer, to predict cysteine S-sulphenylation sites. Li et al. [153] proposed an LSTM-based transfer learning model to predict lysine propionylation, whereas Mul-SNO [154] combined bidirectional long short-term memory (BiLSTM) and bidirectional encoder representations from transformers (BERT) to predict S-nitrosylation sites. Yu et al. [155] used a CNN-LSTM hybrid network for feature extraction and prediction. Some studies have also integrated deep learning with machine learning to enhance prediction efficiency: Ning et al. [156] combined a four-layer DNN with penalized logistic regression for succinylation site prediction, and PROSPECT [157], proposed by Chen et al., integrates two CNNs and an RF to predict phosphorylation sites.
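The convolution and pooling operations that these layers perform can be illustrated with a minimal numpy sketch. The signal and kernel values below are toy numbers; real predictors learn the kernel weights during training and stack many such filters.

```python
# Minimal numpy illustration of the two core CNN operations: a 1D
# convolution over an encoded sequence, followed by max-pooling.
import numpy as np

def conv1d(signal, kernel):
    """Valid 1D cross-correlation: slide the kernel along the sequence."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool(feature_map, size=2):
    """Non-overlapping max-pooling: keep the strongest response per window."""
    trimmed = feature_map[: len(feature_map) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

# Toy per-residue feature track (e.g., a hydrophobicity-like scalar).
signal = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.7, 0.3, 0.6])
kernel = np.array([1.0, -1.0, 1.0])   # fixed, illustrative filter

features = conv1d(signal, kernel)     # length-6 feature map
pooled = max_pool(features, size=2)   # length 3 after pooling
```

The pooled responses would then feed a fully connected layer that outputs the class score for the candidate site.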
The Transformer [158] is a deep learning model built on a self-attention mechanism. Its primary innovation is the ability to capture global dependencies among all elements of a sequence through parallel computation, which enhances the model's capacity to contextualize information within that sequence. The Transformer architecture comprises an encoder and a decoder, with each layer featuring multi-head self-attention and feed-forward neural networks. This design enables the model to dynamically assess the significance of each amino acid in the input sequence. Meng et al. [159] proposed TransPTM, a Transformer-based neural network for predicting non-histone acetylation sites. Liang et al. [160] proposed DeepMM-Kcr, an effective model based on multiple features and a multi-head self-attention mechanism.
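The scaled dot-product self-attention at the heart of the Transformer can be sketched in a few lines of numpy. The projection matrices below are random stand-ins for learned parameters, and a single head is shown for clarity.

```python
# Numpy sketch of scaled dot-product self-attention: every position
# attends to every other position of the sequence in parallel.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])   # pairwise residue affinities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 7, 16                     # e.g., a 7-residue window
X = rng.normal(size=(seq_len, d_model))      # embedded residues
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)    # out: (7, 16), attn: (7, 7)
```

Multi-head attention simply runs several such projections in parallel and concatenates the outputs; the attention weights themselves are often inspected as a rough interpretability signal.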
The core idea of transfer learning is to apply knowledge, such as model parameters and feature representations, acquired from one task (the source task) to a related but distinct task (the target task), thereby improving learning efficiency and performance on the new task. For some modification sites, the available data may be too limited to provide sufficient annotated examples for training high-performance models; transfer learning can effectively address this challenge. Xu et al. [161] developed DTL-NeddSite, a convolutional neural network-based predictor that leverages deep transfer learning and one-hot encoding: the model was first trained on a large dataset of lysine PTM sites and then fine-tuned on neddylation site data to construct the target model. Soylu et al. [162] developed the DEEPPTM model by integrating a protein embedding approach using ProtBERT with an attention-based Vision Transformer (ViT) to enhance modification prediction accuracy and elucidate the relationships between modification types and protein sequences.
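The pre-train/fine-tune pattern can be approximated on a small scale with scikit-learn's `warm_start` option; this is a hedged analogue on synthetic data, not the architecture of any cited predictor, and real systems fine-tune deep networks on sequence encodings.

```python
# Hedged analogue of pre-training then fine-tuning: with warm_start=True,
# a second call to fit() continues from the previously learned weights.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Abundant "source" task (e.g., generic lysine PTM sites) and a small
# "target" task (e.g., a rare modification type). Data are synthetic.
X_src, y_src = make_classification(n_samples=500, n_features=30, random_state=1)
X_tgt, y_tgt = make_classification(n_samples=40, n_features=30, random_state=2)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                    warm_start=True, random_state=0)
mlp.fit(X_src, y_src)        # "pre-training" on the source task
mlp.fit(X_tgt, y_tgt)        # "fine-tuning" continues from learned weights
pred = mlp.predict(X_tgt)
```

In deep learning frameworks the same idea is usually realized by loading pre-trained weights and freezing or re-training selected layers.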
Deep learning can learn hierarchical feature representations directly from raw protein sequences in an end-to-end manner, eliminating the labor-intensive manual feature engineering that relies on prior protein knowledge. DL methods often achieve higher prediction accuracy when supported by large-scale datasets. However, they also have notable limitations: the models typically require large amounts of annotated data and substantial computational resources. Effectively integrating multi-source heterogeneous biological data, such as gene expression profiles, protein-protein interaction networks, and evolutionary information, remains a key challenge in enhancing model generalization. Furthermore, designing neural network architectures that capture long-range dependencies in sequences while incorporating structural spatial information is critical in deep learning research. The integration of attention mechanisms with graph neural networks is a promising approach for preserving the topological relationships between sequences and structures when modeling residue interactions, and pre-training strategies that learn general representations from unlabeled data can mitigate the scarcity of annotated samples. Finally, incorporating interpretability techniques to explore the biologically significant regions highlighted by a model aids in elucidating the underlying molecular mechanisms, thereby advancing deep learning from ‘black-box’ prediction to mechanistic discovery.

5. Measurement

The prediction of PTM sites is a binary classification problem. Modified sites are divided into positive and unlabeled data, and the unlabeled data are typically treated as negative samples. Researchers commonly use accuracy (ACC), the Matthews correlation coefficient (MCC), the F-measure, sensitivity (Sn), specificity (Sp), and the area under the receiver operating characteristic curve (AUC) to assess classifier performance. The formulas for these metrics are as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
$$F_1 = \frac{2 \times \frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}{\frac{TP}{TP + FP} + \frac{TP}{TP + FN}}$$
$$S_n = \frac{TP}{TP + FN}$$
$$S_p = \frac{TN}{TN + FP}$$
where TP denotes the number of correct classifications in the positive dataset, TN the number of correct classifications in the negative dataset, FN the number of false negatives, and FP the number of false positives.
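These metrics follow directly from the confusion-matrix counts, as the short sketch below shows; the example counts are arbitrary.

```python
# Direct implementation of the evaluation formulas from TP/TN/FP/FN counts.
import math

def metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc_den = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fn * fp) / mcc_den if mcc_den else 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # sensitivity (Sn)
    f1 = 2 * precision * recall / (precision + recall)
    sp = tn / (tn + fp)                           # specificity
    return {"ACC": acc, "MCC": mcc, "F1": f1, "Sn": recall, "Sp": sp}

# Arbitrary example: 80 true positives, 90 true negatives,
# 10 false positives, 20 false negatives.
m = metrics(tp=80, tn=90, fp=10, fn=20)   # ACC = 0.85, Sn = 0.8, Sp = 0.9
```

Note that MCC stays informative even when the two classes are very unbalanced, which is why it is widely reported alongside ACC.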
The receiver operating characteristic (ROC) curve was originally used in radar-signal detection to differentiate signals from noise and was later adopted for model evaluation. The horizontal axis of the ROC curve represents the false positive rate (FPR), and the vertical axis represents the true positive rate (TPR). Because comparing entire curves can be difficult, the area under the curve (AUC) serves in practice as a single measure of model performance and is particularly useful for unbalanced data.
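As an illustration, ROC points and the AUC can be computed from predicted scores with scikit-learn; the labels and scores below are toy values.

```python
# Sketch of ROC/AUC computation from classifier scores. Because AUC sweeps
# over all thresholds, it does not depend on any single decision cutoff.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.7, 0.8, 0.45, 0.9, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # here 23/24 ≈ 0.958
```

Equivalently, AUC is the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one.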

6. Summary of Predictors

With advancements in machine and deep learning, there has been a surge in research focusing on predicting PTM sites. Table 2 summarizes various studies, including PTM types, datasets, window sizes, feature extraction methods, classifiers, results, and web servers.
According to Table 2, combined with the results obtained from keyword searches in the Web of Science, we generated Figure 3. This figure illustrates the percentage distribution of research papers published over the past five years, focusing on 15 types of modification sites. As illustrated in Figure 3, there is a significant imbalance in the research focus across various modification sites. Most studies focused on a limited range of modification types, including phosphorylation, ubiquitination, acetylation, and glycosylation, which together represent approximately 61.8% of the published literature. Conversely, other modification sites, notably propionylation and formylation sites, have received minimal attention.
Figure 4 illustrates the percentage distribution of various feature extraction methods in the existing literature, highlighting their relative importance. Traditional feature extraction methods continue to dominate, with sequence composition features (e.g., AAC and CKSAAP) and evolutionary information features (e.g., PSSM) being the most prevalent. Furthermore, features based on the physicochemical properties of amino acids (e.g., AAindex) have been widely employed in numerous models. In recent years, embedding methods that utilize pretrained language models, such as BERT, have demonstrated significant potential, positioning deep-learning-based feature extraction as a current research focus. Most models strategically combine multiple feature types to comprehensively capture diverse aspects of sequence information. Although some studies have incorporated certain structure-related features (such as ASA and SS), research that directly utilizes three-dimensional structural information remains limited.
As shown in Table 2, deep learning exhibits exceptional performance in various PTM prediction tasks. Models utilizing deep neural network architectures, such as CNN, LSTM, and transformers, have been used to predict emerging modification types, including crotonylation and lactylation, achieving a peak ACC of 96.9%. In contrast, traditional machine learning methods, while remaining somewhat competitive in predicting classical modification types, such as phosphorylation, offer advantages in training efficiency and model interpretability. However, their performance is often limited by the quality of manual feature engineering. Notably, current research reveals a significant technological convergence trend, particularly through the integration of learning strategies that combine features extracted by deep learning with traditional machine learning classifiers. This approach not only sustains high performance but also enhances the model’s generalization capabilities. Such systematic comparisons provide empirical evidence for method selection in the field of PTM prediction and suggest that future research should concentrate on developing novel computational frameworks that achieve both high predictive accuracy and strong biological interpretability.

7. Challenges and Future Directions

Significant progress has been made in identifying PTM sites using machine learning and deep learning techniques. However, several noteworthy aspects warrant further investigation.

7.1. Data Limitations

Machine and deep learning studies of PTM sites require large, accurate datasets for model training. As previously mentioned, there are three main issues concerning modification site data: (1) the absence of completely reliable negative examples; (2) the significant class imbalance present in the datasets; and (3) the underrepresentation of certain modification types in current databases. For example, the dbPTM database lists only 194 O-palmitoleoylation sites, including both experimentally validated and predicted instances, which is insufficient to meet the training requirements of machine or deep learning models.
Several studies have proposed effective solutions to these issues. For example, Wen et al. [178] constructed the PTMAtlas database, which comprises 397,524 PTM sites, and developed the DeepMVP model, which outperforms existing tools across multiple PTM types. The construction of comprehensive and representative datasets remains a critical challenge in PTM recognition. Negative instance data, although typically more abundant than positive instance data, are often unreliable and contribute to imbalanced datasets. To address this issue, further investigation into semi-supervised learning and one-class learning approaches for dataset construction and model building is warranted. Transfer learning leverages the knowledge gained from one task to improve performance on a related but distinct task: by pre-training a model on a large dataset and then fine-tuning it on a smaller task-specific dataset, it can mitigate the challenges posed by limited training data, thereby addressing the underfitting that may arise when data for certain modification sites are insufficient.

7.2. Interpretability

In recent years, deep learning techniques have shown considerable value in bioinformatics, particularly for the prediction of protein modification sites. However, these models typically utilize complex nonlinear network architectures, leading to a lack of interpretability of their internal mechanisms. This ‘black box’ nature undermines researchers’ trust in the predictive outcomes of these models and ignites discussions concerning the reliability of algorithmic decisions in biomedical applications. The establishment of interpretable models is crucial for studying modification sites. These approaches seek to identify key sequence features that determine modification sites, indicating not only where modifications occur but also why they occur at specific locations. Furthermore, developing interpretable methods that correlate sequence-level importance with three-dimensional protein structures can elucidate the spatial and physicochemical constraints of these modifications.
In [179], MIND-S utilized structural models and protein structure graph networks to improve PTM prediction. Integrating protein structure-based features with other feature types can effectively enhance model interpretability. Ultimately, the core value of interpretability analysis lies in converting model predictions into experimentally verifiable scientific hypotheses, thereby exploring the relationship between modification sites and biological activity. Interpretability methods fall into two categories: post hoc and ante hoc. Post hoc interpretability encompasses example-based, attribution-based, latent-semantics-based, and rule-based approaches. In contrast, ante hoc interpretable learning treats interpretability as a fundamental objective of model design and training, adopting a transparent model architecture or imposing interpretability constraints during training. When developing interpretable models, it is essential to design self-transparent neural networks that maintain performance and efficiency, and to select appropriate interpretability evaluation metrics.

7.3. Multi-PTM and PTM Crosstalk Prediction

Protein sequences may contain multiple modification sites that play crucial roles in protein functionality and stability. Yan et al. [179] proposed a multi-PTM prediction method called MIND-S, which applies a multi-label strategy to predict multiple PTM types and their sites. Multi-PTM prediction aims to systematically identify the multiple types of post-translational modifications, and their specific sites, that coexist on a single protein, thereby elucidating the complex co-modification landscape of proteins. Unlike traditional single-PTM prediction approaches, this strategy not only focuses on the independent distribution of common modifications, such as phosphorylation and acetylation, but also emphasizes the analysis of spatial proximity, temporal dynamics, and potential functional relationships among different modifications. By integrating high-throughput mass spectrometry data with diverse sequence, structural, and evolutionary features, and leveraging advanced computational methods such as deep learning, multi-PTM prediction facilitates the construction of a more holistic functional regulatory map of proteins. This approach lays a crucial foundation for subsequent investigations into the synergistic or antagonistic effects between modifications, commonly referred to as PTM crosstalk.
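The multi-label formulation can be sketched with scikit-learn's `MultiOutputClassifier` on synthetic data. The label names and model choice below are illustrative assumptions, not the MIND-S architecture.

```python
# Hedged sketch of multi-label PTM prediction: one residue window can
# carry several modification types at once, so each PTM type gets its
# own binary output. Data and label names are synthetic.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

PTM_TYPES = ["phosphorylation", "acetylation", "ubiquitination"]  # hypothetical

# Synthetic stand-in: 100 encoded windows, 3 possibly co-occurring PTMs.
X, Y = make_multilabel_classification(n_samples=100, n_features=30,
                                      n_classes=3, random_state=0)

model = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0))
model.fit(X, Y)
pred = model.predict(X)   # shape (100, 3): one column per PTM type
```

Dedicated multi-label architectures additionally share representations across labels, which lets correlated modifications inform one another, a property that a per-label wrapper like this does not capture.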
PTM crosstalk plays a crucial role in organisms and can be categorized into crosstalk within the same protein (intra-protein) and crosstalk between different proteins (inter-protein). Currently, some studies have focused on identifying PTM crosstalk pairs [180,181,182]. However, research on PTM crosstalk mechanisms and interaction networks is limited. Research has shown that changes in certain modification sites may lead to disease [183]. Therefore, studying the relationship between modification sites and diseases is crucial for a deeper understanding of disease progression, intervention mechanisms, and the development of precise therapeutic drugs. However, there is currently limited research utilizing machine learning or deep learning methods to explore the relationship between modification sites and diseases, making this direction worthy of further exploration.

8. Conclusions

The accurate identification of PTMs is crucial for various applications, including drug development, disease diagnosis, and the understanding of molecular processes. Although traditional biological experimental methods are accurate, they are resource-intensive. Machine learning can efficiently process large datasets, although its prediction accuracy may be influenced by various factors. Machine learning methods comprise traditional approaches, which require manual feature extraction, and deep learning. This manuscript reviews current PTM site predictors based on machine learning, covering datasets, feature extraction, classifiers, and evaluation metrics, and it summarizes the existing methods and explores future research directions. Data play an important role in machine learning; thus, future research should explore ways to address imbalanced data and positive-unlabeled (PU) learning problems. Few-shot and transfer learning address data scarcity and model complexity, and multi-label learning is conducive to further exploring protein modifications. The association between modification sites and diseases is also worth exploring. We hope that our review and analysis will assist research related to PTMs.

Author Contributions

Methodology, K.Q.; writing—original draft preparation, S.G.; writing—review and editing, K.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62502242; the Henan Provincial Science and Technology Research Project, grant number 252102220039; the Key Research Project Plan for Higher Education Institutions of Henan Province, grant number 24A520027; and the Science and Technology Research Project of Nanyang, grant number 24KJGG059.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PTM	Post-translational modification
DL	Deep learning
ML	Machine learning

References

  1. Shrestha, P.; Kandel, J.; Tayara, H.; Chong, K.T. DL-SPhos: Prediction of serine phosphorylation sites using transformer language model. Comput. Biol. Med. 2024, 169, 107925. [Google Scholar] [CrossRef] [PubMed]
  2. Liang, J.-Z.; Li, D.-H.; Xiao, Y.-C.; Shi, F.-J.; Zhong, T.; Liao, Q.-Y.; Wang, Y.; He, Q.-Y. LAFEM: A Scoring Model to Evaluate Functional Landscape of Lysine Acetylome. Mol. Cell. Proteom. MCP 2024, 23, 100700. [Google Scholar] [CrossRef] [PubMed]
  3. Chang, K.W.; Gao, D.; Yan, J.D.; Lin, L.Y.; Cui, T.T.; Lu, S.M. Critical Roles of Protein Arginine Methylation in the Central Nervous System. Mol. Neurobiol. 2023, 60, 6060–6091. [Google Scholar] [CrossRef] [PubMed]
  4. Dai, X.F.; Zhang, T.X.; Hua, D. Ubiquitination and SUMOylation: Protein homeostasis control over cancer. Epigenomics 2022, 14, 43–58. [Google Scholar] [CrossRef]
  5. Masbuchin, A.N.; Rohman, M.S.; Liu, P.Y. Role of Glycosylation in Vascular Calcification. Int. J. Mol. Sci. 2021, 22, 9829. [Google Scholar] [CrossRef]
  6. Wohlschlager, T.; Scheffler, K.; Forstenlehner, I.C.; Skala, W.; Senn, S.; Damoc, E.; Holzmann, J.; Huber, C.G. Native mass spectrometry combined with enzymatic dissection unravels glycoform heterogeneity of biopharmaceuticals. Nat. Commun. 2018, 9, 1713. [Google Scholar] [CrossRef]
  7. Park, H.; Song, W.Y.; Cha, H.; Kim, T.Y. Development of an optimized sample preparation method for quantification of free fatty acids in food using liquid chromatography-mass spectrometry. Sci. Rep. 2021, 11, 5947. [Google Scholar] [CrossRef]
  8. Slade, D.J.; Subramanian, V.; Fuhrmann, J.; Thompson, P.R. Chemical and Biological Methods to Detect Post-Translational Modifications of Arginine. Biopolymers 2014, 101, 133–143. [Google Scholar] [CrossRef]
  9. Li, F.Y.; Dong, S.Y.; Leier, A.; Han, M.; Guo, X.D.; Xu, J.; Wang, X.Y.; Pan, S.R.; Jia, C.Z.; Zhang, Y.; et al. Positive-unlabeled learning in bioinformatics and computational biology: A brief review. Brief. Bioinform. 2022, 23, bbab461. [Google Scholar] [CrossRef]
  10. Qiao, Y.H.; Zhu, X.L.; Gong, H.P. BERT-Kcr: Prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics 2022, 38, 648–654. [Google Scholar] [CrossRef]
  11. Li, Y.Y.; Liu, Z.; Liu, X.; Zhu, Y.H.; Fang, C.H.; Arif, M.; Qiu, W.R. A Systematic Review of Computational Methods for Protein Post-Translational Modification Site Prediction. Arch. Comput. Methods Eng. 2025, 1–21. [Google Scholar] [CrossRef]
  12. Lussi, Y.C.; Magrane, M.; Martin, M.J.; Orchard, S. Searching and Navigating UniProt Databases. Curr. Protoc. 2023, 3, e700. [Google Scholar] [CrossRef] [PubMed]
  13. Bairoch, A.; Bougueleret, L.; Altairac, S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2008, 36, D190–D195. [Google Scholar] [CrossRef]
  14. Li, Z.Y.; Li, S.F.; Luo, M.Q.; Jhong, J.-H.; Li, W.S.; Yao, L.T.; Pang, Y.X.; Wang, Z.; Wang, R.L.; Ma, R.F.; et al. dbPTM in 2022: An updated database for exploring regulatory networks and functional associations of protein post-translational modifications. Nucleic Acids Res. 2022, 50, D471–D479. [Google Scholar] [CrossRef]
  15. Lee, T.Y.; Huang, H.D.; Hung, J.H.; Huang, H.Y.; Yang, Y.S.O.; Wang, T.H. dbPTM: An information repository of protein post-translational modification. Nucleic Acids Res. 2006, 34, D622–D627. [Google Scholar] [CrossRef]
  16. Zhang, W.Z.; Tan, X.D.; Lin, S.F.; Gou, Y.J.; Han, C.; Zhang, C.; Ning, W.S.; Wang, C.W.; Xue, Y. CPLM 4.0: An updated database with rich annotations for protein lysine modifications. Nucleic Acids Res. 2022, 50, D451–D459. [Google Scholar] [CrossRef]
  17. Liu, Z.X.; Cao, J.; Gao, X.J.; Zhou, Y.H.; Wen, L.P.; Yang, X.J.; Yao, X.B.; Ren, J.A.; Xue, Y. CPLA 1.0: An integrated database of protein lysine acetylation. Nucleic Acids Res. 2011, 39, D1029–D1034. [Google Scholar] [CrossRef]
  18. Liu, Z.X.; Wang, Y.B.; Gao, T.S.; Pan, Z.C.; Cheng, H.; Yang, Q.; Cheng, Z.Y.; Guo, A.Y.; Ren, J.; Xue, Y. CPLM: A database of protein lysine modifications. Nucleic Acids Res. 2014, 42, D531–D536. [Google Scholar] [CrossRef]
  19. Xu, H.D.; Zhou, J.Q.; Lin, S.F.; Deng, W.K.; Zhang, Y.; Xue, Y. PLMD: An updated data resource of protein lysine modifications. J. Genet. Genom. 2017, 44, 243–250. [Google Scholar] [CrossRef]
  20. Hornbeck, P.V.; Kornhauser, J.M.; Tkachev, S.; Zhang, B.; Skrzypek, E.; Murray, B.; Latham, V.; Sullivan, M. PhosphoSitePlus: A comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res. 2012, 40, D261–D270. [Google Scholar] [CrossRef]
  21. Yu, K.; Wang, Y.; Zheng, Y.Q.; Liu, Z.K.; Zhang, Q.F.; Wang, S.Y.; Zhao, Q.; Zhang, X.L.; Li, X.X.; Xu, R.H. qPTM: An updated database for PTM dynamics in human, mouse, rat and yeast. Nucleic Acids Res. 2023, 51, D479–D487. [Google Scholar] [CrossRef] [PubMed]
  22. Tung, C.W. PupDB: A database of pupylated proteins. BMC Bioinform. 2012, 13, 40. [Google Scholar] [CrossRef] [PubMed]
  23. Duan, G.Y.; Li, X.; Köhn, M. The human DEPhOsphorylation database DEPOD: A 2015 update. Nucleic Acids Res. 2015, 43, D531–D535. [Google Scholar] [CrossRef] [PubMed]
  24. Ma, J.F.; Li, Y.X.; Hou, C.Y.; Wu, C. O-GlcNAcAtlas: A database of experimentally identified O-GlcNAc sites and proteins. Glycobiology 2021, 31, 719–723. [Google Scholar] [CrossRef]
  25. Dinkel, H.; Chica, C.; Via, A.; Gould, C.M.; Jensen, L.J.; Gibson, T.J.; Diella, F. Phospho.ELM: A database of phosphorylation sites-update 2011. Nucleic Acids Res. 2011, 39, D261–D267. [Google Scholar] [CrossRef]
  26. Rao, R.S.P.; Zhang, N.; Xu, D.; Moller, I.M. CarbonylDB: A curated data-resource of protein carbonylation sites. Bioinformatics 2018, 34, 2518–2520. [Google Scholar] [CrossRef]
  27. Ramasamy, P.; Turan, D.; Tichshenko, N.; Hulstaert, N.; Vandermarliere, E.; Vranken, W.; Martens, L. Scop3P: A Comprehensive Resource of Human Phosphosites within Their Full Context. J. Proteome Res. 2020, 19, 3478–3486. [Google Scholar] [CrossRef]
  28. Hansen, J.E.; Lund, O.; Rapacki, K.; Brunak, S. O-GLYCBASE version 2.0: A revised database of O-glycosylated proteins. Nucleic Acids Res. 1997, 25, 278–282. [Google Scholar] [CrossRef]
  29. Lee, T.Y.; Chen, Y.J.; Lu, C.T.; Ching, W.C.; Teng, Y.C.; Huang, H.D.; Chen, Y.J. dbSNO: A database of cysteine S-nitrosylation. Bioinformatics 2012, 28, 2293–2295. [Google Scholar] [CrossRef]
  30. Li, Z.Y.; Chen, S.Y.; Jhong, J.H.; Pang, Y.X.; Huang, K.Y.; Li, S.F.; Lee, T.Y. UbiNet 2.0: A verified, classified, annotated and updated database of E3 ubiquitin ligase–substrate interactions. Database J. Biol. Databases Curation 2021, 2021, baab010. [Google Scholar] [CrossRef]
  31. Wang, X.; Li, Y.; He, M.Q.; Kong, X.R.; Jiang, P.; Liu, X.; Diao, L.H.; Zhang, X.L.; Li, H.L.; Ling, X.P.; et al. UbiBrowser 2.0: A comprehensive resource for proteome-wide known and predicted ubiquitin ligase/deubiquitinase-substrate interactions in eukaryotic species. Nucleic Acids Res. 2022, 50, D719–D728. [Google Scholar] [CrossRef] [PubMed]
  32. Durek, P.; Schmidt, R.; Heazlewood, J.L.; Jones, A.; MacLean, D.; Nagel, A.; Kersten, B.; Schulze, W.X. PhosPhAt: The Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Res. 2010, 38, D828–D834. [Google Scholar] [CrossRef]
  33. Lai, F.L.; Gao, F. Auto-Kla: A novel web server to discriminate lysine lactylation sites using automated machine learning. Brief. Bioinform. 2023, 24, bbad070. [Google Scholar] [CrossRef]
  34. Wei, L.Y.; Xing, P.W.; Shi, G.T.; Ji, Z.L.; Zou, Q. Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique. IEEE-ACM Trans. Comput. Biol. Bioinform. 2019, 16, 1264–1273. [Google Scholar] [CrossRef]
  35. Li, Z.T.; Fang, J.Y.; Wang, S.N.; Zhang, L.Y.; Chen, Y.Y.; Pian, C. Adapt-Kcr: A novel deep learning framework for accurate prediction of lysine crotonylation sites based on learning embedding features and attention architecture. Brief. Bioinform. 2022, 23, bbac037. [Google Scholar] [CrossRef]
  36. Sua, J.N.; Lim, S.Y.; Yulius, M.H.; Su, X.T.; Yapp, E.K.Y.; Le, N.Q.K.; Yeh, H.Y.; Chua, M.C.H. Incorporating convolutional neural networks and sequence graph transform for identifying multilabel protein Lysine PTM sites. Chemom. Intell. Lab. Syst. 2020, 206, 104171. [Google Scholar] [CrossRef]
  37. Lyu, X.R.; Li, S.H.; Jiang, C.Y.; He, N.N.; Chen, Z.; Zou, Y.; Li, L. DeepCSO: A Deep-Learning Network Approach to Predicting Cysteine S-Sulphenylation Sites. Front. Cell Dev. Biol. 2020, 8, 594587. [Google Scholar] [CrossRef]
  38. Auliah, F.N.; Nilamyani, A.N.; Shoombuatong, W.; Alam, M.A.; Hasan, M.M.; Kurata, H. PUP-Fuse: Prediction of Protein Pupylation Sites by Integrating Multiple Sequence Representations. Int. J. Mol. Sci. 2021, 22, 2120. [Google Scholar] [CrossRef]
  39. Bao, W.Z.; Yuan, C.A.; Zhang, Y.H.; Han, K.; Nandi, A.K.; Honig, B.; Huang, D.S. Mutli-Features Prediction of Protein Translational Modification Sites. IEEE-ACM Trans. Comput. Biol. Bioinform. 2018, 15, 1453–1460. [Google Scholar] [CrossRef] [PubMed]
  40. Khalili, E.; Ramazi, S.; Ghanati, F.; Kouchaki, S. Predicting protein phosphorylation sites in soybean using interpretable deep tabular learning network. Brief. Bioinform. 2022, 23, bbac015. [Google Scholar] [CrossRef] [PubMed]
  41. Li, W.Z.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659. [Google Scholar] [CrossRef]
  42. Yu, B.; Yu, Z.M.; Chen, C.; Ma, A.J.; Liu, B.Q.; Tian, B.G.; Ma, Q. DNNAce: Prediction of prokaryote lysine acetylation sites through deep neural networks with multi-information fusion. Chemom. Intell. Lab. Syst. 2020, 200, 103999. [Google Scholar] [CrossRef]
  43. Arafat, M.E.; Ahmad, M.W.; Shovan, S.M.; Dehzangi, A.; Dipta, S.R.; Hasan, M.A.; Taherzadeh, G.; Shatabda, S.; Sharma, A. Accurately Predicting Glutarylation Sites Using Sequential Bi-Peptide-Based Evolutionary Features. Genes 2020, 11, 1023. [Google Scholar] [CrossRef] [PubMed]
  44. Jamal, S.; Ali, W.; Nagpal, P.; Grover, A.; Grover, S. Predicting phosphorylation sites using machine learning by integrating the sequence, structure, and functional information of proteins. J. Transl. Med. 2021, 19, 218. [Google Scholar] [CrossRef] [PubMed]
  45. Gao, Y.; Hao, W.L.; Gu, J.; Liu, D.W.; Fan, C.; Chen, Z.G.; Deng, L. PredPhos: An ensemble framework for structure-based prediction of phosphorylation sites. J. Biol. Res.-Thessalon. 2016, 23, S12. [Google Scholar] [CrossRef]
  46. Chen, Z.; Pang, M.; Zhao, Z.X.; Li, S.N.; Miao, R.; Zhang, Y.F.; Feng, X.Y.; Feng, X.; Zhang, Y.X.; Duan, M.Y.; et al. Feature selection may improve deep neural networks for the bioinformatics problems. Bioinformatics 2020, 36, 1542–1552. [Google Scholar] [CrossRef]
  47. Ning, Q.; Ma, Z.Q.; Zhao, X.W.; Yin, M.H. SSKM_Succ: A Novel Succinylation Sites Prediction Method Incorporating K-Means Clustering With a New Semi-Supervised Learning Algorithm. IEEE-ACM Trans. Comput. Biol. Bioinform. 2022, 19, 643–652. [Google Scholar] [CrossRef]
  48. Jiang, M.; Cao, J.Z. Positive-Unlabeled Learning for Pupylation Sites Prediction. Biomed Res. Int. 2016, 2016, 4525786. [Google Scholar] [CrossRef]
  49. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  50. He, H.B.; Bai, Y.; Garcia, E.A.; Li, S.T. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar]
  51. Lu, Y.; Cheung, Y.M.; Tang, Y.Y. Hybrid Sampling with Bagging for Class Imbalance Learning. In Proceedings of the 20th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Auckland, New Zealand, 19–22 April 2016; pp. 14–26. [Google Scholar]
  52. Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J. Hybrid sampling for imbalanced data. Integr. Comput.-Aided Eng. 2009, 16, 193–210. [Google Scholar] [CrossRef]
  53. Dongdong, L.; Ziqiu, C.; Bolu, W.; Zhe, W.; Hai, Y.; Wenli, D. Entropy-based hybrid sampling ensemble learning for imbalanced data. Int. J. Intell. Syst. 2021, 36, 3039–3067. [Google Scholar] [CrossRef]
  54. Wang, M.H.; Cui, X.W.; Yu, B.; Chen, C.; Ma, Q.; Zhou, H.Y. SulSite-GTB: Identification of protein S-sulfenylation sites by fusing multiple feature information and gradient tree boosting. Neural Comput. Appl. 2020, 32, 13843–13862. [Google Scholar] [CrossRef]
  55. Wang, M.H.; Song, L.L.; Zhang, Y.Q.; Gao, H.L.; Yan, L.; Yu, B. Malsite-Deep: Prediction of protein malonylation sites through deep learning and multi-information fusion based on NearMiss-2 strategy. Knowl.-Based Syst. 2022, 240, 108191. [Google Scholar] [CrossRef]
  56. Wilson, D.L. Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421. [Google Scholar] [CrossRef]
  57. Ijaz, M.F.; Attique, M.; Son, Y. Data-Driven Cervical Cancer Prediction Model with Outlier Detection and Over-Sampling Methods. Sensors 2020, 20, 2809. [Google Scholar] [CrossRef]
  58. Mbunge, E.; Millham, R.C.; Sibiya, M.N.; Chemhaka, G.; Takavarasha, S.; Muchemwa, B.; Dzinamarira, T. Implementation of ensemble machine learning classifiers to predict diarrhoea with SMOTEENN, SMOTE, and SMOTETomek class imbalance approaches. In Proceedings of the Conference on Information Communications Technology and Society (ICTAS), Durban, South Africa, 8–9 March 2023; pp. 90–95. [Google Scholar]
  59. Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-Sensitive Learning of Deep Feature Representations From Imbalanced Data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3573–3587. [Google Scholar] [CrossRef]
  60. Yuan, Z.W.; Zhao, P. An Improved Ensemble Learning for Imbalanced Data Classification. In Proceedings of the IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 24–26 May 2019; pp. 408–411. [Google Scholar]
  61. Hu, X.S.; Zhang, R.J. Clustering-based Subset Ensemble Learning Method for Imbalanced Data. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC), Tianjin, China, 14–17 July 2013; pp. 35–39. [Google Scholar]
  62. Hayashi, T.; Fujita, H. One-class ensemble classifier for data imbalance problems. Appl. Intell. 2022, 52, 17073–17089. [Google Scholar] [CrossRef]
  63. Dou, L.J.; Yang, F.L.; Xu, L.; Zou, Q. A comprehensive review of the imbalance classification of protein post-translational modifications. Brief. Bioinform. 2021, 22, bbab089. [Google Scholar] [CrossRef]
  64. Branco, P.; Torgo, L.; Ribeiro, R.P. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput. Surv. 2016, 49, 31. [Google Scholar]
  65. Kaur, H.; Pannu, H.S.; Malhi, A.K. A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions. ACM Comput. Surv. 2019, 52, 79. [Google Scholar] [CrossRef]
  66. Wang, M.; Yang, J.; Liu, G.P.; Xu, Z.J.; Chou, K.C. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng. Des. Sel. 2004, 17, 509–516. [Google Scholar] [CrossRef]
  67. Lin, C.-F.; Wang, S.-D. Fuzzy support vector machines. IEEE Trans. Neural Netw. 2002, 13, 464–471. [Google Scholar]
  68. Ju, Z.; Wang, S.Y. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou’s 5-steps rule and general pseudo components. Genomics 2020, 112, 859–866. [Google Scholar] [CrossRef] [PubMed]
  69. Zhou, Z.H.; Liu, X.Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans. Knowl. Data Eng. 2006, 18, 63–77. [Google Scholar] [CrossRef]
  70. Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: A Hybrid Approach to Alleviating Class Imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 40, 185–197. [Google Scholar] [CrossRef]
  71. Jia, C.Z.; Zuo, Y.; Zou, Q. O-GlcNAcPRED-II: An integrated classification algorithm for identifying O-GlcNAcylation sites based on fuzzy undersampling and a K-means PCA oversampling technique. Bioinformatics 2018, 34, 2029–2036. [Google Scholar] [CrossRef]
  72. Wu, X.Y.; Srihari, R.; Zheng, Z.H. Document representation for one-class SVM. In Machine Learning: ECML 2004; Boulicaut, J.F., Esposito, F., Giannoti, F., Pedreschi, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3201, pp. 489–500. [Google Scholar]
  73. Islam, S.; Mugdha, S.B.; Dipta, S.R.; Arafat, M.E.; Shatabda, S.; Alinejad-Rokny, H.; Dehzangi, I. MethEvo: An accurate evolutionary information-based methylation site predictor. Neural Comput. Appl. 2022, 36, 201–212. [Google Scholar] [CrossRef]
  74. Huang, K.Y.; Hung, F.Y.; Kao, H.J.; Lau, H.H.; Weng, S.L. iDPGK: Characterization and identification of lysine phosphoglycerylation sites based on sequence-based features. BMC Bioinform. 2020, 21, 568. [Google Scholar] [CrossRef]
  75. Sahu, S.S.; Panda, G. A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction. Comput. Biol. Chem. 2010, 34, 320–327. [Google Scholar] [CrossRef]
  76. Huang, K.Y.; Hsu, J.B.K.; Lee, T.Y. Characterization and Identification of Lysine Succinylation Sites based on Deep Learning Method. Sci. Rep. 2019, 9, 16175. [Google Scholar] [CrossRef]
  77. Jiang, P.R.; Ning, W.S.; Shi, Y.S.; Liu, C.; Mo, S.J.; Zhou, H.R.; Liu, K.D.; Guo, Y.P. FSL-Kla: A few-shot learning-based multi-feature hybrid system for lactylation site prediction. Comput. Struct. Biotechnol. J. 2021, 19, 4497–4509. [Google Scholar] [CrossRef] [PubMed]
  78. Suo, S.B.; Qiu, J.D.; Shi, S.P.; Sun, X.Y.; Huang, S.Y.; Chen, X.; Liang, R.P. Position-Specific Analysis and Prediction for Protein Lysine Acetylation Based on Multiple Features. PLoS ONE 2012, 7, e49108. [Google Scholar] [CrossRef] [PubMed]
  79. Shen, H.B.; Yang, J.; Chou, K.C. Fuzzy KNN for predicting membrane protein types from pseudo-amino acid composition. J. Theor. Biol. 2006, 240, 9–13. [Google Scholar] [CrossRef]
  80. Gao, J.J.; Thelen, J.J.; Dunker, A.K.; Xu, D. Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites. Mol. Cell. Proteom. 2010, 9, 2586–2600. [Google Scholar] [CrossRef]
  81. Shen, J.W.; Zhang, J.; Luo, X.M.; Zhu, W.L.; Yu, K.Q.; Chen, K.X.; Li, Y.X.; Jiang, H.L. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA 2007, 104, 4337–4341. [Google Scholar] [CrossRef]
  82. Saravanan, V.; Gautham, N. Harnessing Computational Biology for Exact Linear B-Cell Epitope Prediction: A Novel Amino Acid Composition-Based Feature Descriptor. OMICS J. Integr. Biol. 2015, 19, 648–658. [Google Scholar] [CrossRef]
  83. Park, K.J.; Kanehisa, M. Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 2003, 19, 1656–1663. [Google Scholar] [CrossRef]
  84. Keskin, O.; Bahar, I.; Badretdinov, A.Y.; Ptitsyn, O.B.; Jernigan, R.L. Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions. Protein Sci. 1998, 7, 2578–2586. [Google Scholar] [CrossRef]
  85. Liang, S.D.; Grishin, N.V. Effective scoring function for protein sequence design. Proteins Struct. Funct. Bioinform. 2004, 54, 271–281. [Google Scholar] [CrossRef]
  86. Chan, C.H.; Liang, H.K.; Hsiao, N.W.; Ko, M.T.; Lyu, P.C.; Hwang, J.K. Relationship between local structural entropy and protein thermostability. Proteins Struct. Funct. Bioinform. 2004, 57, 684–691. [Google Scholar] [CrossRef]
  87. Tang, Y.R.; Chen, Y.Z.; Canchaya, C.A.; Zhang, Z.D. GANNPhos: A new phosphorylation site predictor based on a genetic algorithm integrated neural network. Protein Eng. Des. Sel. 2007, 20, 405–412. [Google Scholar] [CrossRef] [PubMed]
  88. Xu, Y.; Wang, X.B.; Wang, Y.C.; Tian, Y.J.; Shao, X.J.; Wu, L.Y.; Deng, N.Y. Prediction of posttranslational modification sites from amino acid sequences with kernel methods. J. Theor. Biol. 2014, 344, 78–87. [Google Scholar] [CrossRef] [PubMed]
  89. Lee, T.Y.; Lin, Z.Q.; Hsieh, S.J.; Bretaña, N.A.; Lu, C.T. Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences. Bioinformatics 2011, 27, 1780–1787. [Google Scholar] [CrossRef] [PubMed]
  90. Kawashima, S.; Kanehisa, M. AAindex: Amino acid index database. Nucleic Acids Res. 2000, 28, 374. [Google Scholar] [CrossRef]
  91. Li, F.Y.; Li, C.; Wang, M.J.; Webb, G.I.; Zhang, Y.; Whisstock, J.C.; Song, J.N. GlycoMine: A machine learning-based approach for predicting N-, C- and O-linked glycosylation in the human proteome. Bioinformatics 2015, 31, 1411–1419. [Google Scholar] [CrossRef]
  92. Gong, W.M.; Zhou, D.H.; Ren, Y.L.; Wang, Y.J.; Zuo, Z.X.; Shen, Y.P.; Xiao, F.F.; Zhu, Q.; Hong, A.L.; Zhou, X.; et al. PepCyber:P~PEP: A database of human protein-protein interactions mediated by phosphoprotein-binding domains. Nucleic Acids Res. 2008, 36, D679–D683. [Google Scholar] [CrossRef]
  93. Wagner, M.; Adamczak, R.; Porollo, A.; Meller, J. Linear regression models for solvent accessibility prediction in proteins. J. Comput. Biol. 2005, 12, 355–369. [Google Scholar] [CrossRef]
  94. Tomii, K.; Kanehisa, M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 1996, 9, 27–36. [Google Scholar] [CrossRef]
  95. Dubchak, I.; Muchnik, I.; Holbrook, S.R.; Kim, S.H. Prediction of protein-folding class using global description of amino-acid-sequence. Proc. Natl. Acad. Sci. USA 1995, 92, 8700–8704. [Google Scholar] [CrossRef]
  96. Faraggi, E.; Xue, B.; Zhou, Y.Q. Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins Struct. Funct. Bioinform. 2009, 74, 847–856. [Google Scholar] [CrossRef]
  97. Kabsch, W.; Sander, C. Dictionary of protein secondary structure—Pattern-recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22, 2577–2637. [Google Scholar] [CrossRef]
  98. López, Y.; Dehzangi, A.; Lal, S.P.; Taherzadeh, G.; Michaelson, J.; Sattar, A.; Tsunoda, T.; Sharma, A. SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids. Anal. Biochem. 2017, 527, 24–32. [Google Scholar] [CrossRef]
  99. López, Y.; Sharma, A.; Dehzangi, A.; Lal, S.P.; Taherzadeh, G.; Sattar, A.; Tsunoda, T. Success: Evolutionary and structural properties of amino acids prove effective for succinylation site prediction. BMC Genom. 2018, 19, 923. [Google Scholar] [CrossRef]
  100. Ward, J.J.; McGuffin, L.J.; Bryson, K.; Buxton, B.F.; Jones, D.T. The DISOPRED server for the prediction of protein disorder. Bioinformatics 2004, 20, 2138–2139. [Google Scholar] [CrossRef] [PubMed]
  101. Holland, R.C.G.; Down, T.A.; Pocock, M.; Prlic, A.; Huen, D.; James, K.; Foisy, S.; Draeger, A.; Yates, A.; Heuer, M.; et al. BioJava: An open-source framework for bioinformatics. Bioinformatics 2008, 24, 2096–2097. [Google Scholar] [CrossRef] [PubMed]
  102. Obradovic, Z.; Peng, K.; Vucetic, S.; Radivojac, P.; Dunker, A.K. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins Struct. Funct. Bioinform. 2005, 61, 176–182. [Google Scholar] [CrossRef] [PubMed]
  103. Heffernan, R.; Paliwal, K.; Lyons, J.; Dehzangi, A.; Sharma, A.; Wang, J.H.; Sattar, A.; Yang, Y.D.; Zhou, Y.Q. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci. Rep. 2015, 5, 11476. [Google Scholar] [CrossRef]
  104. Islam, M.M.; Saha, S.; Rahman, M.M.; Shatabda, S.; Farid, D.M.; Dehzangi, A. iProtGly-SS: Identifying protein glycation sites using sequence and structure based features. Proteins Struct. Funct. Bioinform. 2018, 86, 777–789. [Google Scholar] [CrossRef]
  105. Sharma, A.; Lyons, J.; Dehzangi, A.; Paliwal, K.K. A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. J. Theor. Biol. 2013, 320, 41–46. [Google Scholar] [CrossRef]
  106. Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T.; et al. Gene Ontology: Tool for the unification of biology. Nat. Genet. 2000, 25, 25–29. [Google Scholar] [CrossRef]
  107. Hunter, S.; Jones, P.; Mitchell, A.; Apweiler, R.; Attwood, T.K.; Bateman, A.; Bernard, T.; Binns, D.; Bork, P.; Burge, S.; et al. InterPro in 2011: New developments in the family and domain prediction database. Nucleic Acids Res. 2012, 40, D306–D312. [Google Scholar] [CrossRef]
  108. Kanehisa, M.; Goto, S.; Sato, Y.; Furumichi, M.; Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012, 40, D109–D114. [Google Scholar] [CrossRef]
  109. Finn, R.D.; Tate, J.; Mistry, J.; Coggill, P.C.; Sammut, S.J.; Hotz, H.R.; Ceric, G.; Forslund, K.; Eddy, S.R.; Sonnhammer, E.L.L.; et al. The Pfam protein families database. Nucleic Acids Res. 2008, 36, D281–D288. [Google Scholar] [CrossRef]
  110. Franceschini, A.; Szklarczyk, D.; Frankild, S.; Kuhn, M.; Simonovic, M.; Roth, A.; Lin, J.Y.; Minguez, P.; Bork, P.; von Mering, C.; et al. STRING v9.1: Protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41, D808–D815. [Google Scholar] [CrossRef]
  111. Weng, S.L.; Huang, K.Y.; Kaunang, F.J.; Huang, C.H.; Kao, H.J.; Chang, T.H.; Wang, H.Y.; Lu, J.J.; Lee, T.Y. Investigation and identification of protein carbonylation sites based on position-specific amino acid composition and physicochemical features. BMC Bioinform. 2017, 18, 66. [Google Scholar] [CrossRef] [PubMed]
  112. Celniker, G.; Nimrod, G.; Ashkenazy, H.; Glaser, F.; Martz, E.; Mayrose, I.; Pupko, T.; Ben-Tal, N. ConSurf: Using Evolutionary Data to Raise Testable Hypotheses about Protein Function. Isr. J. Chem. 2013, 53, 199–206. [Google Scholar] [CrossRef]
  113. Armon, A.; Graur, D.; Ben-Tal, N. ConSurf: An algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J. Mol. Biol. 2001, 307, 447–463. [Google Scholar] [CrossRef] [PubMed]
  114. Shen, H.B.; Chou, K.C. Nuc-PLoc: A new web-server for predicting protein subnuclear localization by fusing PseAA composition and PsePSSM. Protein Eng. Des. Sel. 2007, 20, 561–567. [Google Scholar] [CrossRef]
  115. Alkuhlani, A.; Gad, W.; Roushdy, M.; Voskoglou, M.G.; Salem, A.B.M. PTG-PLM: Predicting Post-Translational Glycosylation and Glycation Sites Using Protein Language Models and Deep Learning. Axioms 2022, 11, 469. [Google Scholar] [CrossRef]
  116. Elnaggar, A.; Heinzinger, M.; Dallago, C.; Rehawi, G.; Wang, Y.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Steinegger, M.; et al. ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv 2021, arXiv:2007.06225. [Google Scholar]
  117. Rives, A.; Meier, J.; Sercu, T.; Goyal, S.; Lin, Z.M.; Liu, J.S.; Guo, D.M.; Ott, M.; Zitnick, C.L.; Ma, J.; et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 2021, 118, e2016239118. [Google Scholar] [CrossRef]
  118. Rao, R.; Bhattacharya, N.; Thomas, N.; Duan, Y.; Chen, X.; Canny, J.; Abbeel, P.; Song, Y.S. Evaluating Protein Transfer Learning with TAPE. Adv. Neural Inf. Process. Syst. 2019, 32, 9689–9701. [Google Scholar] [PubMed]
  119. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
  120. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  121. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2020, arXiv:1906.08237. [Google Scholar] [CrossRef]
  122. Wang, H.F.; Wang, Z.; Li, Z.Y.; Lee, T.Y. Incorporating Deep Learning With Word Embedding to Identify Plant Ubiquitylation Sites. Front. Cell Dev. Biol. 2020, 8, 572195. [Google Scholar] [CrossRef]
  123. Yu, K.; Zhang, Q.F.; Liu, Z.K.; Du, Y.M.; Gao, X.J.; Zhao, Q.; Cheng, H.; Li, X.X.; Liu, Z.X. Deep learning based prediction of reversible HAT/HDAC-specific lysine acetylation. Brief. Bioinform. 2020, 21, 1798–1805. [Google Scholar] [CrossRef]
  124. Liu, M.; Zhu, F. LkaM-PTM: Predicting PTM sites through multimodal protein features from capturing cross-field information. Artif. Intell. Med. 2026, 171, 103297. [Google Scholar] [CrossRef]
  125. Varga, J.K.; Ovchinnikov, S.; Schueler-Furman, O. actifpTM: A refined confidence metric of AlphaFold2 predictions involving flexible regions. Bioinformatics 2025, 41, btaf107. [Google Scholar] [CrossRef]
  126. Li, S.H.; Zhang, J.; Zhao, Y.W.; Dad, F.Y.; Ding, H.; Chen, W.; Tang, H. iPhoPred: A Predictor for Identifying Phosphorylation Sites in Human Protein. IEEE Access 2019, 7, 177517–177528. [Google Scholar] [CrossRef]
  127. Xu, Y.; Ding, Y.X.; Ding, J.; Wu, L.Y.; Xue, Y. Mal-Lys: Prediction of lysine malonylation sites in proteins integrated sequence-based features with mRMR feature selection. Sci. Rep. 2016, 6, 38318. [Google Scholar] [CrossRef]
  128. Zhang, N.; Zhou, Y.; Huang, T.; Zhang, Y.C.; Li, B.Q.; Chen, L.; Cai, Y.D. Discriminating between Lysine Sumoylation and Lysine Acetylation Using mRMR Feature Selection and Analysis. PLoS ONE 2014, 9, e107464. [Google Scholar] [CrossRef] [PubMed]
  129. Ma, X.; Guo, J.; Sun, X. Sequence-Based Prediction of RNA-Binding Proteins Using Random Forest with Minimum Redundancy Maximum Relevance Feature Selection. Biomed Res. Int. 2015, 2015, 425810. [Google Scholar] [CrossRef] [PubMed]
  130. Peker, M.; Sen, B.; Delen, D. Computer-Aided Diagnosis of Parkinson’s Disease Using Complex-Valued Neural Networks and mRMR Feature Selection Algorithm. J. Healthc. Eng. 2015, 6, 281–302. [Google Scholar] [CrossRef] [PubMed]
  131. He, S.D.; Ye, X.C.; Sakurai, T.; Zou, Q. MRMD3.0: A Python Tool and Webserver for Dimensionality Reduction and Data Visualization via an Ensemble Strategy. J. Mol. Biol. 2023, 435, 168116. [Google Scholar] [CrossRef]
  132. Yu, J.L.; Shi, S.P.; Zhang, F.; Chen, G.D.; Cao, M. PredGly: Predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics 2019, 35, 2749–2756. [Google Scholar] [CrossRef]
  133. Chen, T.Q.; Guestrin, C.; Assoc Comp, M. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  134. Xu, Y.; Li, L.; Ding, J.; Wu, L.Y.; Mai, G.Q.; Zhou, F.F. Gly-PseAAC: Identifying protein lysine glycation through sequences. Gene 2017, 602, 1–7. [Google Scholar] [CrossRef]
  135. Ning, Q.; Ma, Z.Q.; Zhao, X.W. dForml(KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou’s 5-step rule and pseudo components. J. Theor. Biol. 2019, 470, 43–49. [Google Scholar] [CrossRef]
  136. Dosset, P.; Rassam, P.; Fernandez, L.; Espenel, C.; Rubinstein, E.; Margeat, E.; Milhiet, P.E. Automatic detection of diffusion modes within biological membranes using back-propagation neural network. BMC Bioinform. 2016, 17, 197. [Google Scholar] [CrossRef]
  137. Butt, A.H.; Khan, Y.D. Prediction of S-Sulfenylation Sites Using Statistical Moments Based Features via CHOU’S 5-Step Rule. Int. J. Pept. Res. Ther. 2020, 26, 1291–1301. [Google Scholar] [CrossRef]
  138. Malebary, S.J.; Rehman, M.S.U.; Khan, Y.D. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou’s 5-step rule. PLoS ONE 2019, 14, e0223993. [Google Scholar] [CrossRef]
  139. Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. J. Artif. Intell. Res. 1999, 11, 169–198. [Google Scholar] [CrossRef]
  140. Rokach, L. Ensemble-based classifiers. Artif. Intell. Rev. 2010, 33, 1–39. [Google Scholar] [CrossRef]
  141. Hasan, M.M.; Guo, D.J.; Kurata, H. Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information. Mol. Biosyst. 2017, 13, 2545–2550. [Google Scholar] [CrossRef] [PubMed]
  142. Shi, M.H.; Lin, F.X.; Qian, Y.; Dou, L. Research of Imbalanced Classification Based on Cascade Forest. In Proceedings of the IEEE International Conference on Progress in Informatics and Computing (IEEE PIC), Shanghai, China, 17–19 December 2021; pp. 29–33. [Google Scholar]
  143. Chu, Y.Y.; Kaushik, A.C.; Wang, X.G.; Wang, W.; Zhang, Y.F.; Shan, X.Q.; Salahub, D.R.; Xiong, Y.; Wei, D.Q. DTI-CDF: A cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief. Bioinform. 2021, 22, 451–462. [Google Scholar] [CrossRef] [PubMed]
  144. Qian, Y.; Ye, S.S.; Zhang, Y.; Zhang, J.M. SUMO-Forest: A Cascade Forest based method for the prediction of SUMOylation sites on imbalanced data. Gene 2020, 741, 144536. [Google Scholar] [CrossRef]
  145. Rao, H.; Shi, X.Z.; Rodrigue, A.K.; Feng, J.J.; Xia, Y.C.; Elhoseny, M.; Yuan, X.H.; Gu, L.C. Feature selection based on artificial bee colony and gradient boosting decision tree. Appl. Soft Comput. 2019, 74, 634–642. [Google Scholar] [CrossRef]
  146. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  147. He, F.; Wang, R.; Gao, Y.X.; Wang, D.L.; Yu, Y.; Xu, D.; Zhao, X.W. Protein Ubiquitylation and Sumoylation Site Prediction Based on Ensemble and Transfer Learning. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 117–123. [Google Scholar]
  148. Zhang, Y.J.; Xie, R.P.; Wang, J.W.; Leier, A.; Marquez-Lago, T.T.; Akutsu, T.; Webb, G.I.; Chou, K.C.; Song, J.N. Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework. Brief. Bioinform. 2019, 20, 2185–2199. [Google Scholar] [CrossRef]
  149. Wang, D.L.; Liu, D.P.; Yuchi, J.K.; He, F.; Jiang, Y.X.; Cai, S.T.; Li, J.Y.; Xu, D. MusiteDeep: A deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Res. 2020, 48, W140–W146. [Google Scholar] [CrossRef]
  150. Zhao, Y.M.; He, N.N.; Chen, Z.; Li, L. Identification of Protein Lysine Crotonylation Sites by a Deep Learning Framework With Convolutional Neural Networks. IEEE Access 2020, 8, 14244–14252. [Google Scholar] [CrossRef]
  151. Wei, X.L.; Sha, Y.T.; Zhao, Y.M.; He, N.N.; Li, L. DeepKcrot: A Deep-Learning Architecture for General and Species-Specific Lysine Crotonylation Site Prediction. IEEE Access 2021, 9, 49504–49513. [Google Scholar] [CrossRef]
  152. Xiu, Q.X.; Li, D.C.; Li, H.L.; Wang, N.; Ding, C. Prediction Method for Lysine Acetylation Sites Based on LSTM Network. In Proceedings of the 7th IEEE International Conference on Computer Science and Network Technology (ICCSNT), Dalian, China, 19–20 October 2019; pp. 179–182. [Google Scholar]
  153. Li, A.; Deng, Y.W.; Tan, Y.; Chen, M. A Transfer Learning-Based Approach for Lysine Propionylation Prediction. Front. Physiol. 2021, 12, 658633. [Google Scholar] [CrossRef] [PubMed]
  154. Zhao, Q.; Ma, J.Q.; Wang, Y.; Xie, F.; Lv, Z.B.; Xu, Y.Q.; Shi, H.; Han, K. Mul-SNO: A Novel Prediction Tool for S-Nitrosylation Sites Based on Deep Learning Methods. IEEE J. Biomed. Health Inform. 2022, 26, 2379–2387. [Google Scholar] [CrossRef] [PubMed]
  155. Liu, Y.; Ye, C.F.; Lin, C.; Wang, Q.; Zhou, J.X.; Zhu, M. Semi-ssPTM: A Web Server for Species-Specific Lysine Post-Translational Modification Site Prediction by Semi-Supervised Domain Adaptation. IEEE Trans. Instrum. Meas. 2024, 73, 2523410. [Google Scholar] [CrossRef]
  156. Ning, W.S.; Xu, H.D.; Jiang, P.R.; Cheng, H.; Deng, W.K.; Guo, Y.P.; Xue, Y. HybridSucc: A Hybrid-learning Architecture for General and Species-specific Succinylation Site Prediction. Genom. Proteom. Bioinform. 2020, 18, 194–207. [Google Scholar] [CrossRef]
  157. Chen, Z.; Zhao, P.; Li, F.Y.; Leier, A.; Marquez-Lago, T.T.; Webb, G.I.; Baggag, A.; Bensmail, H.; Song, J. PROSPECT: A web server for predicting protein histidine phosphorylation sites. J. Bioinform. Comput. Biol. 2020, 18, 2050018. [Google Scholar] [CrossRef]
  158. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  159. Meng, L.K.; Chen, X.J.; Cheng, K.; Chen, N.J.; Zheng, Z.T.; Wang, F.Z.; Sun, H.Y.; Wong, K.C. TransPTM: A transformer-based model for non-histone acetylation site prediction. Brief. Bioinform. 2024, 25, bbae219. [Google Scholar] [CrossRef]
  160. Liang, Y.Y.; Li, M.W. A deep learning model for prediction of lysine crotonylation sites by fusing multi-features based on multi-head self-attention mechanism. Sci. Rep. 2025, 15, 18940. [Google Scholar] [CrossRef]
  161. Xu, D.L.; Zhu, Y.F.; Xu, Q.; Liu, Y.H.; Chen, Y.; Zou, Y.; Li, L. DTL-NeddSite: A Deep-Transfer Learning Architecture for Prediction of Lysine Neddylation Sites. IEEE Access 2023, 11, 51798–51809. [Google Scholar] [CrossRef]
  162. Soylu, N.N.; Sefer, E. DeepPTM: Protein Post-translational Modification Prediction from Protein Sequences by Combining Deep Protein Language Model with Vision Transformers. Curr. Bioinform. 2024, 19, 810–824. [Google Scholar] [CrossRef]
  163. Lv, H.; Dao, F.Y.; Guan, Z.X.; Yang, H.; Li, Y.W.; Lin, H. Deep-Kcr: Accurate detection of lysine crotonylation sites using deep learning method. Brief. Bioinform. 2021, 22, bbaa255. [Google Scholar] [CrossRef] [PubMed]
  164. Xu, Y.; Ding, J.; Wu, L.Y. iSulf-Cys: Prediction of S-sulfenylation Sites in Proteins with Physicochemical Properties of Amino Acids. PLoS ONE 2016, 11, e0154237. [Google Scholar] [CrossRef] [PubMed]
  165. Liu, S.; Xue, C.; Fang, Y.; Chen, G.; Peng, X.J.; Zhou, Y.; Chen, C.; Liu, G.Q.; Gu, M.H.; Wang, K.; et al. Global Involvement of Lysine Crotonylation in Protein Modification and Transcription Regulation in Rice. Mol. Cell. Proteom. 2018, 17, 1922–1936. [Google Scholar] [CrossRef] [PubMed]
  166. Sun, H.J.; Liu, X.W.; Li, F.F.; Li, W.; Zhang, J.; Xiao, Z.X.; Shen, L.L.; Li, Y.; Wang, F.L.; Yang, J.G. First comprehensive proteome analysis of lysine crotonylation in seedling leaves of Nicotiana tabacum. Sci. Rep. 2017, 7, 3013. [Google Scholar] [CrossRef]
  167. Liu, K.D.; Yuan, C.C.; Li, H.L.; Chen, K.Y.; Lu, L.S.; Shen, C.J.; Zheng, X.L. A qualitative proteome-wide lysine crotonylation profiling of papaya (Carica papaya L.). Sci. Rep. 2018, 8, 8230. [Google Scholar] [CrossRef]
  168. Li, S.H.; Yu, K.; Wu, G.D.; Zhang, Q.F.; Wang, P.Q.; Zheng, J.; Liu, Z.X.; Wang, J.C.; Gao, X.J.; Cheng, H. pCysMod: Prediction of Multiple Cysteine Modifications Based on Deep Learning Framework. Front. Cell Dev. Biol. 2021, 9, 617366. [Google Scholar] [CrossRef]
  169. Al-barakati, H.J.; Saigo, H.; Newman, R.H.; Dukka, B.K. RF-GlutarySite: A random forest based predictor for glutarylation sites. Mol. Omics 2019, 15, 189–204. [Google Scholar] [CrossRef]
  170. Dou, L.J.; Li, X.L.; Zhang, L.C.; Xiang, H.K.; Xu, L. iGlu_AdaBoost: Identification of Lysine Glutarylation Using the AdaBoost Classifier. J. Proteome Res. 2021, 20, 191–201. [Google Scholar] [CrossRef]
  171. Chung, C.R.; Chang, Y.P.; Hsu, Y.L.; Chen, S.Y.; Wu, L.C.; Horng, J.T.; Lee, T.Y. Incorporating hybrid models into lysine malonylation sites prediction on mammalian and plant proteins. Sci. Rep. 2020, 10, 10541. [Google Scholar] [CrossRef]
  172. Liu, Y.; Li, A.; Zhao, X.M.; Wang, M.H. DeepTL-Ubi: A novel deep transfer learning method for effectively predicting ubiquitination sites of multiple species. Methods 2021, 192, 103–111. [Google Scholar] [CrossRef]
  173. Long, H.X.; Sun, Z.; Li, M.Z.; Fu, H.Y.; Lin, M.C. Predicting Protein Phosphorylation Sites Based on Deep Learning. Curr. Bioinform. 2020, 15, 300–308. [Google Scholar] [CrossRef]
  174. Zahiri, Z.; Mehrshad, N.; Mehrshad, M. DF-Phos: Prediction of Protein Phosphorylation Sites by Deep Forest. J. Biochem. 2023, 175, 447–456. [Google Scholar] [CrossRef] [PubMed]
  175. Wang, R.L.; Wang, Z.; Wang, H.F.; Pang, Y.X.; Lee, T.Y. Characterization and identification of lysine crotonylation sites based on machine learning method on both plant and mammalian. Sci. Rep. 2020, 10, 20447. [Google Scholar] [CrossRef] [PubMed]
  176. Lv, H.; Dao, F.Y.; Lin, H. DeepKla: An attention mechanism-based deep neural network for protein lysine lactylation site prediction. iMeta 2022, 1, e11. [Google Scholar] [CrossRef] [PubMed]
  177. Guan, J.H.; Xie, P.L.; Dong, D.H.; Liu, Q.C.; Zhao, Z.H.; Guo, Y.L.; Zhang, Y.L.; Lee, T.Y.; Yao, L.T.; Chiang, Y.C. DeepKlapred: A deep learning framework for identifying protein lysine lactylation sites via multi-view feature fusion. Int. J. Biol. Macromol. 2024, 283, 137668. [Google Scholar] [CrossRef]
  178. Wen, B.; Wang, C.W.; Li, K.; Han, P.; Holt, M.V.; Savage, S.R.; Lei, J.T.; Dou, Y.C.; Shi, Z.; Li, Y.; et al. DeepMVP: Deep learning models trained on high-quality data accurately predict PTM sites and variant-induced alterations. Nat. Methods 2025, 22, 1857–1867. [Google Scholar] [CrossRef]
  179. Yan, Y.; Jiang, J.Y.; Fu, M.Z.; Wang, D.; Pelletier, A.R.; Sigdel, D.; Ng, D.C.M.; Wang, W.; Ping, P.P. MIND-S is a deep-learning prediction model for elucidating protein post-translational modifications in human diseases. Cell Rep. Methods 2023, 3, 100430. [Google Scholar] [CrossRef]
  180. Dai, Y.H.; Deng, L.; Zhu, F. A model for predicting post-translational modification cross-talk based on the Multilayer Network. Expert Syst. Appl. 2024, 255, 124770. [Google Scholar] [CrossRef]
  181. Zhu, F.; Deng, L.; Dai, Y.H.; Zhang, G.Y.; Meng, F.W.; Luo, C.; Hu, G.; Liang, Z.J. PPICT: An integrated deep neural network for predicting inter-protein PTM cross-talk. Brief. Bioinform. 2023, 24, bbad052. [Google Scholar] [CrossRef]
  182. Deng, L.; Zhu, F.; He, Y.; Meng, F.W. Prediction of post-translational modification cross-talk and mutation within proteins via imbalanced learning. Expert Syst. Appl. 2023, 211, 118593. [Google Scholar] [CrossRef]
  183. Simpson, C.M.; Zhang, B.; Hornbeck, P.; Gnad, F. Systematic analysis of the intersection of disease mutations with protein modifications. BMC Med. Genom. 2019, 12, 109. [Google Scholar] [CrossRef]
Figure 1. Process of machine and deep learning methods. The framework of a PTM predictor built with machine learning or deep learning begins with acquiring data from existing databases, followed by pre-processing. After pre-processing, a machine learning model requires feature extraction and feature selection before a classifier is trained to complete the model. In contrast, deep learning methods do not require manual feature extraction, so a deep learning model can be used directly to construct the classifier. Finally, the model is assessed using various evaluation methods.
Figure 2. Schematic of data pre-processing. P represents phosphorylation sites, green lines represent positive samples (i.e., fragments containing modification sites), and yellow lines represent unlabeled samples (i.e., fragments without modification site information). Segmenting the protein sequences yields positive examples and unlabeled data. First, reliable negative examples are obtained, and then the dataset is balanced to produce a benchmark dataset.
Figure 2. Schematic of data pre-processing. P represents phosphorylation sites, green lines represent positive samples (i.e., fragments containing modification sites), and yellow samples represent unlabeled samples (i.e., those without modification site information). After segmenting the protein sequences, there are positive examples and unlabeled data available. First, reliable negative examples were obtained, and then the dataset was balanced to obtain a benchmark dataset.
Information 16 01023 g002
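A minimal sketch of this segmentation and balancing step, assuming 0-based site positions and 'X'-padding at the sequence termini (all names and parameters here are ours, not from any cited tool):

```python
# Illustrative sketch of the pre-processing in Figure 2.
import random

def extract_windows(sequence, known_sites, window=31, residues="STY"):
    """Cut fixed-length windows centred on candidate residues.

    Windows centred on experimentally verified sites become positive
    samples; all other candidate windows are 'unlabeled'. Termini are
    padded with 'X' so every window has the same length.
    """
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    positives, unlabeled = [], []
    for i, aa in enumerate(sequence):
        if aa in residues:
            frag = padded[i:i + window]  # window centred on residue i
            (positives if i in known_sites else unlabeled).append(frag)
    return positives, unlabeled

def balance(positives, unlabeled, seed=0):
    """Naive balancing: randomly down-sample the unlabeled pool to serve
    as negatives (real predictors select reliable negatives more carefully)."""
    random.seed(seed)
    negatives = random.sample(unlabeled, min(len(positives), len(unlabeled)))
    return positives, negatives

pos, unl = extract_windows("MKSTTAYLLSKRS", {2, 6}, window=7)
```

The down-sampling step stands in for the "reliable negative" selection described above; PU-learning or similarity filtering would slot in at the same point.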
Figure 3. Proportion of Research Literature on Modification Sites.
Figure 4. The percentage distribution of various feature extraction methods in the existing literature.
Table 2. Review of PTM prediction models in recent years.
| PTM | Tool | Dataset | Window Size | Feature Extraction | Classifier | ACC | AUC | MCC | Website | Ref. |
|---|---|---|---|---|---|---|---|---|---|---|
| Crotonylation | BERT-Kcr | Used by Lv et al. [163] | 31 | BERT | BiLSTM | 82.0% | 0.905 | 0.640 | http://zhulab.org.cn/BERT-Kcr_models/data | [10] |
| Lactylation | Auto-Kla | UniProt | 51 | Token embedding, position embedding, transformer encoder | AutoML, MLP | 91.21% ± 1.58% | 0.92 ± 0.0062 | 0.554 ± 0.023 | https://github.com/tubic/Auto-Kla | [33] |
| S-sulphenylation | DeepCSO | UniProtKB | 35 | NUM, EAAC, BE, AAindex, CKSAAP, PSSM | LSTM, CNN, RF, SVM | A. thaliana: 78.6% ± 0.7% | 0.852 ± 0.018 | 0.417 ± 0.032 | http://www.bioinfogo.org/DeepCSO | [37] |
|  |  |  |  |  |  | H. sapiens: 77.7% ± 0.6% | 0.822 ± 0.011 | 0.367 ± 0.028 |  |  |
| Phosphorylation | -- | dbPTM | 21 | AAindex, binary encoding, ASA, secondary structure, disordered regions, BP, MF, CC, protein functional and domain data from InterPro, KEGG pathway and functional annotation | RF, SVM | Serine: -- | 0.95 | 0.78 | -- | [44] |
|  |  |  |  |  |  | Threonine: -- | 0.97 | 0.77 |  |  |
|  |  |  |  |  |  | Tyrosine: -- | 0.99 | 0.57 |  |  |
| Succinylation | SSKM_Succ | Training: PLMD and UniProt; test: dbPTM | 21 | Information of proximal PTMs, Grey Pseudo Amino Acid Composition, K-Space, PSAAP | SVM, RF, NB | 80.18% | -- | 0.546 | https://github.com/yangyq505/SSKM_Succ.git | [47] |
| S-sulfenylation | SulSite-GTB | Carroll Lab, RedoxDB and UniProtKB | 21 | AAC, DPC, EBGW, KNN, PSAAP, PsePSSM, PWAAC | GTB | 88.53% | 0.94 | 0.77 | https://github.com/QUST-AIBBDRC/SulSite-GTB/ | [54] |
| Phosphoglycerylation | iDPGK | PLMD | 15 | AAC, PCAAC, AAPC, BLOSUM62, PSSM | DT, RF, SVM | 74.9% | -- | 0.49 | http://mer.hc.mmh.org.tw/iDPGK/ | [74] |
| Succinylation | CNN-SuccSite | PLMD 3.0 | 31 | PspAAC, CKSAAP, PSSM | CNN | 86.79% | -- | 0.489 | http://csb.cse.yzu.edu.tw/CNN-SuccSite/ | [76] |
| Glycosylation | PTG-PLM | UniProt | 31 | ProtBERT-BFD, ProtBERT, ProtALBERT, ProtXLNet, ESM-1b and TAPE | CNN, SVM, LR, RF and XGBoost | N-gly site: 96.5% | 0.978 | 0.902 | https://github.com/Alhasanalkuhlani/PTG-PLM | [115] |
|  |  |  |  |  |  | K-gly site: 64% | 0.64 | 0.28 |  |  |
| Formylation | LFPred | UniProt, PLMD and dbPTM | 41, information entropy | AAC, BPF, AAI | KNN | 79.3% | -- | 0.55 | -- | [135] |
| S-sulfenylation | S-Sulfenylation | Xu et al. [164] and Hasan et al. [141] | 21 | PseAAC, SVV, SM, PRIM, R-PRIM, FV, AAPIV, RAAPIV | BP-NN | 96.89% | 0.931 | 0.862 | https://www.github.com/ahmad-umt/S-Sulfenylation | [137] |
| Sumoylation | SUMO-Forest | UniProt | 21 | PSAAP, PseAAC, SP, BK | Cascade Forest | Cost-matrix-based: 98.69% | 0.98 | 0.89 | https://github.com/sandyye666/SUMOForest | [144] |
|  |  |  |  |  |  | F-measure-based: 98.54% | 0.99 | 0.89 |  |  |
| -- | -- | -- | -- | -- | -- | -- | 0.797 | 0.287 | -- | -- |
|  |  |  |  |  |  | Sumoylation: -- | 0.868 | 0.431 |  |  |
| Crotonylation | -- | Verified Kcr sites on non-histone proteins from papaya | 2 to 37 | BE, CKSAAP, AAC, EAAC, EGAAC | CNN | 85.64% | 0.853 | 0.335 | http://www.bioinfogo.org/pkcr | [150] |
| Crotonylation | DeepKcrot | Collected from [165,166,167] | 29 | EGAAC, WE | LSTM, CNN, RF | RF + EGAAC: 0.851 | 0.784 | 0.228 | http://www.bioinfogo.org/deepkcrot | [151] |
|  |  |  |  |  |  | LSTM + WE: 0.860 | 0.839 | 0.306 |  |  |
|  |  |  |  |  |  | CNN + WE: 0.869 | 0.861 | 0.338 |  |  |
| Propionylation | -- | PLMD and UniProt | 17 | RNN, LSTM | Transfer learning, SVM | -- | 0.705 | 0.317 | http://47.113.117.61/ | [153] |
| Succinylation | HybridSucc | PLMD 3.0, PhosphoSitePlus and dbPTM | -- | PseAAC, CKSAAP, OBC, AAindex, ACF, GPS, PSSM, ASA, SS and BTA | DNN, PLR | -- | 0.885 | -- | http://hybridsucc.biocuckoo.org/ | [156] |
| Nitrosylation | Mul-SNO | Training set: Li et al. [168]; independent test set: DeepNitro | 31 | BiLSTM, BERT | RF, LightGBM, XGBoost | 80% | 0.80 | 0.59 | http://lab.malab.cn/~mjq/Mul-SNO/ | [154] |
| Phosphorylation | PROSPECT | UniProt | 27 | one-of-K, EGAAC and CKSAAGP | CNN (one-of-K); RF (EGAAC and CKSAAGP) | -- | 0.821 | 0.37 | http://PROSPECT.erc.monash.edu/ | [157] |
| Crotonylation | DeepMM-Kcr | Same as those used by Lv et al. [163] | 31 | Token embedding, positional embedding, one-hot, AAindex, PWAA | Transformer | 85.56% | 0.9310 | 0.7119 | https://github.com/yunyunliang88/DeepMM-Kcr | [160] |
| Neddylation | DTL-NeddSite | From the literature | 41 | EAAC, one-hot, WE | Transfer learning | -- | 0.818 | -- | https://github.com/XuDeli123/DTL-NeddSite | [161] |
| Multi-PTM | DeepPTM | CPLM | 21 | ProtBERT | ViT | -- | 0.793 (succinylation) | -- | https://github.com/seferlab/deepptm | [162] |
| Glutarylation | iGlu_AdaBoost | Al-barakati et al. [169], from PLMD, NCBI and SWISS-PROT | 23 | 188D, CKSAAP and EAAC | AdaBoost | 72.07% | 0.63 | 0.36 | -- | [170] |
| Malonylation | Kmalo | PLMD and LEMP | 11–39 | AAC, one-hot encoding, PseAAC, AAindex, PSSM | Hybrid models combining multiple CNNs, random forests and SVMs | Mammalian proteins: 86.6% | 0.943 | 0.480 | https://fdblab.csie.ncu.edu.tw/kmalo/home.html | [171] |
|  |  |  |  |  |  | Plant proteins: 69.1% | 0.772 | 0.195 |  |  |
| Ubiquitination | DeepTL-Ubi | PhosphoSitePlus, mUbiSida and PLMD | 31 | one-hot | Transfer deep learning | M. musculus: 60.4% | -- | -- | https://github.com/USTC-HIlab/DeepTL-Ubi | [172] |
|  |  |  |  |  |  | A. nidulans: 67.9% | -- | -- |  |  |
|  |  |  |  |  |  | T. gondii: 55.6% | -- | -- |  |  |
| Phosphorylation | -- | iPhos-PseEn | 13 | BE | CNN, BLSTM | Phosphoserine (S): 92.7% | 0.996 | 0.582 | -- | [173] |
|  |  |  |  |  |  | Phosphothreonine (T): 91.4% | 0.994 | 0.501 |  |  |
|  |  |  |  |  |  | Phosphotyrosine (Y): 93.6% | 0.995 | 0.488 |  |  |
| Phosphorylation | DF-Phos | dbPAF and Phospho.ELM | 33 | CTD, DDE, EAAC, EGAAC, a series of PseKRAAC, GrpDDE, kGAAC, LocalPoSpKaaF, QSOrder, SAAC, SOCNumber, ExpectedValueGKmerAA, ExpectedValueKmerAA, ExpectedValueGAA, ExpectedValueAA | Deep Forest | 78% | -- | 0.51 | https://github.com/zahiriz/DF-Phos | [174] |
| Crotonylation | -- | UniProt and pkcr | 31 | AAC, AAPC, BE, CKSAAP, EAAC, EGAAC and PSSM | SVM, RF | 90% | -- | 0.80 | -- | [175] |
| Lactylation | DeepKla | Previous research, Botrytis cinerea | 51 | Embedding | CNN, RNN | 93.59% | -- | 0.8783 | http://lin-group.cn/server/DeepKla/ | [176] |
| Lactylation | DeepKlapred | Previous research [176] | 51 | Position embedding, QSOrder, CTD, DDE, DistancePair | Transformer | 96.9% | -- | 0.938 | https://awi.cuhk.edu.cn/~biosequence/DeepKlapred | [177] |
| Acetylation | TransPTM | UniProt | 25 | One-hot, ProtT5 | Transformer | 88% | 0.83 | 0.45 | https://www.github.com/TransPTM/TransPTM | [159] |
-- indicates no relevant content; NUM: Numerical Representation for Amino Acid; ASA: Solvent accessible area; AAC: amino acid composition; DPC: dipeptide composition; EBGW: encoding based on grouped weight; KNN: k nearest neighbors; PSAAP: position-special amino acid propensity; PsePSSM: Pseudo-position specific scoring matrix; PWAAC: Position weight amino acid composition; PseAAC: pseudo amino acid composition; SP: statistics property; BK: bi-gram and k-skip-bi-gram; PspAAC: position-specific amino acid composition; BPF: binary profile feature; AAI: amino acid index; AAPC: amino acid pair composition; PSSM: Position specific scoring matrix; B62: BLOSUM62; GPAAC: Grey Pseudo Amino Acid Composition; SVV: site vicinity vector; SM: statistical moments; PRIM: position relative incident matrix; R-PRIM: reverse position relative incident matrix; FV: frequency vector; AAPIV: accumulative absolute position incidence vector; RAAPIV: reverse accumulative absolute position incidence vector; DBPB: di-amino acid BPB; DDE: Dipeptide Deviation from Expected Mean value; EAAC: Enhanced Amino Acid Composition; EGAAC: Enhanced Grouped Amino Acid Composition; PseKRAAC: Pseudo K-tuple Reduced Amino Acid Composition; GrpDDE: Group Dipeptide Deviation from Expected Mean; kGAAC: k Grouped Amino Acid Composition; LocalPoSpKaaF: Local Position Specific k Amino Acids Frequency; QSOrder: Quasi Sequence Order; SAAC: Split Amino Acid Composition; SOCNumber: Sequence Order Coupling Number; ExpectedValueKmerAA: Expected Value for K-mer Amino Acid; ExpectedValueGAA: Expected Value for each group Amino Acid; ExpectedValueAA: Expected Value for each Amino Acid; BP: biological process; MF: molecular function; CC: cellular component.
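The ACC and MCC columns in Table 2 follow the standard confusion-matrix definitions, sketched below for reference (the function names are ours):

```python
# Standard confusion-matrix metrics reported throughout Table 2.
import math

def accuracy(tp, tn, fp, fn):
    """ACC: fraction of correctly classified candidate sites."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when any marginal
    sum is zero (the conventional fallback)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, 90 true positives, 85 true negatives, 15 false positives and 10 false negatives give ACC = 0.875 and MCC ≈ 0.75; MCC is the stricter summary on the imbalanced benchmarks common in PTM prediction.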
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gong, S.; Qu, K. Role of Machine and Deep Learning in Predicting Protein Modification Sites: Review and Future Directions. Information 2025, 16, 1023. https://doi.org/10.3390/info16121023

