1. Introduction
Post-translational modification (PTM) is one of the most important processes in biology. More than 650 types of PTM have been reported over several decades of research, and several of them are reversible. PTMs provide fine-tuned control of protein function in many cell types, which makes them relevant to disease research and drug design [
1,
2,
3,
4]. For example, the well-known tumor suppressor p53 is subject to many post-translational modifications that can alter its localization, stability, and other functions, ultimately modulating its response to various forms of genotoxic stress [
5,
6,
7,
8,
9,
10]. Therefore, p53 drives both the activation and repression of a large number of promoters, which ultimately defines its tumor suppressor activity. This tumor suppressor is thus a key transcription factor in PTM research [
11]. Reversible modifications change protein structures and, to some degree, extend protein functions. Lysine acetylation, one of the most typical and classical reversible modifications, was reported about half a century ago [
1,
2]. Acetylation occurs on the ε-amino group of lysine residues; it was noted that three enzymes take part in this process. Whereas lysine deacetylases (KDACs) remove acetyl groups from proteins, lysine acetyltransferases (KATs) transfer acetyl groups onto proteins [
3,
4,
5,
6]. Considering the key role of lysine acetylation in several diseases and in drug design, many experimental approaches have been proposed to identify the acetylation sites of lysine residues in protein sequences. These approaches, including radioactive chemical methods, chromatin immunoprecipitation (ChIP), and mass spectrometry, are useful to varying degrees [
7,
8]. Unfortunately, these experimental methods are time-consuming and expensive, and they can hardly keep up with the demand for site identification. Effective computational methods are therefore urgently needed to identify acetylation sites, especially given the increasing number of protein resources.
When it comes to computational biology methods, several classical methods were introduced in the field of protein sequence processing [
12,
13,
14,
15,
16]. Meanwhile, with the development of machine learning and artificial intelligence, some computational methods were proposed and designed to deal with similar issues at the DNA, RNA and protein levels [
17,
18,
19]. Several milestone efforts were made in the identification of lysine modification sites. For instance, Xu et al. made use of a support vector machine (SVM) to identify lysine acetylation sites with ensemble information [
20]. PLMLA (prediction of lysine methylation and lysine acetylation by combining multiple features), which was designed by Shi et al. in 2012, utilized information about protein sequences and secondary structure to determine whether lysine residues were modified or not [
21]. In the same year, PSKAcePred (Position-Specific Analysis and Prediction for Protein Lysine Acetylation Based on Multiple Features), which was proposed by Suo et al., was based on amino-acid composition and physicochemical properties to quantify protein segments [
22]. Meanwhile, Shao et al. proposed BRABSB (bi-relative adapted binomial score Bayes), which made use of a binomial score Bayes approach [
23]. Subsequently, SSPKA (species-specific lysine acetylation prediction), based on the random forest (RF) model, was proposed in 2014 to deal with such modification sites. Two years later, Wu et al. designed a novel approach named KA-predictor (Improved Species-Specific Lysine Acetylation Site Prediction) that utilized many different kinds of features to identify cases of lysine modification [
24]. Overall, models for the effective identification of modification sites consist of two parts. The first part is feature description, which focuses on an effective method of showing protein sequence information or peptide segment information in several different aspects. The second part is the construction of the machine learning model, which aims to deal with different types of protein sequences or peptide segments with high accuracy and generalizability. The abovementioned methods, among others (PLMLA, Phosida, LysAcet, EnsemblePail, PSKAcePred, BRABSB, and SSPKA), can be regarded as the state of the art in this field.
Relationships among amino-acid residues need to be effectively described at the protein level. These relationships have the ability to demonstrate the local information of amino-acid residues in some peptide segments, and can be helpful in constructing more useful information with regards to the identification of modification sites. Some related work was proposed in DNA and RNA analysis [
25,
26,
27,
28,
29,
30,
31]; methods such as DeepBind and DeepSea take advantage of deep convolutional neural networks (CNNs) to predict the sequence specificities of DNA-binding proteins [
32,
33,
34,
35]. In summary, these sequence analysis problems can be addressed using computational biology.
When it comes to the abovementioned issues, Chou proposed five steps for dealing with them [
35,
36,
37,
38]. In the first step, available benchmark datasets should be selected, which are used to train and test machine learning models. In the second step, available methods for sequence quality expression should be selected. In the third step, an available algorithm should be used to identify positive and negative samples. In the fourth step, validation methods for evaluating the performances of the proposed methods should be selected. In the final step, a web resource should be constructed to detail the workflow, along with related raw data. Therefore, in this paper, we introduce a method for the identification of lysine acetylation sites following these steps.
In this article, we propose the lysine acetylation site identification with polynomial tree method (LAIPT), which uses polynomial functions to describe the relationships among amino-acid residues in peptide segments. These polynomial descriptions were enriched with the physico-chemical properties of amino-acid residues. The resulting features were then input into the classification model, the flexible neural tree (FNT). Finally, several performance measures were employed to test the model. The website for this work is available at
http://121.250.173.184/.
3. Materials and Methods
Because of the ubiquity and universality of lysine acetylation at the protein level, we can find several acetylated proteins in various databases, including NCBI (National Center for Biotechnology Information), Uniprot, and other related proteomics databases. In this study, we selected about 30,000 protein sequences, which contain more than 111,200 acetylation sites among them [
49]. These proteins could be extracted from the Protein Lysine Modification Database (PLMD) version 3.0 [
50]. PLMD is one of the most well-known and commonly used post-translational modification site databases, and it contains more than 20 types of lysine modification in more than 170 species at the protein level. This database can be regarded as the largest available acetylation database; thus, it was employed as the benchmark dataset in this work. However, homologous sequences in a benchmark dataset can cause the performance of a machine learning model to be overestimated. To overcome this shortcoming, CD-HIT (Cluster Database at High Identity with Tolerance) was utilized to remove homologous sequences [
51,
52,
53,
54]. In this work, we utilized a threshold of 40% similarity with this tool. Following this process, we obtained 59,532 proven acetylated modification sites from 20,527 protein sequences. These protein data were used to construct the training, testing, and independent datasets. For classification, we defined the proven acetylated sites as positive samples and the lysine sites without proven acetylation as negative samples. Detailed information of the employed datasets is shown in
Table 9, and details with regards to the construction of datasets are shown in
Figure 4.
In this work, we employed the general dataset as the training and testing datasets. To evaluate generalization and stability, we employed the lysine acetylation sites of three species as independent datasets.
After constructing the available datasets, some peptide segments were extracted from the whole protein sequences. In order to reduce the unnecessary usage of storage space and computational resources, some peptides with a central lysine residue were extracted in this work. We made use of sliding windows to extract peptide segments with a size of 2
n + 1 [
55], where
n is the length of the upstream or downstream fragment, and the additional 1 accounts for the central lysine residue in the segment. In this work, the length of the upstream fragment was equal to that of the downstream fragment, and
n ranged from 10 to 15. Thus, the whole length of the sliding window was between 21 and 31. In the next section, we discuss the performances of the various selected lengths of sliding window.
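The window extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function name `extract_windows`, the 0-based positions, and the padding symbol `X` for lysines that lie fewer than n residues from a terminus are all hypothetical, since the text does not state how such cases were handled.

```python
from typing import Iterable, List, Tuple

PAD = "X"  # assumed padding symbol for positions beyond the sequence termini


def extract_windows(
    sequence: str, acetylated: Iterable[int], n: int = 10
) -> List[Tuple[str, int, int]]:
    """Return (segment, position, label) for every lysine in `sequence`.

    Each segment has length 2n + 1 and is centered on a lysine (K).
    `position` is 0-based; `label` is 1 if the position is listed in
    `acetylated` (a proven site) and 0 otherwise.
    """
    known = set(acetylated)
    padded = PAD * n + sequence + PAD * n
    samples = []
    for pos, residue in enumerate(sequence):
        if residue != "K":
            continue
        # Residue `pos` sits at index pos + n in `padded`, so the slice
        # [pos, pos + 2n + 1) is centered on it.
        segment = padded[pos:pos + 2 * n + 1]
        samples.append((segment, pos, 1 if pos in known else 0))
    return samples


# Toy sequence with lysines at positions 1, 7, and 15; only 7 is "proven".
for segment, pos, label in extract_windows("MKTAYIAKQRQISFVKSHFSRQ", [7], n=3):
    print(pos, segment, label)
```

With n between 10 and 15, the same call yields segments of 21 to 31 residues, matching the window sizes considered in this work.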
3.1. Encoding of Protein Fragments
Several different types of features for quantifying biological sequences were presented across many years of protein research, such as amino-acid composition, position-specific scoring matrix, physico-chemical properties, and other related features [
56,
57,
58]. These features describe sequence information from various aspects, and they play various roles in protein sequence analysis. However, few features can describe the relationships among amino-acid residues. In this paper, each peptide was treated as a sample. According to biological concepts, neighboring amino-acid residues present both coordination and individual functions. On this basis, we tried utilizing some of these functions to describe the relationships in this work.
We propose a polynomial method to describe the relationships between the central lysine residue and the neighboring amino-acid residues. Several forms of polynomial styles exist, such as the constant form, linear function form, quadratic function form, cubic function form, and so on. For example, we show the curves of these four forms in
Figure 5.
In
Figure 5, L1, L2, L3, L4, and L5 follow Equations (4)–(8), respectively.
From
Figure 5, we can easily determine that both L2 and L4 are even functions, while the other curves are odd functions. Considering that the upstream and the downstream fragments played the same role in the selected peptide segments, the even functions were selected for this work; therefore, we utilized three types of functions. The first one was the constant function, whereby all amino-acid residues in the peptide segments have the same influence, as described in Equation (1). The second function followed Equation (2), and the third function followed Equation (3).
where the parameters
a1,
a2,
b1,
b2, and
c1 were optimized in this work. It was noted that both Equations (1) and (2) can hardly be described as linear functions. Thus, the center of the last two functions was designated as the origin, i.e., the site to be classified in the peptide segment. The regions to the left and right of this origin were designated as the upstream and downstream segments, respectively. The influence of each neighboring amino-acid residue is defined below.
According to Equation (1), the relationship between a neighboring amino-acid residue and the central lysine is shown in Equation (9).
where influ1 contains 2
n + 1 elements in each sample, and
c1 is the relationship between each amino-acid residue in the selected peptide segment. In this function, every amino-acid residue has the same influence; thus, the amino-acid composition can be regarded as a special form of this style.
According to Equation (2), the relationship between the neighboring and central residues is shown in Equation (10).
where influ2 also contains 2
n + 1 elements, and each value of influ2 follows the discrete values of Equation (2) and has the range [−
n,
n].
According to Equation (3), the relationship between two amino-acid residues is shown in Equation (11).
where influ3 also contains 2
n + 1 elements, and each value of influ3 follows the discrete values of Equation (3) and has the range [−
n,
n].
After describing the fundamental relationships of amino-acid residues within the classified peptide, the next step was to enrich them with the properties of amino-acid residues. In this step, physical, chemical, evolutionary, structural, and other related information was combined with the three styles proposed above.
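The three influence styles and their enrichment can be illustrated with a short sketch. Several things here are assumptions rather than the method of this work: the even functions are taken to be of the form a·x^d + b with d = 2 or d = 4, the enrichment is taken to be an element-wise product of the positional weight and a per-residue property value, and the names `constant_influence`, `even_influence`, and `encode` are hypothetical. The property values are Kyte-Doolittle hydropathy values for four residues, used only as an example and not necessarily one of the properties selected in this work.

```python
from typing import Dict, List


def constant_influence(n: int, c1: float) -> List[float]:
    """Constant style: all 2n + 1 positions carry the same weight c1."""
    return [c1] * (2 * n + 1)


def even_influence(n: int, a: float, b: float, degree: int = 2) -> List[float]:
    """Even-polynomial style (assumed form a * x**degree + b).

    x runs over the discrete positions -n..n, with the central lysine at
    x = 0, so upstream and downstream positions receive mirrored weights.
    """
    if degree <= 0 or degree % 2 != 0:
        raise ValueError("degree must be a positive even integer")
    return [a * (x ** degree) + b for x in range(-n, n + 1)]


def encode(
    segment: str, prop: Dict[str, float], weights: List[float], default: float = 0.0
) -> List[float]:
    """Weight each residue's property value by its positional influence.

    The element-wise product is an assumed way of combining the two.
    """
    if len(segment) != len(weights):
        raise ValueError("segment and weights must have the same length")
    return [w * prop.get(res, default) for res, w in zip(segment, weights)]


# Illustrative property: Kyte-Doolittle hydropathy for a few residues.
hydropathy = {"A": 1.8, "G": -0.4, "K": -3.9, "L": 3.8}

weights = even_influence(n=2, a=-0.1, b=1.0, degree=2)  # decays away from K
print(weights)
print(encode("AGKLA", hydropathy, weights))
```

In this sketch a negative `a` makes the influence decay with distance from the central lysine; in this work the corresponding parameters were optimized rather than fixed by hand.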
3.2. Physico-Chemical Properties
Physico-chemical properties are widely and successfully utilized in the identification of protein post-translational modifications, including ubiquitination, phosphorylation, and others [
59,
60]. These properties can help determine the fundamental characteristics of proteins in several aspects. One of the most well-known and widely utilized databases is AAIndex [
61,
62], which contains a great deal of physico-chemical and biochemical information for each amino-acid residue and some amino-acid compositions. The latest version of this database describes 544 properties of amino acid residues. Among these properties, following previous efforts and research [
62], we selected several of them, which are listed in
Table 10.
To limit the amount of uninformative information among these properties, the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate them in this work.
3.3. Prediction Algorithm
The computational identification of modification sites focuses on classification models in the field of machine learning. In this work, we employed machine learning models, including the flexible neural tree. We employed three machine learning methods for the three elements in the classification. The first element involved the bandwidth of the sliding windows in the classified peptide segments, the second involved the parameters of polynomial feature description, and the third involved the selection of different combinations. Therefore, the classification model was designed to deal with these three elements; the detailed outline of this algorithm is demonstrated in
Figure 6.
The flexible neural tree (FNT) was proposed by Chen [
63,
64], and it can be treated as an alternative tree neural network. Therefore, this model can be utilized to deal with the issues of classification and prediction in the field of machine learning. The typical structure of an FNT is shown in
Figure 7.
From the above figure, we can easily determine that the model contains three types of layers—the input layer, the hidden layer, and the output layer. The network function of this model is shown in Equations (12) and (13).
where
wj is the weight of the
j-th input element, and
yj is the
j-th element of the input sample. Both
mi and
ni are parameters in this network.
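A single node of such a tree can be sketched as follows. The Gaussian-type activation used here is the form commonly given for flexible neurons in the FNT literature and is an assumption about Equations (12) and (13), which are not reproduced in this text; the function name `flexible_neuron` and all numeric values are hypothetical.

```python
import math
from typing import Sequence


def flexible_neuron(
    inputs: Sequence[float], weights: Sequence[float], m_i: float, n_i: float
) -> float:
    """Output of one flexible neuron.

    net = sum_j w_j * y_j is the weighted input; the assumed activation is
    exp(-((net - m_i) / n_i) ** 2), with m_i and n_i as node parameters.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if n_i == 0:
        raise ValueError("n_i must be non-zero")
    net = sum(w * y for w, y in zip(weights, inputs))
    return math.exp(-(((net - m_i) / n_i) ** 2))


# A tiny tree: two hidden nodes feed the root together with one raw input.
x = [0.2, 0.7, 0.4]
h1 = flexible_neuron([x[0], x[1]], [1.0, -0.5], m_i=0.0, n_i=1.0)
h2 = flexible_neuron([x[1], x[2]], [0.3, 0.8], m_i=0.5, n_i=2.0)
root = flexible_neuron([h1, h2, x[2]], [0.6, 0.4, -0.2], m_i=0.5, n_i=1.0)
print(root)
```

Because each node may take a different subset of inputs, the tree structure itself acts as a form of input selection, which is one reason the model is described as flexible.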
3.4. Performance Measurements
Several well-known measures exist in the field of machine learning for evaluating performance. In this work, some typical measures, including sensitivity, specificity, accuracy, F1 score, and Matthews correlation coefficient (MCC) [
65,
66], of the identified modification sites were used. Furthermore, the AUC [
67] was also employed to test performance on imbalanced classification problems, in which negative samples far outnumber positive samples.
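The AUC has a direct interpretation that explains why it suits imbalanced data: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, so it does not depend on the class ratio. A minimal sketch of this pairwise definition is given below; the function name `roc_auc` is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two positives and four negatives: 7 of the 8 pairs are ranked correctly.
print(roc_auc([1, 0, 0, 0, 1, 0], [0.9, 0.1, 0.4, 0.35, 0.8, 0.85]))
```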
In this classification problem, samples are of two types: positive samples and negative samples. Positive samples are peptide segments in which the central lysine is acetylated, while negative samples are peptide segments in which the central lysine is not. Each prediction therefore has one of four outcomes. A positive sample predicted as positive is a true positive (
TP). A positive sample predicted as negative is a false negative (
FN). Likewise, a negative sample predicted as negative is a true negative (
TN), and a negative sample predicted as positive is a false positive (
FP). According to the number of
TP,
TN,
FP, and
FN, we can easily obtain measures of sensitivity, specificity, accuracy, F1 scores, and MCC.
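In their standard confusion-matrix form, which is assumed here, the five measures are:

```latex
\begin{align}
Sn  &= \frac{TP}{TP + FN} = \frac{TP}{P} \\
Sp  &= \frac{TN}{TN + FP} = \frac{TN}{N} \\
Acc &= \frac{TP + TN}{P + N} \\
F1  &= \frac{2\,TP}{2\,TP + FP + FN} \\
MCC &= \frac{TP \cdot TN - FP \cdot FN}
            {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```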
where
P is the number of positive samples and
N is the number of negative samples. However, Equations (14)–(18) are not intuitive for many researchers in the field of biology. The MCC in particular is hard to interpret in this form, although it plays a key role in evaluating the stability of a classification model. Therefore, we adopted the formulation proposed by Chou at the beginning of this century. In this formulation, the total number of positive samples is denoted
N+, and the total number of negative samples is denoted
N−. The number of positive samples misclassified as negative is denoted $N^{+}_{-}$, and the number of negative samples misclassified as positive is denoted $N^{-}_{+}$. With this definition,
TP,
TN,
FP, and
FN can be described in Equations (19)–(22).
Thus, the abovementioned measurements can be newly defined as Equations (23)–(27).
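With $N^{+}_{-}$ denoting the number of positive samples misclassified as negative and $N^{-}_{+}$ the number of negative samples misclassified as positive, this formulation is usually written as follows. The subscript notation for the two misclassification counts is assumed here, and the F1 expression is obtained by direct substitution into its standard definition.

```latex
\begin{align}
TP &= N^{+} - N^{+}_{-}, \qquad TN = N^{-} - N^{-}_{+}, \qquad
FP = N^{-}_{+}, \qquad FN = N^{+}_{-} \\[6pt]
Sn  &= 1 - \frac{N^{+}_{-}}{N^{+}} \\
Sp  &= 1 - \frac{N^{-}_{+}}{N^{-}} \\
Acc &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\
F1  &= \frac{2\,(N^{+} - N^{+}_{-})}{2N^{+} - N^{+}_{-} + N^{-}_{+}} \\
MCC &= \frac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}
            {\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right)
                   \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}}
\end{align}
```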
The performance metrics in Equations (23)–(27) are far more intuitive and easier to understand for biological researchers. For instance, when all samples are correctly classified, i.e., all positive samples are predicted as positive and all negative samples are predicted as negative, we get $N^{+}_{-} = 0$ (no misclassified positive samples) and $N^{-}_{+} = 0$ (no misclassified negative samples), and the sensitivity and specificity are both equal to 1. Meanwhile, the accuracy is equal to 1 and MCC is also equal to 1 in such a situation. On the contrary, if all positive samples are predicted as negative and all negative samples are predicted as positive, then $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, and the sensitivity and specificity are both equal to 0. Furthermore, the accuracy is equal to 0, and the MCC is equal to −1 in this situation. For a random classifier, $N^{-}_{+} = 0.5N^{-}$ and $N^{+}_{-} = 0.5N^{+}$. Thus, the accuracy is equal to 0.5 and MCC is equal to 0 in this situation. This definition method has several advantages [
68,
69,
70,
71]; however, these five measures alone are not sufficient for evaluating imbalanced classification. Therefore, we also made use of ROC and precision-recall curves. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) of the classifier, while the precision-recall curve plots precision against recall.
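The three limiting cases discussed above (perfect, fully inverted, and random classification) can be checked numerically. The sketch below assumes the usual statement of the Chou-style formulation; the function name `chou_metrics` and its argument names are hypothetical, with `pos_missed` standing for the number of positive samples predicted as negative and `neg_missed` for the number of negative samples predicted as positive.

```python
import math
from typing import Dict


def chou_metrics(n_pos: int, n_neg: int, pos_missed: int, neg_missed: int) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, F1, and MCC from four counts.

    n_pos, n_neg : total numbers of positive and negative samples
    pos_missed   : positives predicted as negative (false negatives)
    neg_missed   : negatives predicted as positive (false positives)
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes must be present")
    tp = n_pos - pos_missed
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_missed / n_neg
    acc = 1 - (pos_missed + neg_missed) / (n_pos + n_neg)
    f1_den = 2 * tp + neg_missed + pos_missed
    f1 = 2 * tp / f1_den if f1_den else 0.0
    root = math.sqrt(
        (1 + (neg_missed - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_missed) / n_neg)
    )
    # MCC is undefined when the denominator vanishes; 0.0 is returned here.
    mcc = (1 - (pos_missed / n_pos + neg_missed / n_neg)) / root if root else 0.0
    return {"sn": sn, "sp": sp, "acc": acc, "f1": f1, "mcc": mcc}


# Imbalanced toy set: 100 positive and 400 negative samples.
print(chou_metrics(100, 400, 0, 0))      # perfect classification
print(chou_metrics(100, 400, 100, 400))  # every prediction inverted
print(chou_metrics(100, 400, 50, 200))   # half of each class misclassified
```

These threshold-based values describe a single operating point, whereas the ROC and precision-recall curves summarize behavior across all thresholds.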