Attenphos: General Phosphorylation Site Prediction Model Based on Attention Mechanism

Phosphorylation site prediction has important application value in the field of bioinformatics. It can act as an important reference and help with protein function research, protein structure research, and drug discovery. So, it is of great significance to propose scientific and effective calculation methods to accurately predict phosphorylation sites. In this study, we propose a new method, Attenphos, based on the self-attention mechanism for predicting general phosphorylation sites in proteins. The method not only captures the long-range dependence information of proteins but also better represents the correlation between amino acids through feature vector encoding transformation. Attenphos takes advantage of the one-dimensional convolutional layer to reduce the number of model parameters, improve model efficiency and prediction accuracy, and enhance model generalization. Comparisons between our method and existing state-of-the-art prediction tools were made using balanced datasets from human proteins and unbalanced datasets from mouse proteins. We performed prediction comparisons using independent test sets. The results showed that Attenphos demonstrated the best overall performance in the prediction of Serine (S), Threonine (T), and Tyrosine (Y) sites on both balanced and unbalanced datasets. Compared to current state-of-the-art methods, Attenphos has significantly higher prediction accuracy. This proves the potential of Attenphos in accelerating the identification and functional analysis of protein phosphorylation sites and provides new tools and ideas for biological research and drug discovery.


Introduction
Protein is an important living organism component, regulating life activities, controlling growth and energy metabolism, etc. Post-translational modifications (PTMs) are important cellular control mechanisms that increase the functional diversity of the proteome through the covalent addition of functional groups or proteins, the hydrolytic cleavage of regulatory subunits, or the degradation of whole proteins [1].Protein phosphorylation is the most studied PTM, and the most studied phosphorylation to date is on serine (S), threonine (T), and tyrosine (Y).Phosphorylation site prediction can help researchers understand protein function and interactions, particularly in the areas of signal transduction and metabolic pathways [2].For example, it can help researchers to study the relationship between proteins and biological processes such as cell cycle, apoptosis, and cell proliferation.And phosphorylation site prediction can provide important information for protein structure studies, as phosphorylation can change the conformation and charge of proteins and can help researchers predict protein secondary structures, tertiary structures, and interactions [3].In addition, some studies have shown that more than 30% of eukaryotic proteins can be phosphorylated, and most of these proteins are closely associated with the development of human diseases, such as cancer [4].Therefore, the study of the principles, types, and mechanisms of protein phosphorylation is important for understanding the pathogenesis of human diseases.Research on protein phosphorylation site prediction can also help to identify new drug targets, thus providing new ideas and directions for disease treatment.
Early researchers performed many bioassay identification methods to identify phosphorylation sites, such as low-throughput 32P labeling [5] and high-throughput mass spectrometry [6], but such methods are quite time-consuming and laborious.With the rapid development of natural language processing, artificial intelligence, and other disciplines, and the integration of various new technologies with bioinformatics, the difficulties of traditional experimental methods have been effectively solved.Dozens of different methods, including machine learning, statistics, feature extraction, and other techniques [7][8][9], have been used to predict and identify phosphorylation sites.For example, Musite, proposed in 2010, used multiple features and algorithms to improve prediction accuracy by using local amino acid sequence frequencies, k-nearest neighbor features, and protein disorder scores.This was followed by MusiteDeep [10], the first deep learning framework for predicting general and kinase-specific phosphorylation sites, which was proposed by the team in 2017, using a convolutional neural network with a novel two-dimensional attention mechanism for prediction.In 2019, Fenglin Luo et al. proposed DeepPhos [11], which utilizes densely connected convolutional neuron network blocks to capture multiple representations of sequences.Final phosphorylation prediction was performed through intra-block crosstalk layers and inter-block crosstalk layers.These deep learning architectures for phosphorylation site prediction are far superior to traditional experimental methods.However, there is still room for improvement in prediction accuracy and prediction efficiency.It is essential to develop a prediction tool that performs well at S, T, and Y sites.
These deep learning architectures for phosphorylation site prediction are far superior to traditional experimental methods.However, there are more negative samples than positive samples for almost all PTM types, including protein phosphorylation [10], which makes the performance of these methods limited.By training with unbalanced datasets, the model can have better generalization ability in real situations.In addition, there is still a lot of room for improvement in the generalization ability of these models, which are not yet able to achieve a high level of prediction performance on S, T, and Y sites with small sample sizes at the same time.Therefore, there is still room for further improvement in terms of prediction accuracy and efficiency.Therefore, it has become crucial to develop a prediction tool that can perform well on both balanced and unbalanced datasets with higher accuracy and better generalization ability.
In response to the above challenges, this study proposes a new deep learning framework to accurately predict general phosphorylation sites using protein sequence information with good robustness and generalization ability.Unlike previous deep learning prediction methods, Attenphos consists of self-attention and CNN blocks that use three structurally identical modules with different window sizes to capture significant sequence representations of protein phosphorylation sites.The feature vector output of the attention module is converted using the Encoding conversion module, increasing the representation capability of the neural network.The transformed coded feature matrix has more structural information and learns features more fully than if the output features were processed directly.The final three outputs are further integrated to make the final prediction.To evaluate the performance of Attenphos, we collected a large number of validated phosphorylation examples for model training and evaluation, and we conducted experimental comparisons with some other prediction tools and deep learning methods.The experimental results show that our proposed method outperforms some existing prediction tools on both balanced and unbalanced datasets.Attenphos achieves the best overall prediction performance on T, S, and Y sites with small sample sizes.

Comparison Experiments on Balanced Datasets
In this section, we provide a discussion of the comparative performance of Attenphos for general phosphorylation site prediction using a balanced dataset from human protein phosphorylation.We compared Attenphos with several existing general phosphorylation site prediction tools, including GPS 2.1 [12], NetPhos 3.0 [13], PPRED [14], Musite [15], PhosphoSVM [16], SKIPHOS [17], Deepphos [11], and Transphos [18].We performed 10-fold cross-validation to train the model using the training datasets of the S, T, and Y sites.We tested it using an independent PPA test set.As shown in Table 1, the results indicate that our method obtains the best overall performance in S and T bit prediction.And on the Y dataset with a smaller sample size, our method still outperforms existing state-of-the-art deep learning architectures to obtain the best overall performance.

Comparison Experiments on Unbalanced Datasets
As shown in Table 2, we conducted performance comparison experiments on a dataset with positive and negative sample unbalance distribution, which is a phosphorylation dataset from mouse proteins.In order to assess the generalization capability of our model on unbalanced datasets with limited samples, we specifically selected the advanced deep learning models Transphos and Deepphos for a focused comparison.Both Transphos and Deepphos perform well on the balanced dataset.We trained and validated individual models, including Deepphos, Transphos, and Attenphos.Then, we tested them using independent test sets.Considering the unbalance in the number of positive and negative samples, we used evaluation metrics of SN, SP, MCC, AUC, and F1 scores for comparison.The results are shown in the table above.At the T site, Attenphos achieves the highest SP, AUC, MCC, SN, and F1 values of 87.79%, 0.775, 0.416, 50.91%, and 0.566, respectively.Compared with Transphos, its AUC value is increased by 0.9% and its MCC value by 2.2%.At the Y site, Attenphos achieves the highest SP, SN, AUC, F1, and MCC values.Attenphos obtains the highest MCC and F1 values in S-site phosphorylation prediction.Compared to Transphos, Attenphos shows an increase of 0.5% in the AUC value and 0.3% in the F1 score value.In addition, the MCC, SN, and SP values also perform better.Our research findings unveil the superiority of Attenpphos over Deepphos and Transphos, as confirmed by the ROC curve analysis.The ROC curves of Attenpphos consistently exhibit a larger area under the curve (AUC), indicating the better classification performance of our model compared to the other two models.The ROC graph is shown in Figure 1 below.This indicates that our model can still show excellent performance in the presence of positive and negative sample unbalance.In real-world protein phosphorylation problems, where the distributions of positive and negative samples are often unbalanced, our model demonstrates significant performance improvement in phosphorylation prediction at the S, T, and Y sites, achieving higher overall performance.This suggests that Attenphos is not only applicable to datasets with balanced distributions but also to datasets with unbalanced distributions that are close to the real situation.

Ablation Experiments
We investigated the contribution of the Encoding conversion block to general phosphorylation site prediction.In real-world protein phosphorylation problems, where the distributions of positive and negative samples are often unbalanced, our model demonstrates significant performance improvement in phosphorylation prediction at the S, T, and Y sites, achieving higher overall performance.This suggests that Attenphos is not only applicable to datasets with balanced distributions but also to datasets with unbalanced distributions that are close to the real situation.

Ablation Experiments
We investigated the contribution of the Encoding conversion block to general phosphorylation site prediction.Figure 2   In real-world protein phosphorylation problems, where the distributions of positive and negative samples are often unbalanced, our model demonstrates significant performance improvement in phosphorylation prediction at the S, T, and Y sites, achieving higher overall performance.This suggests that Attenphos is not only applicable to datasets with balanced distributions but also to datasets with unbalanced distributions that are close to the real situation.

Ablation Experiments
We investigated the contribution of the Encoding conversion block to general phosphorylation site prediction.Figure 2  In this section, we performed ablation experiments on the effect of the attention mechanism in the Attenphos method.The experiment results are shown in Table 3 below.As shown in the table, the performance of Attenphos-noAttention undergoes a large decrease compared to Attenphos at the S, T, and Y sites.Although the F1 score improves by 1% and 0.6% at the S and T sites, other metrics such as Mcc and Auc show a large decrease.At the Y site, Attenphos-noAttention decreases by 6.1% in AUC, 7.2% in mcc, and 5.1% in F1 score.Self-attention mechanisms can focus on the relationship between different amino acids.The self-attention mechanism module enables our model to better and more efficiently capture the long-range dependencies of protein sequences and extract global feature information, thus improving the prediction accuracy.

Feature Visualization
To more visually demonstrate the effectiveness of the method in this paper for the classification of phosphorylated versus non-phosphorylated sites, the features extracted by Attenphos were visualized.We chose a commonly used dimensionality reduction visualization method, t-SNE (t-Distributed Stochastic Neighbor Embedding) [19], which can be used to map high-dimensional data into a low-dimensional space and preserve the relative distance relationship between the data.It can reduce the dimensionality of the high-dimensional features and visualize the results.Figure 3 shows the visualization of the original features encoded by embedding and the features extracted by Attenphos.It is clear that the features encoded by embedding are visualized in a cluttered manner.However, the extracted features of Attenphos can easily separate phosphorylated and unphosphorylated samples, indicating that our method can predict the generalized phosphorylated site with high accuracy.

Discussion
We propose a new deep learning method named Attenphos.It can predict potential general phosphorylation sites.And experiments comparing our model with existing methods show that our method has better overall performance compared to other existing methods.The protein sequences are centered on the prediction sites, intercepting short sequences with different window sizes.Short sequences are encoded and input into three modules with the same structure, in which the output features of the self-attention block are transformed and encoded.Then, when convolution is used, effective features can be better extracted.Finally, the three parts are combined into a combined feature to fully improve the generalization ability of the enhanced model and the prediction of the model accuracy.And self-attention can effectively solve the problem of long-distance dependence on protein phosphorylation.
Although our method has achieved better performance, there are still some shortcomings for improvement.(1) In future work, we can improve the framework using some deep learning modules that work well for classification, such as BiGRU [20], which can be explored in the next work.(2) We could attempt to include information other than the protein sequence.For example, protein structure information can provide information about the environment of amino acid residues, such as residue accessibility, local secondary structure, inter-residue distance, etc. PPI [21] information can provide association

Discussion
We propose a new deep learning method named Attenphos.It can predict potential general phosphorylation sites.And experiments comparing our model with existing methods show that our method has better overall performance compared to other existing methods.The protein sequences are centered on the prediction sites, intercepting short sequences with different window sizes.Short sequences are encoded and input into three modules with the same structure, in which the output features of the self-attention block are transformed and encoded.Then, when convolution is used, effective features can be better extracted.Finally, the three parts are combined into a combined feature to fully improve the generalization ability of the enhanced model and the prediction of the model accuracy.And self-attention can effectively solve the problem of long-distance dependence on protein phosphorylation.
Although our method has achieved better performance, there are still some shortcomings for improvement.(1) In future work, we can improve the framework using some deep learning modules that work well for classification, such as BiGRU [20], which can be explored in the next work.(2) We could attempt to include information other than the protein sequence.For example, protein structure information can provide information about the environment of amino acid residues, such as residue accessibility, local secondary structure, inter-residue distance, etc. PPI [21] information can provide association information in protein interaction networks, such as the tendency of proteins in the same signaling pathway to share phosphorylation sites.So, we could try to combine protein sequence, structure, and PPI information and use deep learning models to perform prediction, which can be further explored in future work.(3) Due to the small kinase-specific phosphorylation dataset available and the small number of samples, our model does not have good prediction performance on small samples, and more advanced model architecture is needed for kinase phosphorylation prediction, which will be the goal of our next work.
In summary, our deep learning method Attenphos shows superior performance in protein general phosphorylation site prediction with high accuracy and stability.The method is based on an attention mechanism which works by filtering out the most predictive features and optimal model parameters for prediction.Our method can provide a useful tool not only in basic research but also in biomedical and drug discovery applications, helping to accelerate the identification and study of phosphorylation sites.This study provides a strong reference for future bioinformatics and proteomics studies and offers new ideas for understanding the mechanisms of protein phosphorylation regulation.

Datasets
This study is based on the prediction of general phosphorylation sites.We used the phosphorylation site training dataset provided by Phospho.ELM [22], a database containing experimentally validated, manually curated phosphorylation sites.By integrating information from literature reports and experimental data, the Phospho.ELM database provides extensive annotations of phosphorylation sites and related biological information.We used human protein phosphorylation data and mouse protein phosphorylation data from this database.
For mouse phosphorylation data, as shown in Table 4, our study used the mouse protein phosphorylation dataset from the PELM database and removed redundant data using the CD-HIT tool [23] with a redundancy threshold set at 40%.Since the number of positive and negative samples for phosphorylation is not balanced in the real world, we chose negative samples in this study at a ratio of 1:2 for positive and negative samples.The negative samples were randomly selected on the same protein sequence, excluding identical amino acids outside the labeled phosphorylation sites.Firstly, the dataset was randomly divided into an 80% training set and 20% independent test set, and secondly, 10% of the training set was used as a validation set during training to prevent overfitting.For phosphorylation data from human proteins, we used the same training dataset, validation set, and independent test set as in Transphos [18].The number of positive and negative samples in the dataset was balanced.Among them, the training dataset was obtained from the human phosphorylation data in the Phospho.ELM database, as shown in Table 5.Meanwhile, the independent test set was obtained from the PhosPhAt (PPA) [24][25][26] database, as shown in Table 6.In the above way, we constructed the phosphorylation site datasets for training and evaluation to support our research work.In order to better extract the features near the phosphorylation site, we separately cut the original sequence into short sequences of different lengths centered on the prediction site for experiments.We explored many window combinations by conducting experiments and finally chose the optimal input window combination of 21, 33, and 51.We chose sequence fragments with left and right lengths of 10, 16, and 25, centered on the prediction site.If the left and right lengths of the predicted site were insufficient, * was filled in to ensure that each subsequence had the same length, and the negative samples were manipulated as above.We digitally encoded the short protein sequences.Then, the encoded sequence digital representation was encoded using embedding.When encoding with one-hot, it cannot represent the relationship between adjacent amino acids very well, meaning that the efficiency of training is not very high.The use of embedding can optimize this problem very well.The feature vectors after embedding encoding were 21 × 16, 33 × 16, and 51 × 16.

Model Architecture
The deep learning architecture of Attenphos is shown in Figure 4. We first processed the protein sequence length and extracted short sequences with window sizes of 21, 33, and 51.Experiments have shown that the length combinations of 21, 33, and 51 are the most appropriate.Then, we encoded short sequences by embedding and input the encoded feature vectors to three structurally identical self-attention blocks for feature extraction.Using self-attention could fully obtain the long-range dependence information of protein sequences.Then, the output features were separately transformed and encoded using the Encoding conversion module, converting each feature of length 16 into 4 × 4 features to better learn the relationship features between amino acids.Then, the transformed feature vectors were input into a one-dimensional convolutional block.The resulting three-part output vectors were combined into an integrated feature.Then, it was input to a flatten layer, transformed into a one-dimensional tensor.And then, it passed through several layers of fully connected layers.Finally, the output results were obtained by softmax.

Self-Attention
For neural networks, the input vectors are generally of different sizes.There is a certain relationship between different vectors, which cannot be reflected by the embedding vectors.But the actual training cannot give full play to the relationship among these inputs, resulting in very poor model training results.
The self-attention mechanism is introduced in this model, the specific structure of which is shown in Figure 1a, and the self-attention mechanism is one of the attention mechanisms.The problem of poor model performance due to the inability to establish a correlation between the relevant inputs above can be solved by this.This allows our model to notice correlations between different parts of the whole input and to learn the importance of the features, providing more weight to the more important features to represent the sequence.
In the self-attention mechanism, the input sequence is first subjected to feature extraction.It usually uses a linear transformation to map the input sequence into the Query, Key, and Value vector spaces.Specifically, for the input sequence X = {x1, x2, …, xn}, Query, Key, and Value can be generated in the following ways: The self-attention mechanism is introduced in this model, the specific structure of which is shown in Figure 1a, and the self-attention mechanism is one of the attention mechanisms.The problem of poor model performance due to the inability to establish a correlation between the relevant inputs above can be solved by this.This allows our model to notice correlations between different parts of the whole input and to learn the importance of the features, providing more weight to the more important features to represent the sequence.
In the self-attention mechanism, the input sequence is first subjected to feature extraction.It usually uses a linear transformation to map the input sequence into the Query, Key, and Value vector spaces.Specifically, for the input sequence X = {x1, x2, . .., xn}, Query, Key, and Value can be generated in the following ways: where X denotes the input sequence;W Q , W k , and W V is the learnable weight matrix; and Q, K, and V denote the generated Query, Key, and Value, respectively.
where q i denotes the i-th element of the Query vector, k j denotes the j-th element of the Key vector, d k denotes the dimension of the Key vector, and α ij denotes the attention weight between position i and position j.
where v j denotes the j-th element of the Value vector and o i denotes the i-th element of the Output vector.
In our model, we used a self-attention layer with 16 input units and output features of the win (window length) ×16, which allowed the correlation between the features of each amino acid to be better expressed.

Encoding Conversion
We transformed the feature vector after the output of self-attention by changing the features of length 16 into 4 × 4 features, as shown in Figures 4b and 5a,b.This is because turning the 16-length features into 4 × 4 features increased the representation capability of the neural network.The 4 × 4 feature matrix had more structural information.This approach can better take advantage of the one-dimensional convolutional layer when processing sequential data to improve the model efficiency and the generalization capability of the model, and it also can reduce the number of model parameters.For example, if the original data are considered as image data, a higher dimensional feature matrix retains more detailed information in the image and helps the network to better learn the features in the image.This transformation encoding the features in a way that allows for better learning of amino acid sequences.The transformed feature vectors were fed into a onedimensional convolutional layer with the stride set to 4 and the convolutional kernel set to 8.This configuration allowed the relationships between amino acids to be fully learned and more detailed information to be retained.We verified through experimental results that this method helped in the extraction of model features and improved the accuracy of the model prediction.
where X denotes the input sequence;  ,  , and  is the learnable weight matrix; and Q, K, and V denote the generated Query, Key, and Value, respectively.

𝑠 = 𝑞 ⋅ 𝑘
(4) where  denotes the i-th element of the Query vector,  denotes the j-th element of the Key vector,  denotes the dimension of the Key vector, and  denotes the attention weight between position i and position j.

𝑜 = 𝛼 𝑣
where  denotes the j-th element of the Value vector and  denotes the i-th element of the Output vector.
In our model, we used a self-attention layer with 16 input units and output features of the win (window length) ×16, which allowed the correlation between the features of each amino acid to be better expressed.

Encoding Conversion
We transformed the feature vector after the output of self-attention by changing the features of length 16 into 4 × 4 features, as shown in Figures 4b and 5a,b.This is because turning the 16-length features into 4 × 4 features increased the representation capability of the neural network.The 4 × 4 feature matrix had more structural information.This approach can better take advantage of the one-dimensional convolutional layer when processing sequential data to improve the model efficiency and the generalization capability of the model, and it also can reduce the number of model parameters.For example, if the original data are considered as image data, a higher dimensional feature matrix retains more detailed information in the image and helps the network to better learn the features in the image.This transformation encoding the features in a way that allows for better learning of amino acid sequences.The transformed feature vectors were fed into a onedimensional convolutional layer with the stride set to 4 and the convolutional kernel set to 8.This configuration allowed the relationships between amino acids to be fully learned and more detailed information to be retained.We verified through experimental results that this method helped in the extraction of model features and improved the accuracy of the model prediction.

Model Training
Attenphos is a deep learning framework designed to predict sites in protein sequences where phosphorylation is likely to occur.The framework uses a self-attention mechanism to learn the interactions between amino acid residues to extract effective feature vectors for phosphorylation site prediction.To prevent overfitting, Attenphos uses

Model Training
Attenphos is a deep learning framework designed to predict sites in protein sequences where phosphorylation is likely to occur.The framework uses a self-attention mechanism to learn the interactions between amino acid residues to extract effective feature vectors for phosphorylation site prediction.To prevent overfitting, Attenphos uses dropout layers in the convolutional and fully connected layers.The model uses the categorical cross-entropy loss function, which is minimized by the Adam algorithm, and L2 regularization, which adjusts the model parameters to better fit the data in training.The specific hyper-parameter choices are shown in Table 7.
y i ln P( y i = 1 x j ) + (1 − y i ) ln P( y i = 0 x j ) (7) N is the number of samples, x j refers to the jth input localized sequence, and y j refers to the corresponding phosphorylated state label of the jth input sequence.
TP (true positive): a prediction of 1 and a true value of 1, i.e., true positive; FP (false positive): a prediction of 1 and a true value of 0, i.e., false positive; TN (true negative): a prediction of 0 and a true value of 0, i.e., true negative; FN (false negative): a prediction of 0 and a true value of 1, i.e., false negative.

Conclusions
In this study, we propose a new framework, Attenphos, based on deep learning to accurately predict general phosphorylation sites using protein sequence information.Attenphos has good robustness and generalization ability.Unlike previous deep learning prediction methods, Attenphos consists of self-attention and CNN modules that use three modules with the same structure and different window sizes to capture important sequence representations of protein phosphorylation sites.On the balanced dataset, Attenphos achieved AUC values of 0.788, 0.691, and 0.625 at the S, T, and Y sites, respectively.On the unbalanced dataset for mouse species, AUC values of 0.831, 0.775, and 0.658 and MCC values of 0.511, 0.416, and 0.262 were achieved at the S, T, and Y sites, respectively.It was demonstrated that Attenphos achieved better results on both balanced and unbalanced datasets.The experiment proves that our model outperforms other existing methods and deep learning models in all aspects.Also, we experimentally proved the importance of the Encoding conversion block.Adding the Encoding conversion block can significantly improve the performance of the model.We chose a commonly used downscaling visualization method, t-SNE, and the features extracted by our method, Attenphos, can easily separate phosphorylated sites from unphosphorylated sites compared to the features encoded by embedding.

Figure 1 .
Figure 1.ROC curve plots of Attenphos, Transphos, and Deepphos on the mouse dataset ((a-c) indicate the S, T, and Y sites).
Figure 2 shows that removing the Encoding conversion block in the Attenphos architecture decreases the overall performance of general phosphorylation site prediction.For example, at the S site, removing the Encoding conversion causes the Sp, F1, Auc, and Mcc values to be reduced at the S site.At the T site, the removal of Encoding conversion reduces the values of Sn, F1, Auc, and Mcc.At the Y site, the removal of Encoding conversion reduces the values of Sn, F1, Auc, and Mcc.Therefore, the use of the Encoding conversion module allows the model to better capture the relational features

Figure 1 .
Figure 1.ROC curve plots of Attenphos, Transphos, and Deepphos on the mouse dataset ((a-c) indicate the S, T, and Y sites).
shows that removing the Encoding conversion block in the Attenphos architecture decreases the overall performance of general phosphorylation site prediction.For example, at the S site, removing the Encoding conversion causes the Sp, F1, Auc, and Mcc values to be reduced at the S site.At the T site, the removal of Encoding conversion reduces the values of Sn, F1, Auc, and Mcc.At the Y site, the removal of Encoding conversion reduces the values of Sn, F1, Auc, and Mcc.Therefore, the use of the Encoding conversion module allows the model to better capture the relational features prior to the amino acids, focusing on key information and thus improving the overall predictive performance of the model.

Figure 1 .
Figure 1.ROC curve plots of Attenphos, Transphos, and Deepphos on the mouse dataset ((a-c) indicate the S, T, and Y sites).

Figure 2 .
Figure 2. Comparison of ablation experiment results.(a), (b), and (c) indicate the comparison of experimental results at the S site, T site, and Y site, respectively.

Figure 3 .
Figure 3. t-SNE visualization of Attenphos extracted features and embedding encoded features.(a), (b), and (c) represent the visualization plots after feature extraction using Attenphos at the S, T, and Y sites, respectively, and (d), (e), and (f) represent the feature visualization plots after encoding using embedding at the S, T, and Y sites, respectively.

Figure 3 .
Figure 3. t-SNE visualization of Attenphos extracted features and embedding encoded features.(a), (b), and (c) represent the visualization plots after feature extraction using Attenphos at the S, T, and Y sites, respectively, and (d), (e), and (f) represent the feature visualization plots after encoding using embedding at the S, T, and Y sites, respectively.

Figure 4 .
Figure 4. Attenphos framework.(a) represents the self-attention mechanism, and (b) represents encoding conversion.4.2.1.Self-AttentionFor neural networks, the input vectors are generally of different sizes.There is a certain relationship between different vectors, which cannot be reflected by the embedding vectors.But the actual training cannot give full play to the relationship among these inputs, resulting in very poor model training results.The self-attention mechanism is introduced in this model, the specific structure of which is shown in Figure1a, and the self-attention mechanism is one of the attention mechanisms.The problem of poor model performance due to the inability to establish a correlation between the relevant inputs above can be solved by this.This allows our model to notice correlations between different parts of the whole input and to learn the importance of the features, providing more weight to the more important features to represent the sequence.In the self-attention mechanism, the input sequence is first subjected to feature extraction.It usually uses a linear transformation to map the input sequence into the Query, Key, and Value vector spaces.Specifically, for the input sequence X = {x1, x2, . .., xn}, Query, Key, and Value can be generated in the following ways:

Table 1 .
Model comparison on balanced dataset.(a), (b), and (c) represent the results of model comparison for S site, T site, and Y site, respectively., at the S site, our model obtains the highest Mcc value of 0.443 and the highest Auc value of 0.788, while other metrics such as Sn, Sp, and F1 values are also in the lead.In the case of the Mcc metric, we achieve an improvement of 11% compared to the state-of-the-art prediction tool Transphos.At the T site, our method shows the best performance, with Attenphos obtaining the highest Mcc and Auc values compared to other methods.Compared to GPS 2.1, NetPhos 3.0, PPRED, Musite, PhosphoSVM, SKIPHOS, Deepphos, and Transphos, our architecture improves the Auc values by 2.1%, 5.1%, 11.3%, 6.9%, 2.6%, 4.8%, 1.7%, and 1.9%, respectively, and obtains the highest Mcc value (0.254).At the Y site, our method obtains the Auc value of 0.625, second only to the Auc value of Specifically

Table 2 .
Performance comparison on unbalanced mouse dataset.(a), (b), and (c) denote the comparison of the prediction results of Attenphos with the state-of-the-art deep learning prediction tools at the S site, T site, and Y site, respectively.
Data in bold indicates that the model performs best for that evaluation metric.For each column, deeper colors mean that the model performs better on that metric.
Data in bold indicates that the model performs best for that evaluation metric.For each column, deeper colors mean that the model performs better on that metric.

Table 3 .
Ablation experiments of attention mechanisms in the Attenphos method.(a), (b), and (c) indicate the S, T, and Y sites.
Data in bold indicates that the model performs best for that evaluation metric.