Identify Bitter Peptides by Using Deep Representation Learning Features

A bitter taste often identifies hazardous compounds and it is generally avoided by most animals and humans. Bitterness of hydrolyzed proteins is caused by the presence of bitter peptides. To improve palatability, bitter peptides need to be identified experimentally in a time-consuming and expensive process, before they can be removed or degraded. Here, we report the development of a machine learning prediction method, iBitter-DRLF, which is based on a deep learning pre-trained neural network feature extraction method. It uses three sequence embedding techniques, soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). These were initially combined into various machine learning algorithms to build several models. After optimization, the combined features of UniRep and BiLSTM were finally selected, and the model was built in combination with a light gradient boosting machine (LGBM). The results showed that the use of deep representation learning greatly improves the ability of the model to identify bitter peptides, achieving accurate prediction based on peptide sequence data alone. By helping to identify bitter peptides, iBitter-DRLF can help research into improving the palatability of peptide therapeutics and dietary supplements in the future. A webserver is available, too.


Introduction
Humans and most animals instinctively dislike bitter substances, as the taste often identifies toxic compounds. However, some beneficial nutrients, such as soy products, as well as endive and other Asteraceae vegetables, and certain therapeutic peptides, are often bitter. Proteins can be enzymatically digested into shorter polypeptides that have certain beneficial biological activities. Studies have shown that hydrolyzed polypeptides have good nutritional properties and can be easily absorbed and utilized. However, hydrolysis often produces peptides with varying degrees of bitterness that can be detected even at very low concentrations [1]. The bitter taste of protein hydrolysates is caused by the presence of peptides containing hydrophobic amino acids. Most of these peptides are typically composed of no more than eight amino acids and few contain more than ten. However, bitter peptides containing up to 39 amino acids have been described [2]. The bitter taste of protein hydrolysates is the result of a variety of factors. Hydrophobic amino acids within the polypeptide tend to become exposed, stimulating the taste buds and causing the bitterness. Generally, the more hydrophobic amino acids are exposed, the stronger the bitter taste. In addition, the length, overall hydrophobicity, sequence, and amino acid composition of a polypeptide chain also have a significant impact on bitterness [3].
Identifying bitter peptides using conventional laboratory approaches is expensive and time-consuming. With the availability of large peptide sequence databases, computational prediction has become a practical alternative. In this work, deep representation learning features were used as input into SVM, RF, and LGBM predictor algorithms. The model was optimized by feature selection using the LGBM method. Selected feature sets were subjected to another round of analysis using three algorithms and various hyperparameters. Through 10-fold cross-validation and comparison of independent test results, the optimized final model was developed. In the following, a name such as SSA + BiLSTM indicates that two kinds of features were combined.

Results of Preliminary Optimization
To explore embedded features that are useful in identifying bitter peptides, we first used three deep representation learning feature extraction methods, soft symmetric alignment (SSA), unified representation (UniRep), and bidirectional long short-term memory (BiLSTM). For each of these, we used three distinct machine learning methods, SVM, LGBM, and RF, to develop models and carry out their initial optimization. Table 1 shows the results of 10-fold cross-validation and independent tests for the three models developed based on the above assumptions. The values in the table represent model performance measures after the optimization of model parameters. The best values achieved for individual features are shown in bold and underlined. As shown in Table 1, the 10-fold cross-validation results of the UniRep feature vector, developed using SVM, performed the best of all tested feature/model combinations (accuracy (ACC) = 0.865, Matthews correlation coefficient (MCC) = 0.730, sensitivity (Sn) = 0.867, specificity (Sp) = 0.863, F1 = 0.865, area under the PRC curve (auPRC) = 0.937, and area under the ROC curve (auROC) = 0.931). The ACC of this feature vector exceeded other options by 1.17-9.91%, MCC by 2.67-26.69%, Sn by 0.46-6.25%, Sp by 1.77-14.46%, F1 by 0.98-9.08%, and auROC by 0.54-7.63%. Regarding its performance in independent tests (ACC = 0.867, MCC = 0.735, Sn = 0.844, Sp = 0.891, F1 = 0.864, auPRC = 0.952, auROC = 0.948), ACC was 1.85% lower, MCC was reduced by 4.22%, Sn by 7.35%, F1 by 2.49%, and auPRC by 0.47% compared to the BiLSTM feature vector developed based on SVM. Nevertheless, based on the overall cross-validation performance, it can be concluded that, for the identification of bitter peptides, UniRep features were superior to BiLSTM features.

The Effects of Feature Fusion on the Automatic Identification of Bitter Peptides
At this stage of the development work, we evaluated the use of pairwise combinations of features to generate fusion features. Features were combined in all possible pairings, namely, SSA + UniRep, SSA + BiLSTM, and UniRep + BiLSTM. In addition, SSA, UniRep, and BiLSTM were combined to obtain a triple fusion feature, SSA + UniRep + BiLSTM. These fusion feature combinations were used as input into SVM, LGBM, and RF algorithms to train predictive models and to optimize model performance (Figure 2). Table 2 shows the 10-fold cross-validation and independent test results for all the developed models. The values in this table represent the results after optimizing model parameters. Again, the best performance metric values are shown in bold and underlined.
Comparing Table 1 (see Section 3.1) and Table 2, it is immediately apparent that the optimal performance values of the models using fusion features are better than the best values obtained with non-combined features. As an example, the combination of the 121-dimensional SSA feature and the 3605-dimensional BiLSTM feature, SSA + BiLSTM, developed using RF, showed an ACC value of 0.898, while the ACC of the SSA feature alone was 0.820, representing a 9.51% better performance for the fusion feature.

The Effects of Feature Selection on the Automatic Identification of Bitter Peptides
As described in the previous section, fused feature encoding was clearly superior to non-fused feature encoding. The sequence vector used in the training set had 512 dimensions, while the feature vectors based on the combined fusion feature scheme had 2021, 3726, 5505, and 5626 dimensions. This high number of dimensions increases the risk of redundancy and overfitting of feature information. To resolve this problem, we used the LGBM algorithm for feature selection, together with an incremental feature strategy and a hyperparameter grid search method. For the latter, we selected the scikit-learn GridSearchCV module to perform the hyperparameter search for each model. The performance metrics of each individual, double, and triple fused feature developed using all three machine learning models (SVM, LGBM, RF) are summarized in Table 3, while a visual representation of the outcomes is shown in Figure 3. The outcome measures shown in Figure 3 and Table 3 clearly indicate that the selected fusion feature sets performed significantly better than unselected fusion features. It is apparent from these results that the overall performance of the 106D UniRep + BiLSTM feature vector was better than outcomes achieved with any other feature vector. These results clearly show that selecting feature descriptors is an effective way of resolving problems with information redundancy and was beneficial in optimizing the prediction performance of the bitter peptide prediction model.
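The incremental feature strategy described above can be sketched as follows. This is an illustrative re-implementation on synthetic data: a RandomForest importance ranking stands in for the LGBM ranking used in this work, and the prefix sizes and downstream classifier are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a high-dimensional fused feature matrix
X, y = make_classification(n_samples=200, n_features=60, n_informative=8,
                           random_state=0)

# Rank features by importance (RandomForest importances stand in for LGBM's)
rank = np.argsort(
    RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
)[::-1]

# Incremental feature strategy: evaluate growing prefixes of the ranking by CV
scores = {}
for k in range(10, 61, 10):
    cols = rank[:k]
    scores[k] = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, cols], y, cv=5).mean()

best_k = max(scores, key=scores.get)  # dimensionality with best CV accuracy
```

The prefix size with the highest cross-validated score then defines the reduced feature set carried forward.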

The Effect of Machine Learning Model Parameter Optimization on the Automated Identification of Bitter Peptides
It is apparent from Table 3 that the overall performance of the UniRep + BiLSTM_106 feature set was superior to all other options, with only two isolated exceptions: the 10-fold cross-validation auROC = 0.957 of a model developed based on LGBM, and the Sn = 0.953 of the SSA + UniRep + BiLSTM_336 feature developed based on SVM, were marginally better. Although these measures are 0.52% and 3.36% higher than those achieved using the UniRep + BiLSTM_106 feature, we believe that the UniRep + BiLSTM_106 feature developed using LGBM provided the best overall performance. Therefore, the UniRep + BiLSTM feature set was selected for final development, using three different machine learning methods to build models.
We utilized the scikit-learn GridSearchCV module to perform a hyperparameter search on each model, recording the corresponding optimal hyperparameters for each, and comparing them with default parameters. The observed values are shown in Figure 4 and in Supplemental Table S1. As shown in Figure 4 and Supplemental Figure S1, when the UniRep + BiLSTM feature prediction was run using different algorithms and hyperparameters, the best performance was seen with the RF (Nleaf = 2, n_estimators = 300) model and the LGBM (depth = 3, n_estimators = 75) model. Although in the independent tests the Sn = 0.938 of the RF model was marginally better (Sn was 1.73% higher), and in 10-fold cross-validation the auPRC = 0.955 of the RF model was marginally better (auPRC was 0.84% higher), in every other respect, including ACC, MCC, Sp, F1, auPRC, and auROC, the LGBM-based model showed clearly superior performance in both independent testing and 10-fold cross-validation.
Based on the analysis above, we selected the first 106D features of UniRep + BiLSTM to build the iBitter-DRLF predictor based on the LGBM model and selected the parameters depth = 3 and n_estimators = 75 for further use.

Comparison with Existing Methods
We compared the predictive performance of iBitter-DRLF with existing methods, including iBitter-Fuse [18], MIMML [20], iBitter-SCM [17], and BERT4Bitter [19], to assess the effectiveness and utility of our method against its competitors. Independent test results for iBitter-DRLF and the existing methods are compared in Table 4. These results clearly demonstrate that iBitter-DRLF has significantly better ACC, MCC, Sp, and auROC than existing methods: ACC was 0.64-11.85% higher, MCC was 1.60-29.22% better, Sp was 4.16-15.76% higher, and auROC values were up by 1.35-8.08%. These comparisons show that iBitter-DRLF is more reliable and stable than existing algorithms in predicting the bitterness of peptides.

Feature Visualization of the Bitter Peptide Automatic Recognition Effect
Feature visualizations can communicate key data and features through graphics and colors to enable better insight into complex datasets. UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm that is also suitable for the visual analysis of peptide characteristics. Feature visualization of the automatic recognition of bitter peptides, carried out with the UMAP algorithm, preserved the characteristics of the original data well while greatly reducing the dimensionality of the features. UMAP feature visualization clearly shows differences between feature representations and helps explain the performance improvements of the model after feature optimization. The visualization of dimension-reduced features achieved using UMAP is shown in Figure 5. Compared with Figure 5A-C, the first 106 features of the UniRep and BiLSTM fusion feature, shown in Figure 5D, better discriminate bitter peptides from non-bitter ones.

iBitter-DRLF Webserver
To facilitate the widespread use of our algorithm, we developed an iBitter-DRLF webserver that is freely available online at https://www.aibiochem.net/servers/iBitter-DRLF/ (accessed on 1 May 2022) for the prediction of bitter peptides. The webserver is easy to use: simply paste the peptide sequences into the text box, click the run button, and after a few minutes the results will be displayed on the web page. Please see the webserver interface at the website or in Supplementary Figures S2-S4.

Benchmark Dataset
The updated benchmark dataset from iBitter-SCM [17] is utilized here for modeling and to make future comparisons easier. The datasets used in this study include peptides previously experimentally confirmed to be bitter and peptides constructed as non-bitter using the BIOPEP database [29]. There are 320 bitter peptides and 320 non-bitter peptides in the BTP640 benchmark dataset. To prevent overfitting of the prediction model, the dataset was randomly split into a training subset, BTP-CV, and an independent test subset, BTP-TS, at a peptide ratio of 4:1. As a result, the BTP-CV dataset contains 256 bitter peptides and 256 non-bitter peptides, while the BTP-TS dataset contains 64 peptides in each category. Users can obtain both datasets from https://www.aibiochem.net/servers/iBitter-DRLF/ (accessed on 1 May 2022).
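The stratified 4:1 split described above can be reproduced with scikit-learn; the sequences below are hypothetical placeholders, since the real BTP640 peptides must be obtained from the link above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the 320 bitter and 320 non-bitter peptides of BTP640
sequences = [f"PEPTIDE{i}" for i in range(640)]
labels = [1] * 320 + [0] * 320  # 1 = bitter, 0 = non-bitter

# Stratified 4:1 split into BTP-CV (training) and BTP-TS (independent test)
seq_cv, seq_ts, y_cv, y_ts = train_test_split(
    sequences, labels, test_size=0.2, stratify=labels, random_state=42
)
```

With a balanced 640-peptide set, this yields 256 bitter and 256 non-bitter peptides for BTP-CV and 64 of each for BTP-TS.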

Feature Extraction
To explore the effects of different features on bitter peptide recognition, we used three deep representation learning feature extraction methods, namely, SSA [30], UniRep [25], and BiLSTM [31]. Models were trained on an alternate dataset for the identification of bitter peptides. Different feature encoding schemes were compared to build more comprehensive predictive models.

Pre-Trained SSA Embedding Model
SSA defines a novel measure of similarity between sequences of arbitrary lengths embedded in vectors. First, a peptide sequence was used as input to a pre-trained model and encoded through a three-tier stacked BiLSTM encoder. The final embedding matrix of each peptide sequence, R^(L×121), where L is the length of the peptide, was obtained through a linear layer. Such a model trained and optimized using SSA is referred to as an SSA-embedded model.
Suppose there are two embedding matrices in R^(L×121), F_1 = [x_1, x_2, ..., x_L1] and F_2 = [y_1, y_2, ..., y_L2], for two peptide sequences of differing lengths L_1 and L_2, where each x_i and each y_j is a 121D vector. To calculate the similarity between the two amino acid sequences represented by F_1 and F_2, a soft symmetric alignment mechanism was developed, in which the similarity between the two sequences is calculated from their embedded vectors as follows:
similarity = -(1/A) Σ_i Σ_j a_ij ||x_i - y_j||_1,  where A = Σ_i Σ_j a_ij,
and the alignment weights a_ij are determined by:
α_ij = exp(-||x_i - y_j||_1) / Σ_{k=1..L_2} exp(-||x_i - y_k||_1)
β_ij = exp(-||x_i - y_j||_1) / Σ_{k=1..L_1} exp(-||x_k - y_j||_1)
a_ij = α_ij + β_ij - α_ij β_ij
These parameters are fitted jointly with the parameters of the sequence encoder, since SSA is fully differentiable. The trained model converts a peptide sequence into an embedding matrix, R^(L×121), and a 121D SSA feature vector is generated by an average pooling operation.
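As a sketch of the alignment mechanism, the SSA similarity can be re-implemented in NumPy. The embedding matrices below are random stand-ins, not outputs of the pre-trained encoder; the formulation follows the standard soft symmetric alignment equations.

```python
import numpy as np

def softmax(z, axis):
    # Numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ssa_similarity(F1, F2):
    """Soft symmetric alignment between embedding matrices (L1 x d, L2 x d)."""
    # Pairwise L1 distances: D[i, j] = ||x_i - y_j||_1
    D = np.abs(F1[:, None, :] - F2[None, :, :]).sum(axis=2)
    alpha = softmax(-D, axis=1)  # alignment of each x_i over the y_j
    beta = softmax(-D, axis=0)   # alignment of each y_j over the x_i
    a = alpha + beta - alpha * beta
    return -(a * D).sum() / a.sum()  # negative weighted mean distance

rng = np.random.default_rng(0)
F1 = rng.normal(size=(8, 121))   # stand-in embeddings, peptide of length 8
F2 = rng.normal(size=(12, 121))  # stand-in embeddings, peptide of length 12
sim = ssa_similarity(F1, F2)
```

Since identical sequences align along a zero-distance diagonal, self-similarity is always higher than similarity to a different random sequence.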

Pre-Trained UniRep Embedding Model
The UniRep model was trained on 24 million UniRef50 primary amino acid sequences. The model performs next-amino-acid prediction by minimizing cross-entropy loss, thus learning how to represent proteins internally in the process. Using the trained model, a single fixed-length vector representation of the input sequence was generated from the mLSTM hidden states. The output vector representation was then used as input for training machine learning models. This characterizes the input sequence and enables supervised learning in different bioinformatics tasks.
First, the sequence with L amino acid residues was embedded into a matrix, R^(L×10), using a one-hot encoding. The matrix was then fed into the mLSTM encoder to obtain a hidden state output, R^(1900×L), as an embedding matrix. Finally, by an average pooling operation, the 1900D UniRep feature vector was derived.
At each position t, the mLSTM update is computed as:
m_t = (W_mx X_t) ⊗ (W_mh h_{t-1})
ĥ_t = tanh(W_hx X_t + W_hm m_t)
f_t = σ(W_fx X_t + W_fm m_t)
i_t = σ(W_ix X_t + W_im m_t)
o_t = σ(W_ox X_t + W_om m_t)
C_t = f_t ⊗ C_{t-1} + i_t ⊗ ĥ_t
h_t = o_t ⊗ tanh(C_t)
In these equations, ⊗ represents element-by-element multiplication, h_{t-1} represents the previous hidden state, X_t is the current input, and m_t is the current intermediate multiplicative state. ĥ_t represents the candidate input before the hidden state, f_t is the forget gate, i_t is the input gate, and o_t is the output gate. C_{t-1} is the previous unit state, C_t is the current unit state, and h_t is the hidden state for output. σ is the sigmoid function, while tanh is the hyperbolic tangent function.
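A toy NumPy sketch of one mLSTM step and the subsequent average pooling, assuming the standard mLSTM formulation with biases omitted; the dimensions and randomly initialized weights are illustrative stand-ins for the pre-trained 1900-unit model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, h_prev, C_prev, W):
    """One mLSTM update, following the gate equations above (biases omitted)."""
    m = (W["mx"] @ x) * (W["mh"] @ h_prev)      # multiplicative intermediate state
    h_hat = np.tanh(W["hx"] @ x + W["hm"] @ m)  # candidate hidden input
    f = sigmoid(W["fx"] @ x + W["fm"] @ m)      # forget gate
    i = sigmoid(W["ix"] @ x + W["im"] @ m)      # input gate
    o = sigmoid(W["ox"] @ x + W["om"] @ m)      # output gate
    C = f * C_prev + i * h_hat                  # new unit state
    h = o * np.tanh(C)                          # new hidden state
    return h, C

rng = np.random.default_rng(0)
d_in, d_h, L = 10, 32, 15  # toy sizes; the real UniRep mLSTM has 1900 hidden units
W = {k: 0.1 * rng.normal(size=(d_h, d_in if k.endswith("x") else d_h))
     for k in ["mx", "mh", "hx", "hm", "fx", "fm", "ix", "im", "ox", "om"]}

# Run the encoder over a toy embedded sequence and average-pool the hidden states
X = rng.normal(size=(L, d_in))
h, C = np.zeros(d_h), np.zeros(d_h)
states = []
for t in range(L):
    h, C = mlstm_step(X[t], h, C, W)
    states.append(h)
feature = np.mean(states, axis=0)  # fixed-length representation of the sequence
```

In the real model the pooled vector is 1900D; here it is d_h-dimensional.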

Pre-Trained BiLSTM Embedding Model
BiLSTM is a combination of a forward LSTM and a backward LSTM that captures bidirectional sequence features better than either LSTM alone. LSTM gains its computational power by selectively forgetting and memorizing information: information useful for subsequent computation steps is propagated while useless information is discarded, and a hidden state is output at each time point. Forgetting, memory, and output are controlled by the forget gate, memory gate, and output gate, which are calculated from the hidden state of the previous moment and the current input.
The gates and states are computed as:
f_t = σ(W_f · [h_{t-1}, X_t] + b_f)
i_t = σ(W_i · [h_{t-1}, X_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, X_t] + b_C)
C_t = f_t ⊗ C_{t-1} + i_t ⊗ C̃_t
o_t = σ(W_o · [h_{t-1}, X_t] + b_o)
h_t = o_t ⊗ tanh(C_t)
Here X_t is the current input, h_{t-1} represents the previous hidden state, C̃_t is the candidate cell state, C_t is the current cell state, f_t is the forget gate, i_t is the input gate, o_t is the output gate, C_{t-1} is the previous cell state, and h_t is the output hidden state. Again, σ is the sigmoid function, while tanh is the hyperbolic tangent function.

Feature Fusion
To establish the best feature combination, we first combined the 121D SSA feature vector with the 1900D UniRep feature vector, obtaining the 2021D SSA + UniRep fusion feature vector. Second, the 121D SSA feature vector was combined with the 3605D BiLSTM feature vector to obtain the 3726D SSA + BiLSTM fusion feature vector. Third, the 1900D UniRep feature vector was combined with the 3605D BiLSTM feature vector, giving the 5505D UniRep + BiLSTM fusion feature vector. Finally, the 121D SSA, 1900D UniRep, and 3605D BiLSTM feature vectors were combined to obtain the 5626D SSA + UniRep + BiLSTM fusion feature vector.
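The fusion step is a simple concatenation of the individual feature vectors, which can be sketched as follows (random vectors stand in for the real features):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in feature vectors with the dimensions given above
ssa = rng.normal(size=121)
unirep = rng.normal(size=1900)
bilstm = rng.normal(size=3605)

# Each fusion feature is the concatenation of its component vectors
fusion = {
    "SSA+UniRep": np.concatenate([ssa, unirep]),
    "SSA+BiLSTM": np.concatenate([ssa, bilstm]),
    "UniRep+BiLSTM": np.concatenate([unirep, bilstm]),
    "SSA+UniRep+BiLSTM": np.concatenate([ssa, unirep, bilstm]),
}
```

The resulting dimensions match those listed above: 2021, 3726, 5505, and 5626.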

Feature Selection Method
LGBM is a gradient boosting framework that abandons the level-wise decision tree growth strategy used by most gradient boosting tools in favor of a leaf-wise algorithm with depth restrictions. In this project, LGBM was utilized to identify the optimal feature space and sort features based on their importance values. Data and data labels were entered into the LGBM model to fit the model before using the built-in functions of LGBM to obtain the importance value of each feature. Features were ranked in descending order of importance, and those with an importance value greater than the critical value (the average feature importance) were selected.
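The selection rule, keeping features whose importance exceeds the average, can be sketched generically; the importance values below are random stand-ins for LGBM's built-in feature_importances_.

```python
import numpy as np

rng = np.random.default_rng(7)
importances = rng.gamma(shape=1.0, size=5505)  # stand-in for feature_importances_

order = np.argsort(importances)[::-1]     # rank features, largest first
keep = importances > importances.mean()   # critical value = average importance
selected_idx = order[keep[order]]         # selected features, in ranked order

# X_selected = X[:, selected_idx]  # would subset the fused feature matrix
```

Only features above the mean importance survive, which is what shrinks the 5505D fusion vector toward a compact subset.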
SVM is a typical machine learning (ML) algorithm for dealing with binary classification problems in bioinformatics. We searched gamma and C over 30 values spaced logarithmically from 10^-4 to 10^4, using the default 'rbf' kernel.
RF is a bagging-based algorithm that not only randomly selects samples but also randomly selects features during the node-splitting process. We selected the range of n_estimators as (25, 550) and the range of Nleaf as (2, 12).
LGBM is a gradient boosting framework that uses tree-based learning algorithms. We selected the range of n_estimators as (25, 750) and the range of max_depth as (1, 12).
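The hyperparameter searches above can be wired into scikit-learn's GridSearchCV. The sketch below uses synthetic data and a coarser 5-point grid for speed; the SVM sweep described above uses 30 log-spaced values per parameter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Full search range as described: 30 log-spaced values from 1e-4 to 1e4
full_range = np.logspace(-4, 4, 30)

# Synthetic data stands in for the selected peptide features
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Coarser 5-point grid keeps this sketch fast
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": np.logspace(-4, 4, 5), "gamma": np.logspace(-4, 4, 5)},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
```

`grid.best_params_` then holds the optimal C and gamma, analogous to the recorded optimal hyperparameters in Supplemental Table S1.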

Evaluation Metrics and Methods
We utilized the following five widely used measures [37,38] to evaluate the performance of specific models, calculated as follows:
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
ACC = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
F1 = 2TP / (2TP + FP + FN)
In these equations, TP represents the number of bitter peptides correctly predicted to be bitter, and TN is the number of non-bitter peptides correctly predicted as non-bitter. FP represents the number of non-bitter peptides erroneously predicted as bitter, while FN is the number of bitter peptides falsely predicted as non-bitter. Using the auROC, the proposed models could be compared with each other and with previously described models. The precision-recall curve is the line connecting the points of precision and recall; the auPRC represents the area enclosed by the precision-recall curve and the x-axis. The area under the ROC curve was also used to evaluate predictive performance, with AUC values of 0.5 and 1 representing stochastic and perfect models, respectively. K-fold cross-validation and independent testing are widely used to evaluate machine learning models. K-fold cross-validation divides the raw data into K folds; each subset is used once for validation, while the remaining K - 1 subsets are used as the training set. The K models are evaluated separately on their validation sets, and the final values of the model measures are averaged to obtain cross-validated values. In the work presented here, we used 10-fold (K = 10) cross-validation. For independent testing, a dataset completely different from the training set was used, in which all samples were new to the trained model.
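The five measures can be computed directly from the confusion counts; the counts below are illustrative, not results from this study.

```python
import math

def metrics(TP, TN, FP, FN):
    """Standard binary classification measures used above."""
    sn = TP / (TP + FN)                    # sensitivity (recall)
    sp = TN / (TN + FP)                    # specificity
    acc = (TP + TN) / (TP + TN + FP + FN)  # accuracy
    mcc = (TP * TN - FP * FN) / math.sqrt(
        (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
    )
    f1 = 2 * TP / (2 * TP + FP + FN)
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc, "F1": f1}

# Example: 58 of 64 bitter and 57 of 64 non-bitter test peptides predicted correctly
m = metrics(TP=58, TN=57, FP=7, FN=6)
```

For a perfect predictor all five measures equal 1; for a random one, ACC is near 0.5 and MCC near 0.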

Conclusions
Here we describe the development of a new computational model called iBitter-DRLF that can accurately identify bitter peptides based on sequence data alone. It uses a deep representation learning feature embedding method to predict potential bitter peptides. As a result of extensive testing and optimization of multiple feature extraction approaches, using three distinct algorithms, we identified UniRep + BiLSTM_106 as the best fusion feature set. Additional feature selection, using LGBM classifier input, allowed us to develop a robust model. Results of the 10-fold cross-validation and the analysis of the results obtained through independent testing showed that iBitter-DRLF can effectively predict bitter peptides in protein hydrolysates or among artificially synthesized peptide therapeutics. Based on independent test results, iBitter-DRLF significantly outperformed existing predictors. Finally, to facilitate the use of the algorithm by other scientists, we built an iBitter-DRLF webserver. We hope that the use of iBitter-DRLF prediction of bitter peptides can improve adherence to nutritional supplements and peptide therapeutics in the future and advance drug development and nutrition research.
This work used deep representation learning [39,40] features to improve the predictive performance of the model. Although the exact physicochemical relevance of these features is unclear, this does not prevent the successful use of this method for computational predictions in peptide and protein sequence analysis.