A Machine Learning Method to Identify Umami Peptide Sequences by Using Multiplicative LSTM Embedded Features

Umami peptides enhance the umami taste of food and have good food processing properties, nutritional value, and numerous potential applications. Wet-lab identification of umami peptides is a time-consuming and expensive process. Here, we report iUmami-DRLF, which uses a logistic regression (LR) model built solely on features extracted by a deep learning pre-trained neural network method, unified representation (UniRep, based on a multiplicative LSTM), from the peptide sequences. The findings demonstrate that deep representation learning significantly enhanced the capability and predictive precision of models in identifying umami peptides solely from peptide sequence information. Newly validated umami peptide sequences were also used to test iUmami-DRLF and other predictors, and the results indicate that iUmami-DRLF has better robustness and accuracy and remains valid at higher probability thresholds. The iUmami-DRLF method can aid further studies on enhancing the umami flavor of food to satisfy the demand for umami-flavored diets.


Introduction
Umami taste has been widely accepted as the fifth basic taste, along with the four other basic tastes of sweet, sour, salty, and bitter [1]. Umami substances are important for enhancing the flavor of food and for healthy eating [2]. Umami peptides frequently contain aspartic acid, glutamic acid, asparagine, or glutamine residues. However, peptides that contain these umami amino acids do not necessarily have an umami flavor, and may instead taste bitter [3]. Umami peptides are a novel class of umami agents with numerous potential uses and a distinctive flavor. Additionally, these peptides act synergistically with other umami compounds to enhance the sweetness of sweet items and the saltiness of salty items, while reducing sour and bitter tastes, thus softening the overall taste. Umami is a very important factor affecting the quality of food, and increasing the content of umami substances in food improves its overall palatability.
Wet tests for identifying umami peptides are costly and time-consuming. The postgenomic era's proliferation of peptide sequence databases [4,5] has had a significant impact on the practical application of automated mathematical methods for the discovery of novel umami peptides. The development of umami peptide prediction tools using deep representation learning features has attracted increasing interest in the field of bioinformatics [6][7][8]. Umami-SCM [9], developed in 2020, uses the scoring card method (SCM) combined with the propensity scores of amino acids and dipeptides to identify umami peptides [10]. Its independent test accuracy was reported to be 0.865, and it performed better than existing methods in 10-fold cross-validation tests. Charoenkwan et al. developed UMPred-FRL in 2021 [11], which integrated seven different traditional feature encodings to construct an umami peptide classifier. Jiang et al. proposed iUP-BERT in 2022 [12], which is based on a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformers). Compared to Umami-SCM and UMPred-FRL, iUP-BERT showed superior performance in both independent testing and cross-validation. Despite these notable advancements, ML-based umami peptide detection algorithms that rely exclusively on sequence data still leave considerable room for improvement, particularly in independent testing. For instance, we found that iUP-BERT was not as robust as expected.
Representation learning [13,14] comprises ML techniques that enable the automatic identification of representations from raw data for feature detection or classification. This eliminates the need for manual feature engineering and enables machines to learn the features of protein or peptide sequences and apply them to specific tasks. In deep representation learning, ML techniques transform data from the original representation into a new representation that preserves the information necessary for the task of interest [15][16][17][18][19][20][21][22], while discarding redundant information [23][24][25][26][27][28][29][30][31]. Sequence-based deep representation learning has been recognized as an innovative and efficient approach in protein and peptide research for protein feature prediction [32][33][34][35][36][37][38][39], including the unified representation (UniRep) method [40] and BiLSTM [41].
In this study, sequence-based unified representation (UniRep) features derived from a multiplicative LSTM were used as the sole input for developing an ML-based model, iUmami-DRLF, for the identification of umami peptides. iUmami-DRLF showed exceptional outcomes in the independent tests and 10-fold cross-validation studies. The obtained results had high accuracy, and more importantly, the results of independent testing proved iUmami-DRLF to be far superior to the current techniques and to conventional non-deep representation learning techniques. Additionally, iUmami-DRLF has a wider range of applications and excellent umami peptide discrimination potential. The iUmami-DRLF predictor outperformed the conventional forecasting techniques in the 10-fold cross-validation tests (Sn = 0.959 and auROC = 0.957) and independent tests (ACC = 0.921, MCC = 0.815, Sn = 0.821, Sp = 0.967, auROC = 0.956, and BACC = 0.894). The independent test accuracy of iUmami-DRLF is 2.45% higher than that of iUP-BERT. The effects of different feature selection methods and different deep representation learning features on the classification results were examined using the uniform manifold approximation and projection (UMAP) dimensionality reduction approach. Compared with other state-of-the-art (SOTA) methods, iUmami-DRLF achieves higher accuracy under various probability thresholds and shows better robustness and generalization performance. The steps performed in the construction of iUmami-DRLF are depicted in Figure 1.
Figure 1. Overview of model development. The pre-trained UniRep sequence embedding model was used to embed the peptide sequences into eigenvectors; each peptide sequence was converted into a 1900-dimensional (D) UniRep eigenvector. The synthetic minority over-sampling technique (SMOTE) was used for balancing the imbalanced data. These features were used as inputs to the k-nearest neighbors (KNN), logistic regression (LR), support vector machine (SVM), random forest (RF), and light gradient boosting machine (LGBM) predictor algorithms. Feature selection was performed for model optimization using analysis of variance (ANOVA), LGBM, and mutual information (MI). The selected feature sets were subjected to another round of analysis using the three feature selection algorithms and various hyperparameters. The final optimized model was developed by comparing model performance in 10-fold cross-validation and independent tests. Based on the 91 wet-test-validated umami peptide sequences reported in the latest research (UMP-VERIFIED), we evaluated iUmami-DRLF in comparison to state-of-the-art methods.

Benchmark Dataset
In this work, the model was developed using the updated benchmark dataset from iUmami-SCM [9], which also facilitates future comparisons. The positive dataset comprised experimentally verified umami peptides, including those from the BIOPEP-UWM database [4], while the negative dataset comprised bitter non-umami peptides. The UMP442 benchmark dataset, which contains 140 umami peptides and 302 non-umami peptides, was acquired after data cleaning. To avoid overfitting of the prediction model, the dataset was randomly split into a training subset, UMP-TR, and an independent test subset, UMP-IND. The UMP-TR dataset comprised 112 umami and 241 non-umami peptides, while the UMP-IND dataset comprised 28 umami and 61 non-umami peptides. Both datasets are available at http://public.aibiochem.net/peptides/iUmami-DRLF/ (accessed on 1 April 2023). To validate the accuracy and robustness of our model, we also collected 91 wet-experiment verified umami peptide sequences from the latest research (see Supplementary Table S2); this dataset was named UMP-VERIFIED.

Feature Extraction
For UniRep [40], a total of 24 million core amino acid sequences from UniRef50 were used for training the UniRep model. By predicting the next amino acid while minimizing the cross-entropy loss, the model learns how to accurately represent proteins. Using the trained model, the input sequence is represented as a single fixed-length vector (hidden state) produced by a multiplicative long short-term memory (mLSTM) encoder. The ML model was then trained using this output vector representation. Supervised learning is achieved in various bioinformatics tasks by using this representation of the input sequence as the feature set.
First, a sequence of S amino acid residues was encoded, via one-hot encoding, as a matrix in R^(S×10). The matrix was then passed through the mLSTM encoder to generate output hidden states forming an R^(1900×S) embedding matrix. The 1900-dimensional (D) UniRep feature vector was finally derived using an average pooling operation.
The calculations performed by the mLSTM encoder are given in Equations (1)-(7):

m_t = (W_mx X_t) ⊙ (W_mh h_(t−1))  (1)
ĥ_t = W_hx X_t + W_hm m_t  (2)
i_t = σ(W_ix X_t + W_im m_t)  (3)
f_t = σ(W_fx X_t + W_fm m_t)  (4)
o_t = σ(W_ox X_t + W_om m_t)  (5)
C_t = f_t ⊙ C_(t−1) + i_t ⊙ tanh(ĥ_t)  (6)
h_t = o_t ⊙ tanh(C_t)  (7)

where m_t represents the current intermediate multiplicative state, ĥ_t is the candidate input before the hidden state, f_t represents the forget gate, i_t represents the input gate, o_t denotes the output gate, h_t denotes the output hidden state, C_t is the current unit state, and the W matrices are learned weights. Here, ⊙ denotes element-wise multiplication, X_t represents the current input, h_(t−1) denotes the previous hidden state, C_(t−1) represents the previous unit state, σ denotes the sigmoid function, and tanh denotes the hyperbolic tangent function.
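As an illustration, a single encoder step and the average pooling over hidden states can be sketched in NumPy following the standard mLSTM formulation described by Equations (1)-(7). The weight-matrix names (`W["mx"]`, `W["mh"]`, etc.) and the toy dimensions below are our own choices for this sketch, not UniRep's actual parameters, and biases are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x_t, h_prev, c_prev, W):
    """One mLSTM step; W holds the input and multiplicative weight
    matrices for each gate (biases omitted for brevity)."""
    m_t = (W["mx"] @ x_t) * (W["mh"] @ h_prev)       # intermediate multiplicative state
    h_hat = W["hx"] @ x_t + W["hm"] @ m_t            # candidate input
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_t)     # input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_t)     # forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_t)     # output gate
    c_t = f_t * c_prev + i_t * np.tanh(h_hat)        # unit state update
    h_t = o_t * np.tanh(c_t)                         # output hidden state
    return h_t, c_t

def embed(seq, W, d_hidden):
    """Run the encoder over an S x d_in sequence matrix and average-pool
    the hidden states into one fixed-length embedding vector."""
    h, c = np.zeros(d_hidden), np.zeros(d_hidden)
    states = []
    for x_t in seq:
        h, c = mlstm_step(x_t, h, c, W)
        states.append(h)
    return np.mean(states, axis=0)

rng = np.random.default_rng(0)
d_in, d_h = 10, 16   # toy sizes; UniRep's hidden state is 1900-D
W = {k: rng.normal(scale=0.1, size=(d_h, d_in if k[1] == "x" else d_h))
     for k in ("mx", "mh", "hx", "hm", "ix", "im", "fx", "fm", "ox", "om")}
v = embed(rng.normal(size=(8, d_in)), W, d_h)   # random stand-in for an encoded sequence
```

With trained weights and a 1900-D hidden state, the pooled vector `v` would correspond to the UniRep feature vector described above.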

Balancing Strategy
Classifiers were built from the unbalanced dataset using the synthetic minority over-sampling technique (SMOTE) [33]. SMOTE is an improvement on the random oversampling algorithm [42]: it analyzes the minority class samples, locates their neighboring samples, and creates artificial new samples based on the minority class. SMOTE first identifies the neighbors of each minority class sample using the k-nearest neighbors (KNN) algorithm, and then uses linear random interpolation to synthesize new samples. A random interpolation position is selected between the samples, and an equal number of interpolations is considered for each sample point. Such a balancing strategy not only increases the sample size but also improves sample quality. Classifiers can learn more distinct features after processing with SMOTE, which significantly improves their performance.
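The core neighbor-and-interpolate step can be sketched as follows. This is a minimal illustration on hypothetical toy data; in practice, a library implementation such as the `SMOTE` class in the imbalanced-learn package would typically be used.

```python
import numpy as np

def smote_sample(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: each synthetic sample interpolates between a
    random minority-class point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]        # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                           # random interpolation position
        synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synth)

# Oversample a toy minority class of 6 points with 10 synthetic samples.
X_min = np.random.default_rng(1).normal(size=(6, 3))
X_new = smote_sample(X_min, n_new=10)
```

Because each synthetic point lies on a segment between two real minority samples, the new samples stay inside the region occupied by the minority class rather than being exact duplicates.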

Feature Selection Strategy
We used three feature selection techniques, namely, analysis of variance (ANOVA) [43,44], light gradient boosting machine (LGBM) [6,45], and mutual information (MI) [46], for selecting among the extracted features. These techniques were employed in this study to determine the best feature space by ranking the features according to their relevance scores. After sorting the features from largest to smallest importance value, the features with importance values above a critical threshold (the average feature importance value) were selected.

Analysis of Variance (ANOVA)
In this study, the features were sorted in order of importance using the ANOVA score. The mean difference between groups can be efficiently evaluated using ANOVA, which computes, for each feature, the ratio of the variance between groups to the variance within groups [47]. The ANOVA score was determined as:

S(t) = S²_θ(t) / S²_ω(t)  (8)

where S(t) represents the score of the feature t, S²_θ(t) denotes the variance between groups, and S²_ω(t) is the variance within groups. The formulae used for calculating S²_θ(t) and S²_ω(t) are as follows:

S²_θ(t) = (1/(K − 1)) Σ_(i=1…K) n_i · ( (1/n_i) Σ_(j=1…n_i) f_t(i,j) − (1/N) Σ_(i=1…K) Σ_(j=1…n_i) f_t(i,j) )²  (9)
S²_ω(t) = (1/(N − K)) Σ_(i=1…K) Σ_(j=1…n_i) ( f_t(i,j) − (1/n_i) Σ_(j=1…n_i) f_t(i,j) )²  (10)

where K denotes the number of groups, N denotes the total number of instances, n_i denotes the number of instances in group i, and f_t(i, j) denotes the value of the j-th sample in the i-th group of the feature t.
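For a single feature, this score is straightforward to compute directly; a minimal sketch, assuming the groups are given by the binary class labels, with toy values chosen to separate the classes clearly:

```python
import numpy as np

def anova_score(x, y):
    """ANOVA score for one feature x: between-group variance divided by
    within-group variance, with groups defined by the labels in y."""
    groups = [x[y == g] for g in np.unique(y)]
    K, N = len(groups), len(x)
    grand = x.mean()
    s2_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (K - 1)
    s2_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - K)
    return s2_between / s2_within

x = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])   # toy feature values
y = np.array([0, 0, 0, 1, 1, 1])               # class labels
score = anova_score(x, y)                      # large score -> discriminative feature
```

This is the same F-statistic computed per feature by scikit-learn's `f_classif`, which is commonly used for this kind of ranking.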

Light Gradient Boosting Machine (LGBM)
LGBM [33] is a fast, distributed, high-performance gradient boosting framework based on decision trees that is employed in numerous ML applications, including classification and ranking. The gradient boosting decision tree (GBDT) improves continuously over successive iterations by learning from the performance of previous learners. Here, we define h_c(x) as the function estimated at the current iteration in Equation (11) and evaluate the loss function in Equation (12):

F_c(x) = F_(c−1)(x) + h_c(x)  (11)
h_c = argmin_h Σ_i L(y_i, F_(c−1)(x_i) + h(x_i))  (12)

where c denotes the current iteration and F_(c−1)(x) denotes the model obtained after the previous iterations. The function that minimizes the loss is selected at each iteration, and the importance of each feature is obtained by ranking the features according to their contribution across iterations.

Mutual Information (MI)
MI has been widely used for feature selection since its development [48]. The advantage of MI in feature selection lies in its ability to equivalently define multidimensional variables and detect nonlinear relationships between variables. Owing to these advantages, the MI method can fully consider the joint correlation and redundancy of features during feature selection [49].
The entropy estimate for a peptide sequence S is provided in Equation (14):

H(S) = − Σ_(i∈Σ_U) P(ε_i) log₂ P(ε_i)  (14)

Using this entropy equation, the MI between peptide sequences was deduced as:

I(X; Y) = Σ_(i∈Σ_U) Σ_(j∈Σ_U) P(ε_i, ε_j) log₂ [ P(ε_i, ε_j) / (P(ε_i) P(ε_j)) ]  (15)

where Σ_U is the alphabet of amino acid residues, P(ε_i) is the marginal probability of residue i, and P(ε_i, ε_j) is the joint probability of residues i and j.
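These quantities can be estimated empirically from paired symbol sequences. The sketch below uses the identity I(X;Y) = H(X) + H(Y) − H(X,Y), which is equivalent to Equation (15); the two-letter toy sequences are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy(symbols):
    """Empirical Shannon entropy: -sum_i P(e_i) * log2 P(e_i)."""
    n = len(symbols)
    return -sum((c / n) * np.log2(c / n) for c in Counter(symbols).values())

def mutual_information(seq_x, seq_y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated over paired symbols."""
    return entropy(seq_x) + entropy(seq_y) - entropy(list(zip(seq_x, seq_y)))

mi_same = mutual_information("DEDE", "DEDE")   # fully dependent -> 1.0 bit
mi_none = mutual_information("DDEE", "DEDE")   # empirically independent -> 0.0 bits
```

The first pair shares all its information (MI equals the full entropy of one sequence), while in the second pair each residue of one sequence tells us nothing about the other, so the MI vanishes.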

Machine Learning Methods
Five widely used high-performance ML methods were employed in this study, namely, KNN, logistic regression (LR), support vector machine (SVM), random forest (RF), and LGBM [50].
KNN is one of the most straightforward machine learning algorithms and is well suited for automatic classification in studies with large sample sizes. A sample is assigned to a class if the majority of its K most similar samples in the feature space (its nearest neighbors) belong to that class. This approach determines the class of a sample based solely on the classes of the samples nearest to it.
LR is categorized as a supervised learning method in ML. The concept of LR is that, assuming the data obey a certain distribution, the parameters are estimated by maximum likelihood estimation. Despite its name, it is a classification model and is often used for binary and multi-class classification problems. It is widely used owing to its simplicity, parallelizability, and strong interpretability.
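As a sketch of the maximum likelihood idea, the snippet below fits a one-feature logistic regression by gradient descent on the cross-entropy loss. The toy data stand in for peptide feature vectors; in practice, a library implementation such as scikit-learn's `LogisticRegression` would be used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit binary logistic regression by gradient descent on the
    cross-entropy loss (equivalent to maximum likelihood estimation)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)              # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)    # gradient step on the weights
        b -= lr * np.mean(p - y)            # gradient step on the bias
    return w, b

# Toy 1-D separable data standing in for peptide feature vectors.
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])
w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```

The fitted model outputs a probability per sample, which is thresholded at 0.5 to yield the class label; this probabilistic output is what makes LR convenient for the threshold analyses later in the paper.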
SVM is applied for solving binary classification problems in bioinformatics. RF is a bagging-based technique that uses random feature selection during node splitting in addition to random sampling of training instances.
LGBM is a gradient boosting framework that employs tree-based learning methods.

Evaluation Metrics and Methods
Five widely used measures were used for evaluating the performance of the models, calculated using Equations (16)-(20):

Sn = TP / (TP + FN)  (16)
Sp = TN / (TN + FP)  (17)
ACC = (TP + TN) / (TP + TN + FP + FN)  (18)
BACC = (Sn + Sp) / 2  (19)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))  (20)

where TP denotes the number of umami peptides correctly identified as umami, and TN denotes the number of non-umami peptides correctly identified as non-umami. FP denotes the number of non-umami peptides falsely identified as umami, while FN denotes the number of umami peptides incorrectly identified as non-umami. The developed models were also compared with one another and with previously reported models based on the receiver operating characteristic (ROC) curve. The area under the ROC curve (auROC) was also used for evaluating the predictive performance, where auROC values of 0.5 and 1 correspond to random and perfect models, respectively. The BACC metric accounts for data imbalance; the values of ACC and BACC are equal in a balanced sample. K-fold cross-validation and independent testing are commonly used to evaluate ML models [51]. In K-fold cross-validation, the raw data are separated into K folds; K − 1 subsets are utilized as training sets, while the remaining subset is used for model validation. The K models are evaluated separately on their validation sets, and the final values of the evaluation measures are obtained by averaging. In this investigation, we employed 10-fold (K = 10) cross-validation. In independent testing, the samples were entirely new to the trained model, and the test dataset was completely disjoint from the training set.
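All five measures follow directly from the confusion matrix. In the sketch below, the confusion counts are hypothetical, chosen only to be arithmetically consistent with the independent test set size reported above (28 umami and 61 non-umami peptides).

```python
import math

def metrics(tp, tn, fp, fn):
    """Binary-classification metrics computed from the confusion matrix."""
    sn = tp / (tp + fn)                          # sensitivity (recall)
    sp = tn / (tn + fp)                          # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)        # accuracy
    bacc = (sn + sp) / 2                         # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "Sn": sn, "Sp": sp, "BACC": bacc}

# Hypothetical confusion counts on a 28-positive / 61-negative test set.
m = metrics(tp=23, tn=59, fp=2, fn=5)
```

Note how MCC collapses toward 0 whenever one class dominates the errors, which is why it is reported alongside ACC for imbalanced test sets.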

Cross-Entropy Loss
When performing a binary classification task, there are only positive and negative examples, and their probabilities add up to 1. Therefore, we simply need to predict a probability rather than a vector.
The loss function is defined simply as:

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]  (21)

where y is the sample label, which takes the value of 1 if the sample is a positive case and 0 otherwise, and ŷ is the probability that the model predicts the sample to be a positive case. In general, the lower the value of the cross-entropy loss function, the better the classification performance [52][53][54][55].
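A direct implementation of this loss for a single sample follows; the small epsilon guard is our own addition to avoid log(0) when a model outputs a probability of exactly 0 or 1.

```python
import math

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one sample:
    -[y * log(p) + (1 - y) * log(1 - p)], with p clipped away from 0 and 1."""
    p = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

loss_good = cross_entropy(1, 0.9)   # confident and correct -> small loss
loss_bad = cross_entropy(1, 0.1)    # confident but wrong -> large loss
```

The asymmetry is the point: a confidently wrong prediction is penalized far more heavily than a confidently correct one, which is why minimizing this loss pushes predicted probabilities toward the true labels.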

Effect of SMOTE
We first extracted a 1900-dimensional feature vector using UniRep. The model was developed and initially trained using five different ML techniques, namely, KNN, LR, SVM, LGBM, and RF, to investigate the effect of SMOTE on the automatic identification of umami peptides. The outcomes of independent testing and 10-fold cross-validation of the five ML models optimized with SMOTE and the five ML models optimized without SMOTE are depicted in Figure 2 and Supplementary Table S1. The values in the tables and figures indicate model performance following the optimization of model parameters. Figure 2. Results of 10-fold cross-validation (A) and independent testing (B) of the five ML models balanced with SMOTE and the five ML models without SMOTE. As illustrated in Figure 2 and Supplementary Table S1, the performance of models optimized with SMOTE was clearly superior to that of models developed without SMOTE optimization. Using the LR-based prediction model as an example, the LR-SMOTE model outperformed or equaled the LR model without SMOTE optimization in 66.7% of the metrics in the 10-fold cross-validation and independent tests. Of the SVM-based models, the SVM-SMOTE model outperformed the SVM model developed without SMOTE optimization in 83.3% of the indicators.
In some models, the Sp values were high while the Sn and other indicators were very poor, owing to the bias of the unbalanced dataset towards the negative class, which degraded the recognition of the positive class. These findings emphasize the value and significance of optimizing imbalanced datasets using SMOTE. Moreover, the UMAP visualization in Figure 3 shows that balancing the datasets with SMOTE improved the predictive ability of the models in identifying umami peptides.

Effects of Different ML Models
The results of Section 3.1 revealed that the SMOTE algorithm optimized the unbalanced data to a certain extent. The results of 10-fold cross-validation and independent tests of the models created using SMOTE-balanced features with the five ML algorithms are depicted in Table 1. Table 1. Results of 10-fold cross-validation and independent testing based on the five ML algorithms developed using SMOTE-balanced features. As depicted in Table 1, the recognition of umami peptides by the LR model outperformed that of the other ML models in 66.7% of the metrics. The results of 10-fold cross-validation revealed that the LR model, iUmami-DRLF, exceeded all other ML models in four metrics. The ACC and BACC of iUmami-DRLF were 0.22-6.97% higher than those of the other models, while the MCC and Sn increased by 0.71-16.51% and 0.85-29.27%, respectively. The results of the independent tests revealed that the LR model, iUmami-DRLF, outperformed the other ML models in four metrics: the ACC, MCC, auROC, and BACC of iUmami-DRLF were superior to those of the other models by 0.95-19.13%, 2.67-153.10%, 2.32-17.62%, and 0.49-48.82%, respectively. Although the SVM model achieved the best values for certain metrics, the independent tests revealed that the SVM model produced markedly unbalanced predictions (MCC = 0.258, Sn = 0.100, and BACC = 0.549). We, therefore, selected the LR model for developing the umami peptide predictor. Additionally, the results of the 10-fold cross-validation of the five models revealed that the values of ACC and BACC were equal, indicating that the dataset was balanced following optimization with SMOTE. The equal values of ACC and BACC have been indicated in blue in Table 1.

Effects of Different Feature Selection Methods
As described in Section 3.1, the balanced, SMOTE-optimized data encoding method significantly outperformed the unprocessed data encoding approach in the tests. The feature vector extracted using UniRep had 1900 dimensions, which exceeds the 353 peptide sequences in the training set. The use of high-dimensional feature vectors frequently leads to over-fitting or redundancy of feature information. To address this issue, we used three feature selection methods, namely, ANOVA, LGBM, and MI, for selecting among the high-dimensional feature vectors. An incremental feature strategy and a hyperparameter grid search approach were employed in this study, and the GridSearchCV module in the scikit-learn library was used for searching the hyperparameters of each model. Table 2 summarizes the outcomes of 10-fold cross-validation and independent testing of the five ML models developed based on the UniRep features selected using the three feature selection methods. The results of independent testing of the models with selected features and the models without selected features are compared in Figure 4. The outcomes of the independent testing shown in Figure 4 amply demonstrate that the selected feature sets outperformed the unselected features. In the independent tests, the Sp of the 1900D models without feature selection was lower than that of all the models with feature selection (by 5.00-8.75%), with the exception of the SVM-based model. These results clearly demonstrate that the selection of feature descriptors effectively resolves information redundancy and helps optimize the prediction performance of the umami peptide prediction model. Figure 4 and Table 2 clearly show that, of the three feature selection methods, LGBM provided the best overall performance for the identification of umami peptides.
Considering the LR model as an example, the LGBM feature selection method outperformed the other methods (ANOVA and MI) in all six metrics in the 10-fold cross-validation studies: ACC, MCC, Sn, Sp, auROC, and BACC improved by 4.17-4.88%, 9.78-11.65%, 5.04-7.03%, 2.88-3.36%, 1.59-2.03%, and 4.17-4.88%, respectively, when the LGBM method was used. The LGBM feature selection method also outperformed ANOVA and MI in five metrics in the independent tests: ACC, MCC, Sp, auROC, and BACC improved by 2.45-3.72%, 6.12-11.19%, 1.68-5.34%, 2.80-10.65%, and 0.68-5.18%, respectively.
Based on the aforementioned results (Sections 3.1-3.3), we believe that the LR model developed using the first 177 dimensions of UniRep features on SMOTE-optimized data was superior in predicting umami peptides, which corroborates the results of the visual analysis discussed hereafter in Section 3.4. Accordingly, the first 177D features of UniRep, selected using LGBM, were combined with the LR model to construct the iUmami-DRLF predictor for subsequent studies.

Comparison with Existing Methods
To evaluate the efficacy and applicability of our technique in comparison to other predictors, we assessed and compared the predictive performance of iUmami-DRLF with that of other methods, including iUmami-SCM and UMPred-FRL. Table 3 compares the results of 10-fold cross-validation and independent testing of iUmami-DRLF with those of other existing methods. Table 3. Results of 10-fold cross-validation and independent testing of iUmami-DRLF and other existing methods. Comparison of the two iUmami-DRLF predictors revealed that the 10-fold cross-validation results of iUmami-DRLF(LR) were slightly worse than those of the SVM model (ACC and BACC, MCC, Sn, Sp, and auROC were 2.02%, 4.31%, 1.30%, 2.79%, and 2.37% lower, respectively). However, the results of independent testing were superior for iUmami-DRLF(LR) (ACC, MCC, Sp, auROC, and BACC were 3.66%, 9.25%, 5.08%, 4.53%, and 2.75% higher, respectively), indicating that the generalization ability of LR was stronger. These comparative analyses demonstrated the superiority of iUmami-DRLF in umami peptide prediction; its predictions were more reliable than those of the existing methods.

Feature Visualization
Feature visualization can intuitively convey feature information through images to clearly represent the dataset. Uniform manifold approximation and projection (UMAP) is a popular dimensionality reduction algorithm, and was used in this study for visual analyses of the features in the umami peptide dataset. The differences in feature representation are clearly highlighted in UMAP visualization. The results of dimensionality reduction for feature visualization with UMAP are depicted in Figure 3. Figure 3 demonstrates that, compared with the UniRep feature vector without SMOTE optimization (Figure 3A), the SMOTE-optimized 1900D UniRep feature vector (Figure 3B) was better at distinguishing umami peptides from non-umami peptides. Compared with the SMOTE-optimized UniRep features (Figure 3B), the top 177D features (Figure 3C) and top 121D features of UniRep (Figure 3D) were further improved after feature selection.

Web Server Development
To enable other researchers to predict umami peptides, we created the user-friendly iUmami-DRLF web server, which is freely accessible online at https://www.aibiochem.net/servers/iUmami-DRLF/ (accessed on 1 April 2023). The web server is easy to use: the user only needs to enter the peptide sequence in the text box, click the run button, and wait a few minutes, after which the result is displayed on the web page, indicating whether the input peptide sequence is an umami peptide. The output includes the input sequence, the predicted class, and the confidence level. See the web server interface on the website or in Supplementary Figures S1-S3. Additionally, please contact the corresponding authors if you need to predict a large number of sequences.

Methods' Robustness
To further verify the effectiveness and robustness of the model, we collected 91 wet-experiment verified umami peptide sequences reported in the latest literature [56][57][58][59][60][61][62][63][64][65][66][67][68][69][70]. These empirical umami peptide sequences constituted the dataset UMP-VERIFIED, which was then used to test state-of-the-art methods, including UMPred-FRL [11] and iUP-BERT [12], for comparison to iUmami-DRLF. Here, the accuracy of the models under different prediction probability thresholds was compared. The probability threshold T means that, for a peptide sequence, if the probability predicted by the machine learning model is greater than T, the model determines that the sequence is an umami peptide; otherwise, it is classified as non-umami. Figure 5A shows the relationship between the accuracy of the three models and the probability threshold. As seen in the figure, our model iUmami-DRLF has the best accuracy under every probability threshold. It is particularly noteworthy that the accuracy of iUP-BERT is 0 at the 95% probability threshold, indicating that the model has failed, while iUmami-DRLF reaches 52.7%, nearly six times the accuracy of UMPred-FRL (8.8%). When the probability threshold was set to 99%, the prediction accuracy of both iUP-BERT and UMPred-FRL was 0, meaning that both methods were invalid. In sharp contrast, iUmami-DRLF still maintained a prediction accuracy of 40.7%, and the model remained functional. These results prove that iUmami-DRLF has better robustness and generalization performance than the other methods. Figure 5. (A) Prediction accuracy of the three models at different probability thresholds; (B) cross-entropy loss of the predicted outcomes at each probability threshold. The smaller the cross-entropy loss, the better the robustness and accuracy of the model.
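Because UMP-VERIFIED contains only verified umami peptides (all positives), the accuracy at a threshold T reduces to the fraction of predicted probabilities exceeding T. A minimal sketch, using hypothetical predicted probabilities:

```python
def accuracy_at_threshold(probs, threshold):
    """Accuracy on an all-positive verification set such as UMP-VERIFIED:
    the fraction of predicted umami probabilities exceeding the threshold."""
    return sum(p > threshold for p in probs) / len(probs)

# Hypothetical predicted probabilities for five verified umami peptides.
probs = [0.99, 0.97, 0.80, 0.60, 0.40]
acc_50 = accuracy_at_threshold(probs, 0.50)   # 4/5 = 0.8
acc_95 = accuracy_at_threshold(probs, 0.95)   # 2/5 = 0.4
```

Raising the threshold can only lower this accuracy, so a model whose accuracy decays slowly as T increases, as iUmami-DRLF's does, is one that assigns high probabilities to true positives.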
Note that at the probability thresholds of 95% and 99%, the prediction accuracy of iUP-BERT and UMPred-FRL is 0; their corresponding cross-entropy losses can be calculated but are not meaningful.
The robustness and effectiveness of iUmami-DRLF stem from the fact that it is an optimized model with minimal cross-entropy loss. For binary classification machine learning models, the closer the predicted output is to the true sample label, the smaller the cross-entropy loss and the better the accuracy [71].
This is supported by the data shown in Figure 5B, which displays each model's cross-entropy loss under different probability thresholds. iUmami-DRLF has the minimum cross-entropy loss of the three models at the probability thresholds of 50%, 70%, and 85%. At the 95% probability threshold, the cross-entropy loss of iUmami-DRLF is significantly smaller than that of UMPred-FRL. At the 95% and 99% probability thresholds, the UMPred-FRL and iUP-BERT models have failed, and the calculated cross-entropy is meaningless; for example, the cross-entropy loss of iUP-BERT remains unchanged between the 95% and 99% thresholds.

Conclusions and Future Work
In this research, we proposed a predictor, iUmami-DRLF, for the successful prediction of umami peptides based solely on sequence information. The imbalanced dataset was processed with SMOTE, and the latent umami peptide information was captured using the UniRep deep representation learning feature embedding approach. Our predictor was strengthened by the use of three feature selection techniques, namely, LGBM, ANOVA, and MI, and by comparing five ML algorithms (KNN, LR, SVM, RF, and LGBM) for model development. Following testing and optimization, the top 177D features of UniRep were selected as the optimal feature set and integrated with the LR model to develop the final predictor. The results of 10-fold cross-validation and independent testing revealed that iUmami-DRLF markedly outperformed the existing methods in the independent tests. The latest umami peptide sequences verified by wet experiments were used to validate the method, and the results show that iUmami-DRLF predicts umami peptides more reliably, robustly, and accurately (independent tests: ACC = 0.921, MCC = 0.815, Sn = 0.821, Sp = 0.967, auROC = 0.956) than the reported state-of-the-art methods. We hope that the user-friendly web server will be useful for researchers in the area. Although iUmami-DRLF has significantly increased the accuracy of umami peptide prediction, the following areas can still be improved. First, as our feature extraction model requires substantial computation, web servers without a GPU will take a long time to complete this task; users can contact the corresponding authors if they need to predict a large number of sequences. Furthermore, using the most recent empirical data when training the model might produce better outcomes. Finally, model distillation could simplify the feature extraction model and reduce its computational complexity.

Data Availability Statement:
The data used to support the findings of this study can be made available by the corresponding author upon request.