IUP-BERT: Identification of Umami Peptides Based on BERT Features

Umami is an important and widely used taste component in food seasoning. Umami peptides are specific structural peptides that endow foods with a favorable umami taste. Laboratory approaches for identifying umami peptides are time-consuming and labor-intensive, and are not feasible for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, based on a deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformers), in conjugation with the synthetic minority oversampling technique (SMOTE) and the support vector machine (SVM), was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed existing methods, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created with a single deep learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can aid further research to improve the palatability of dietary supplements.


Introduction
Umami taste determines the deliciousness of foods. Many foods possess umami ingredients, such as meat products [1,2], mushrooms [3], soy sauce [4], seafoods [5], and fermented foods [6]. In addition to sweet, bitter, salty, and sour, umami is recognized as the fifth taste, characterized as a meaty, savory, or broth-like flavor [7]. The perception of sweet, bitter, and umami taste is initiated by the binding of taste components to G protein-coupled receptors [8,9]. The main umami taste receptor is the heterodimeric T1R1/T1R3 receptor [10,11]. Umami ingredients are widely used in food production and offer several health benefits [12]. Umami peptides are a group of specific structural peptides, which endow foods with a favorable umami taste [6]. Umami peptides are usually short linear peptides with molecular weights below 5000 Da. Dipeptides and tripeptides account for approximately 60% of the isolated umami peptides [3,10]. Longer linear peptides, including pentapeptides, hexapeptides, heptapeptides, and octapeptides, have also been found to possess strong umami intensity [1,2,5,13]. The binding mechanism of umami peptides to the taste receptor is distinct from that of other umami ingredients, indicating their special taste mechanism.
BERT (bidirectional encoder representations from transformers) is a pretrained deep language model with which state-of-the-art performance has been obtained for various downstream tasks [32]. With a global receptive field, BERT can effectively capture more global context information than convolutional neural network-based models. Recently, BERT has achieved gratifying results in the prediction of various functional peptides, such as bitter peptides [33], antimicrobial peptides [34], and human leukocyte antigen peptides [35]. Soft symmetric alignment (SSA) defines a brand-new method to compare sequences of arbitrary length within vectors [36]. An initial pretrained language model encodes a peptide sequence, and the output of a three-tier stacked BiLSTM encoder is utilized. A linear layer then projects each peptide sequence into a final embedding matrix in R^(L×121), where L represents the peptide length. In the SSA embedded model, the model is trained and optimized using the SSA strategy [37,38].
Here, we created a novel ML-based predictor, namely iUP-BERT, which employed a deep learning pretrained neural network feature extraction method for model development.
For model performance improvement, the synthetic minority oversampling technique (SMOTE) [39] was applied first to overcome the data imbalance. To achieve higher prediction accuracy, the pretrained sequence embedding technique SSA or BERT was then combined with five different ML algorithms (KNN, LR, SVM, RF, and the light gradient boosting machine (LGBM) [38]) to build several models. After optimization, the BERT features combined with the SVM model were finally selected to maximize prediction efficacy. The results of both the 10-fold cross-validation and the independent test showed that the deep representation learning BERT method remarkably improved model performance in identifying umami peptides. iUP-BERT achieved higher accuracy than existing methods based on peptide sequence information alone. Figure 1 illustrates the overall framework of iUP-BERT. The main steps are as follows:
1. Upon input of a peptide sequence, the pretrained sequence embedding technique BERT was used for feature extraction; the SSA sequence embedding technique was included for comparison.
2. After feature extraction, the BERT feature was fused with the SSA feature to make an 889D fusion feature vector.
3. SMOTE was applied to overcome the data imbalance.
4. The LGBM feature selection method was used to optimize the feature space.
5. Five different ML algorithms (KNN, LR, SVM, RF, and LGBM) were combined with the above techniques to build several models; after optimization, the BERT-SMOTE-SVM model was selected to maximize prediction accuracy.
6. The optimized feature representations were combined to establish the final iUP-BERT predictor.

Datasets
For fair comparison, the same peptide datasets (Supplementary File S1) used in previous umami peptide ML models were chosen [24]. In the datasets, 140 peptides either from experimentally validated umami peptides [10,15,16,20] or from the BIOPEP-UWM database [40] were taken as positive samples, whereas the negative samples were 302 non-umami peptides identified as bitter peptides [41,42]. All peptide sequences in both the positive and negative samples were unique. The training dataset includes 112 umami and 241 non-umami peptides. The independent test dataset contains 28 umami and 61 non-umami peptides.
Figure 1. The overall framework of iUP-BERT. (1) The peptide sequence was taken as text and feature-extracted by the BERT model and the SSA method. (2) The 768D BERT feature was fused with the 121D SSA feature to make an 889D fusion feature vector, with the individual feature vectors for comparison. (3) The SMOTE method was used to overcome the data imbalance. (4) The LGBM feature selection method was used to attain the best feature combinations. (5) Five different ML algorithms (KNN, LR, SVM, RF, and LGBM) were combined with the above techniques to build several models. (6) The final iUP-BERT predictor was established by combining the optimized feature representations. BERT: bidirectional encoder representations from transformers; SSA: soft symmetric alignment; SMOTE: synthetic minority oversampling technique; LGBM: light gradient boosting machine; D: dimension; KNN: k-nearest neighbors; LR: logistic regression; SVM: support vector machine; RF: random forest.

Feature Extraction
To extract different and effective features for umami peptide recognition, two deep representation learning feature extraction methods were used: the pretrained SSA sequence embedding model and the pretrained BERT sequence embedding model. Meanwhile, models were trained on training data either balanced with SMOTE or left unbalanced, so that each alternative dataset could be compared for identifying umami peptides. More comprehensive predictive models were created after comparison of the different feature encoding schemes.

Pretrained SSA Embedding Model
SSA defines a brand-new approach to compare sequences of arbitrary length within vectors [36]. An initial pretrained model encodes a peptide sequence, and the output of a three-tier stacked BiLSTM encoder is utilized (Figure 1). A linear layer then projects each peptide sequence into a final embedding matrix in R^(L×121), where L represents the peptide length. A model trained and optimized by the SSA method in this way is called an SSA embedded model.
Consider two embedding matrices in R^(L1×121) and R^(L2×121), named P1 and P2, for two distinct peptide sequences of varying lengths L1 and L2, where αi and βj denote their 121D position vectors. With each amino acid sequence encoded into such a vector representation sequence, an SSA mechanism calculates the similarity between the two sequences based on their embedded vectors, with the soft alignment weights ω_ij computed by Formulas (4)-(7). Because SSA is fully differentiable, these parameters can be backpropagated to the sequence encoder parameters. Using the trained model, each peptide sequence was transformed into an embedding matrix in R^(L×121), and a 121D SSA feature vector was produced by an average pooling procedure.
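To make the comparison concrete, below is a minimal NumPy sketch of soft symmetric alignment between two embedding matrices, following the published SSA formulation [36]; the function names and the random test embeddings are illustrative, not the authors' code.

```python
import numpy as np

def _softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ssa_similarity(p1: np.ndarray, p2: np.ndarray) -> float:
    """Soft symmetric alignment similarity between embedding matrices
    p1 (L1 x 121) and p2 (L2 x 121), per the SSA formulation of [36]."""
    # Pairwise L1 distances between all pairs of position embeddings
    dist = np.abs(p1[:, None, :] - p2[None, :, :]).sum(-1)   # shape (L1, L2)
    # Soft alignments in both directions, combined symmetrically
    alpha = _softmax(-dist, axis=1)
    beta = _softmax(-dist, axis=0)
    a = alpha + beta - alpha * beta
    # Similarity = negative expected distance under the soft alignment
    return float(-(a * dist).sum() / a.sum())

# Example: compare random stand-in embeddings for peptides of lengths 8 and 12
rng = np.random.default_rng(0)
print(ssa_similarity(rng.normal(size=(8, 121)), rng.normal(size=(12, 121))))
```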

Pretrained BERT Embedding Model
BERT is a powerful natural language processing-inspired deep learning method [31]. The core of BERT is a transformer language model with a variable number of encoder layers and self-attention heads, as shown in Figure 1. It provides a pretraining and fine-tuning approach, using enormous amounts of unlabeled data [32,33].
Here, the traditional BERT architecture was used to construct a BERT-based peptide prediction model (Figure 1). There is no need to systematically design and select feature encodings in advance: peptide sequences were taken directly as input and passed to the BERT method to generate feature descriptors automatically. First, the peptide sequences were converted into token representations of k-mers, and positional embeddings were added to obtain the final input tokens. Then, the semantics of the context were captured through multi-head self-attention. A linear transformation completed the forward propagation of the first layer (as shown in Figure 1); there are 12 such layers in the model. The result was used for the BERT pretraining task. The masking task follows the traditional method: part of the input is masked and then predicted, with backpropagation through the cross-entropy loss function. A 768D BERT feature vector was produced by the BERT-trained model.
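As an illustration, the sketch below extracts a 768D feature vector from a BERT-base encoder via the Hugging Face transformers API. The checkpoint name "peptide-bert-base" is hypothetical (the paper does not name its pretrained model), and mean pooling over positions is one common choice; the authors' exact pooling is not specified in this excerpt.

```python
import torch
from transformers import BertModel, BertTokenizer

# "peptide-bert-base" is a hypothetical checkpoint; any BERT-base model
# with 768 hidden units and residue/k-mer tokens could stand in here.
tokenizer = BertTokenizer.from_pretrained("peptide-bert-base")
model = BertModel.from_pretrained("peptide-bert-base").eval()

def bert_feature(peptide: str) -> torch.Tensor:
    """768D embedding for one peptide: mean-pool the last hidden layer
    over sequence positions (a sketch, not the authors' exact pipeline)."""
    tokens = tokenizer(" ".join(peptide), return_tensors="pt")  # residue-level tokens
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state   # shape (1, L, 768)
    return hidden.mean(dim=1).squeeze(0)             # 768D feature vector
```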

Feature Fusion
To obtain a superior feature combination, the 121D SSA eigenvector was concatenated with the 768D BERT eigenvector, generating the 889D SSA+BERT fusion feature vector.
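In code, the fusion is plain vector concatenation; the zero vectors below are stand-ins for real embeddings from the two pretrained models.

```python
import numpy as np

ssa_vec = np.zeros(121)    # stand-in for a 121D SSA feature vector
bert_vec = np.zeros(768)   # stand-in for a 768D BERT feature vector

# Feature fusion: 121 + 768 = 889 dimensions
fused = np.concatenate([ssa_vec, bert_vec])
assert fused.shape == (889,)
```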

Synthetic Minority Oversampling Technique (SMOTE)
SMOTE is also called the "synthetic minority oversampling method" and is an improved scheme based on the random oversampling algorithm [39]. The random oversampling algorithm generates additional minority samples by simply copying existing samples; as a result, it risks model overfitting, because the feature information becomes too specific and insufficiently general. The SMOTE method can effectively achieve class balance in training data [43]. The basic idea is to analyze the minority samples, synthesize new samples from them, and add the artificially simulated samples to the dataset. Briefly, the k nearest neighbors (KNN) of each minority class sample are computed [43]; N samples are randomly selected from the K neighbors, and random linear interpolation between each sample and its selected neighbors constructs new minority class samples. The new samples are then combined with the original data to create a new training set. The procedure runs until the class balance meets the requirement.
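A minimal sketch with the imbalanced-learn implementation, using synthetic stand-in features that mirror the 112 umami vs. 241 non-umami training split:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Imbalanced toy data standing in for the 768D peptide features
X, y = make_classification(n_samples=353, n_features=768, weights=[0.68],
                           random_state=42)
# Interpolate between each minority sample and its 5 nearest minority neighbors
smote = SMOTE(k_neighbors=5, random_state=42)
X_bal, y_bal = smote.fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))   # the classes are now balanced
```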

Machine Learning Methods
Five commonly used high-performance ML models were used for modeling. The k-nearest neighbor (KNN) model [25] finds the K samples most similar to a given new sample, i.e., the K samples "closest" to it. If most of these K samples belong to a certain class, the new sample is assigned to that class.
Logistic regression (LR) [27] is a generalized linear model. It uses the sigmoid function to model the data distribution and to act as the dividing line between positive and negative samples.
The support vector machine (SVM) [28,29] finds a separating boundary that maximizes the smallest distance (known as the margin) to data points of different classes. For binary classification, SVM seeks the classification boundary farthest from both classes, so that slight deviations in the data have little impact.
Random forest (RF) [26] is an ensemble learning algorithm. It trains multiple decision trees on bootstrap samples drawn with replacement, and each node of a tree uses only a random subset of the sampled features. At prediction time, the trees vote, and the majority class is assigned to the sample.
The light gradient boosting machine (LGBM) [38] adopts the histogram algorithm: continuous floating-point features are converted into k discrete values, and a histogram of width k is constructed. The training data are then traversed, and cumulative statistics of each discrete value are collected in the histogram. LGBM uses a depth-limited leaf-wise growth strategy and supports parallel computing.
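For orientation, here is how the five baseline learners might be instantiated with scikit-learn and LightGBM; the hyperparameters are defaults or illustrative guesses, as the tuned values are not given in this excerpt.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),        # probabilistic scores, needed for auROC
    "RF": RandomForestClassifier(n_estimators=500, random_state=42),
    "LGBM": LGBMClassifier(random_state=42),
}
```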

Performance Evaluation
Six widely used binary classification metrics were applied for performance evaluation: accuracy (ACC), the Matthews correlation coefficient (MCC), sensitivity (Sn), specificity (Sp), balanced accuracy (BACC), and the area under the ROC curve (auROC) [44-48]. Here, TP is the number of true positives (umami peptides correctly predicted), TN is the number of true negatives (non-umami peptides correctly predicted), FP is the number of false positives (non-umami peptides predicted as umami), and FN is the number of false negatives (umami peptides predicted as non-umami).
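In terms of these counts, the threshold-dependent metrics take their standard forms:

```latex
\mathrm{ACC}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Sn}   = \frac{TP}{TP + FN}, \qquad
\mathrm{Sp}   = \frac{TN}{TN + FP},

\mathrm{BACC} = \frac{\mathrm{Sn} + \mathrm{Sp}}{2}, \qquad
\mathrm{MCC}  = \frac{TP \cdot TN - FP \cdot FN}
                     {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```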
The receiver operating characteristic (ROC) curve is drawn from a series of different classification thresholds (boundary or decision values), with the true positive rate (sensitivity) as the ordinate and the false positive rate (1 − specificity) as the abscissa. The ROC displays the relationship between true positives and false positives at different confidence levels [12,35,49]. Nevertheless, the ROC curve alone cannot clearly indicate which classifier is superior, so the area under the ROC curve (auROC) is usually adopted as an additional metric for model evaluation: the classifier with the larger auROC performs better. The auROC values of the proposed models were computed and compared with those of previously reported models.
For model evaluation, the widely used K-fold cross-validation and independent testing methods were adopted. First, K-fold cross-validation was applied for model training and validation on the training set. In this study, K was 10; that is, the training set was randomly divided into ten parts, of which nine were used for training and one for validation. The performance of the trained model was evaluated by the average of the 10 validation scores. Independent testing used additional new data, not in the training set, to test and evaluate the trained model. A good model requires good metric values in both K-fold cross-validation and independent testing.
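A sketch of the repeated stratified 10-fold protocol with scikit-learn, again on stand-in data (the real features would come from the BERT/SSA embeddings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=353, n_features=768, weights=[0.68],
                           random_state=42)
# 10-fold stratified CV repeated 10 times, as used for the Figure 2 comparison
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(SVC(probability=True), X, y, cv=cv, scoring="roc_auc")
print(f"mean auROC = {scores.mean():.3f} +/- {scores.std():.3f}")
```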

Preliminary Performance of Models Trained with or without SMOTE
To overcome the data imbalance in modeling, the SMOTE method was first applied. Meanwhile, to explore the embedding feature types of umami peptides, different models were built from the two deep representation learning feature extraction methods (the pretrained SSA embedding model and the pretrained BERT embedding model) in combination with five distinct, widely used ML algorithms (KNN, LR, SVM, RF, and LGBM). The performance of the different combination models trained with or without SMOTE was compared by performing repeated stratified 10-fold cross-validation tests 10 times (Figure 2).
For the 10-fold cross-validation results (Figure 2), all five algorithm models using the SMOTE method, based on either the SSA or the BERT feature, performed better across five metrics (ACC, MCC, Sn, auROC, and BACC) than the models without SMOTE, with Sp as the exception. The scores after model parameter optimization are listed in Table 1. For example, the average ACCs of KNN, LR, SVM, RF, and LGBM based on SSA with SMOTE were 0.842, 0.857, 0.917, 0.915, and 0.917, respectively, exceeding those of the models without SMOTE by 1.08%, 10.44%, 10.88%, 9.45%, and 7.63%, respectively. A similar improvement was observed in the 10-fold cross-validation results based on the BERT feature (Figure 2 and Table 1). Although the best Sp value based on the SSA feature with SMOTE (0.913) was lower than that of the model without SMOTE (0.938), the overall best Sp score (0.959) was still obtained from the BERT feature optimized using the SMOTE method. For SMOTE performance in the independent test of the SSA or BERT feature vector (Table 1), the best scores were again achieved with the SMOTE method across the five metrics. Taking the values based on SSA as an example, ACC was 0.866, with an MCC of 0.683, Sn of 0.814, auROC of 0.916, and BACC of 0.825. These results indicate that oversampling with SMOTE can effectively overcome the data imbalance and improve model performance in predicting umami peptides. Notably, the BACC scores of the five algorithms in the cross-validation results were identical to ACC when SMOTE was used (Figure 2 and Table 1). As BACC reflects the level of data balance, the data became balanced after SMOTE application and BACC became redundant. Similar results were observed in the subsequent cross-validation analyses with SMOTE.

The Effect of Different Feature Types
Meanwhile, from the cross-validation results (Figure 2 and Table 1), the BERT feature vector developed with the SVM algorithm and the SMOTE method performed best out of all the combinations tested across five metrics (ACC, MCC, Sp, auROC, and BACC). Its ACC of 0.923 exceeded the other options by 0.65-18.9%, its MCC of 0.849 by 1.67-75.0%, its Sp of 0.959 by 2.24-33.0%, its auROC of 0.884 by 1.76-20.9%, and its BACC of 0.923 by 0.65-20.0%. Nevertheless, the SSA feature vector conjugated with the KNN and SMOTE algorithms outperformed all the BERT combinations on the Sn metric (0.962). Regarding the performance of the BERT feature vector based on SVM with SMOTE in the independent test (Table 1), its ACC of 0.876 was 2.03% lower than that of the BERT feature based on RF with SMOTE, its MCC of 0.706 lower by 11.0%, its Sn of 0.714 lower by 21.1%, its auROC of 0.926 lower by 4.63%, and its BACC of 0.832 lower by 7.24%, while its Sp of 0.951 was higher by 7.09%. Even so, the BERT-SVM-SMOTE combination was still considered the best model of all the combinations.

The Effect of Feature Fusion
To further improve model performance and obtain more information, the SSA and BERT features were combined into fusion features. The fusion feature was combined with the five algorithms (KNN, LR, SVM, RF, and LGBM) to train baseline models. Table 2 displays the 10-fold cross-validation and independent testing results of the SSA-BERT fusion feature with or without SMOTE. The performance metrics of the individual and fused features with SMOTE for each ML method are summarized in Figure 3. Consistent with the results in Section 3.1, for the 10-fold cross-validation (Table 2), the SSA-BERT fusion feature with the five models using SMOTE displayed remarkably higher values than the models without SMOTE except for the Sp value, and the BACC score was identical to ACC when SMOTE was used. Notably, the best performance of the fusion feature was slightly superior to the BERT feature alone across four metrics, with an ACC of 0.934 higher by 1.19%, MCC of 0.867 higher by 1.90%, Sn of 0.971 higher by 1.25%, and BACC of 0.934 higher by 1.19%. However, the best performance of the fusion feature in the independent test across all six metrics (ACC = 0.876, MCC = 0.724, Sn = 0.857, Sp = 0.934, auROC = 0.919, BACC = 0.871) was lower in every respect than the corresponding scores of the BERT feature alone (ACC = 0.896, MCC = 0.793, Sn = 0.905, Sp = 0.951, auROC = 0.971, BACC = 0.897) with SMOTE (Figure 3 and Table 2). Thus, the fusion of SSA and BERT features is not a beneficial choice for model optimization in automatic umami peptide prediction.

The Effect of Feature Selection
As described in Section 3.3, feature fusion was not superior to the BERT feature alone. In the training set, the sequence vectors had 121 dimensions based on the SSA feature and 768 dimensions based on BERT; the combined fusion feature had 889 dimensions. Higher dimensionality carries a higher risk of information redundancy, which can result in model overfitting. Feature selection is a good way to solve this problem, as it removes redundant and indistinguishable features [38]. The LGBM feature selection method has been proved to be an effective approach and has been successfully applied to ML-based bio-sequence classification [38,50]. Here, we also used it to find the optimized feature space for the umami peptide prediction task. Table 3 presents the performance metrics of the individual and fused features created with the five ML models (KNN, LR, SVM, RF, and LGBM) in conjugation with SMOTE. A visual illustration of the outcomes is shown in Figure 4.
From the 10-fold cross-validation results (Figure 4 and Table 3), with feature selection, all the individual or fusion features based on the SVM algorithm outperformed the other four algorithms (KNN, LR, RF, and LGBM) across four metrics, namely ACC, MCC, Sp, and BACC. The best performance over all the other options was observed for the BERT feature encoding alone based on the SVM algorithm with 139 dimensions (Table 3).
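One way to realize the LGBM-based selection is to rank features by the trained booster's importances and keep the top subset; the exact criterion and threshold used by the authors are not detailed in this excerpt, so the 139D cut below is illustrative.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

# Stand-in for the SMOTE-balanced 768D BERT training features
X, y = make_classification(n_samples=482, n_features=768, random_state=42)

# Rank features by LGBM importance and keep the 139 highest-ranked,
# matching the optimized dimensionality reported for the BERT feature
booster = LGBMClassifier(random_state=42).fit(X, y)
top = np.argsort(booster.feature_importances_)[::-1][:139]
X_selected = X[:, top]   # reduced 139D feature space
```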


Comparison of iUP-BERT with Existing Models
The efficacy and robustness of the iUP-BERT model in umami peptide identification were evaluated next. Its predictive performance was compared with that of the existing methods, iUmami-SCM and UMPred-FRL. As shown in Table 4, in the cross-validation results iUP-BERT clearly outperformed iUmami-SCM and UMPred-FRL in ACC, MCC, Sn, auROC, and BACC. In the independent test, iUP-BERT produced remarkably better results in the five metrics than iUmami-SCM and UMPred-FRL: for ACC by 1.23-3.93%, for MCC by 5.31-13.99%, for Sn by 13.6-25.07%, for auROC by 1.52-3.90%, and for BACC by 4.30-8.86%. Taken together, these comparisons show that iUP-BERT, based on the BERT-SVM-SMOTE combination, is more effective, reliable, and stable than the existing methods for umami peptide prediction.

Feature Analysis Using Feature Projection and Decision Function
To visually explain the excellent performance of iUP-BERT, principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) dimension reduction were used. First, the feature space optimized by feature selection, namely the 139D BERT feature, was reduced to a two-dimensional plane using the PCA and UMAP algorithms, respectively. In Figure 5, red dots represent umami peptides and blue dots represent non-umami peptides. A decision function boundary was then drawn to distinguish between positive and negative samples. As shown in Figure 5, the distributions of positive and negative samples are relatively concentrated in two areas: most positive samples lie in the yellow area, while the negative samples lie in the purple area. Figure 5 also shows that SVM can distinguish most positive and negative samples, yet some samples remain misclassified. Therefore, better feature extraction methods or more suitable machine learning methods will be needed in the future to better separate umami from non-umami peptide sequences.
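The two projections can be reproduced along these lines; umap-learn is assumed for UMAP, and the toy data stand in for the selected 139D features:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
import umap  # from the umap-learn package (assumed here)

# Stand-in for the selected 139D BERT features with umami labels
X, y = make_classification(n_samples=353, n_features=139, random_state=42)

# Two alternative 2D views of the feature space, as in Figure 5
xy_pca = PCA(n_components=2).fit_transform(X)
xy_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X)
# Scatter-plotting xy_pca / xy_umap colored by y reproduces such views
```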

Construction of the Web Server of iUP-BERT
To facilitate rapid and high-throughput screening of umami peptides and maximize the use of the iUP-BERT predictor, an open-access web server was established at https://www.aibiochem.net/servers/iUP-BERT/ (accessed on 23 September 2022). We hope iUP-BERT will be a powerful tool for exploring new umami peptides and for promoting the food seasoning industry.


Conclusions
In this study, a novel machine learning prediction model, namely iUP-BERT, was developed for the accurate prediction of umami peptides based on peptide sequence alone. A single deep representation learning feature encoding method (BERT) was adopted to generate predicted probabilistic scores of potential umami peptides. First, SMOTE was applied to balance the data. Then, the feature extraction approaches (SSA, BERT, or the fused feature) were combined with five different algorithms (KNN, LR, SVM, RF, and LGBM) to build different models. After extensive testing and optimization, the BERT-SVM-SMOTE combination performed best, and further LGBM feature selection produced a robust model with a 139D feature set. To our knowledge, this is the first report on the utilization of the deep representation learning BERT feature in the computational identification of umami peptides. The 10-fold cross-validation and independent test results indicated the efficacy and robustness of iUP-BERT in predicting umami peptides. In comparison with the existing methods (iUmami-SCM and UMPred-FRL) on the independent test, iUP-BERT, using the BERT feature extraction method alone, significantly outperformed these predictors built on combinations of several manual feature extraction methods: for ACC by 1.23-3.93%, for MCC by 5.31-13.99%, for Sn by 13.6-25.07%, for auROC by 1.52-3.90%, and for BACC by 4.30-8.86%. Finally, to maximize the use of the predictor, an open-access iUP-BERT web server was built at https://www.aibiochem.net/servers/iUP-BERT/ (accessed on 23 September 2022). For deep learning-based models, a larger training sample size improves prediction performance. As the training dataset used here was relatively small (112 positive and 241 negative samples), future efforts could be directed at constructing a larger optimized dataset with more identified umami and non-umami peptides for better model performance. Additionally, a more accurate model might be achieved by fine-tuning BERT for feature extraction. Finally, we hope iUP-BERT will be a powerful tool for exploring new umami peptides to promote the umami seasoning industry.

Data Availability Statement:
The data used to support the findings of this study can be made available by the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: