A Novel LSTM-Based Machine Learning Model for Predicting the Activity of Food Protein-Derived Antihypertensive Peptides

Food protein-derived antihypertensive peptides are a representative type of bioactive peptides. Several models based on partial least squares regression have been constructed to delineate the relationship between the structure and activity of the peptides. Machine-learning-based models have been applied in broad areas, which also indicates their potential to be incorporated into the field of bioactive peptides. In this study, a long short-term memory (LSTM) algorithm-based deep learning model was constructed, which could predict the IC50 value of the peptide in inhibiting ACE activity. In addition to the test dataset, the model was also validated using randomly synthesized peptides. The LSTM-based model constructed in this study provides an efficient and simplified method for screening antihypertensive peptides from food proteins.


Introduction
Globally, hypertension has been ranked as one of the major chronic diseases. It has been estimated that about 1.4 billion adults are suffering from hypertension worldwide and the prevalence is still on an upward trend [1]. The renin-angiotensin system (RAS) plays a major role in the regulation of blood pressure. Angiotensin II (Ang II), which is a potent vasoconstrictor in the RAS, is generated from Ang I with the action of angiotensin converting enzyme (ACE) [2]. Clinically, the inhibition of ACE activity to suppress the formation of Ang II has been considered an efficient strategy for the management of high blood pressure. Thus, synthetic ACE inhibitors have been used as a first-line pharmaceutical drug for hypertension therapy [3].
Notably, peptides that could inhibit ACE activity were identified from snake venom in 1971 and were characterized as ACE inhibitory peptides [4]. Since then, a large number of ACE inhibitory peptides have been identified from various natural protein sources, including food proteins such as milk proteins, egg proteins and soy proteins [5]. Compared with synthetic drugs, food protein-derived ACE inhibitory peptides are considered to have fewer side effects and lower production costs, which makes these peptides a promising alternative to antihypertensive drugs. As a representative category of food protein-derived bioactive peptides, ACE inhibitory peptides have been studied from diverse angles, mainly peptide identification, mechanistic studies and clinical trials [6]. Particularly over the past two decades, considerable effort has been devoted to delineating the relationship between the structure and activity of ACE inhibitory peptides. Since it has been widely accepted that the biological activity of a chemical structure can be described by its chemical features, such as its composition, electronic attributes and hydrophobicity [7], the peptide concentration that inhibits 50% of the ACE activity (known as the IC50 value) has been used as an output that correlates with the structural features of the peptides. Based on this principle, quantitative structure-activity relationship (QSAR) modelling was applied in order to predict the IC50 value of ACE inhibitory peptides, and several models have been established [8,9]. However, representing the structural features of a peptide is a complicated process. In addition, the use of different strategies for peptide representation may result in variations in the accuracy of these models.
Artificial neural networks (ANN) are algorithmic mathematical models that mimic the behavioral characteristics of animal neural networks and perform distributed parallel information processing. ANN relies on the complexity of the system and achieves the purpose of processing information by adjusting the interconnected relationships between a large number of internal nodes. The deep learning-based ANN has been widely applied in the field of biomedicine. Several deep learning-based models have been constructed to predict the activity of antioxidant peptides [10,11], anticancer peptides [12] and antibacterial peptides [13]. The long short-term memory (LSTM) network is a special type of recurrent neural network (RNN) that is capable of learning order dependence in sequence prediction problems. Compared with shallow learning, LSTM has a deep learning framework with a large number of hidden layers, allowing it to learn more complex non-linear patterns [14]. Notably, the LSTM-based model has been constructed for the discovery of antimicrobial peptides [15], suggesting the feasibility of applying LSTM in predicting the activity of bioactive peptides.
Collectively, an LSTM-based prediction model was constructed in the present study, which could provide an efficient and simplified structure and activity model for ACE inhibitory peptides, as well as enabling further exploration of the application of LSTM networks in the field of bioactive peptides.

An Overview of the Dataset
In total, 3429 peptide sequences with their corresponding IC50 ACE inhibitory values were retrieved from the database and used in this study. As shown in Figure 1A, the IC50 values of the peptides were variable and ranged from less than 1 µM to above 1000 µM. However, the IC50 values of most of the peptides were less than 100 µM, indicating that these peptides have potent ACE inhibitory activity. In total, 2327 peptides in the dataset were functional ACE inhibitory peptides.
Molecules 2023, 28, x FOR PEER REVIEW
The amino acid distribution of the peptides in the benchmark dataset was also analyzed. Proline appeared most frequently, accounting for 19.2% of all the amino acids, which is strikingly higher than the frequency of any other amino acid (Figure 1B). This finding is in line with previous reports that proline is frequently present in various bioactive peptides [10,16]. In contrast, methionine is absent from the dataset, and the underlying reasons for this are yet to be determined (Figure 1B).

Performance Evaluation of the Model
Both the train loss and the test loss of the LSTM model decreased as the training cycles progressed (Figure 2), which indicates that the prediction accuracy of the LSTM model could be improved through training. However, the curves of the train set and the test set were not superimposable, which might be due to the limited amount of data included in the test set. The performance of the model was then evaluated by five-fold cross-validation. The mean accuracy, average sensitivity and average specificity of the model were 85.20%, 84.92% and 85.43%, respectively. Furthermore, the RMSE was 0.18.
In addition, for the 343 peptides included in the test set, the ratio of the predicted IC50 to the reported IC50 was plotted. As shown in Figure 3, the ratios of 256 peptides fell within the range of 0.75 to 1.25, which supports the accuracy of the model.

Model Validations
Based on the literature search, 54 peptides were retrieved that had been reported with both their in vitro ACE inhibitory IC50 value and their in vivo blood-pressure-lowering effect. We then applied our LSTM-based model to predict the IC50 values of these peptides. As shown in Table 1, the ratios of the predicted IC50 to the reported IC50 of 38 peptides fell between 0.80 and 1.20, among which the ratios of 19 peptides were between 0.90 and 1.10. These results indicate the potential of our LSTM-based model to predict the IC50 value of antihypertensive peptides with in vivo activity. Finally, 20 peptides were randomly generated and synthesized. The LSTM-based model was then applied to predict the IC50 values of these peptides. The experimental IC50 value of each peptide was determined using the HPLC-based assay. As shown in Table 2, the ratios of the predicted IC50 to the experimental IC50 of 15 peptides were between 0.75 and 2. Such a result suggests the feasibility of predicting the ACE inhibitory value of a random sequence using the model developed in the present study.

Discussion
Food protein-derived antihypertensive peptides are one of the representative categories of bioactive peptides. Research on antihypertensive peptides has been ongoing for about five decades, and the structure-activity relationship of these peptides has long been a prominent research area. The QSAR modelling of food protein-derived antihypertensive peptides started about two decades ago. Initially, research concentrated on the structural features of di- and tripeptides using partial least squares regression [39]. However, the efficiency of the model was too limited to be used for high-throughput prediction. Notably, machine-learning-based techniques have been applied widely across multiple areas. Importantly, several machine-learning-based models have been constructed that could be used to predict the activity of antioxidant, anticancer and antimicrobial peptides [11,13,15], which indicates the feasibility of applying machine learning algorithms in the QSAR modelling of bioactive peptides.
It has been previously reported that a machine learning model based on the support vector machine algorithm was developed to predict the antihypertensive activity of food protein-derived bioactive peptides. However, the accuracy of the model was less than 80% [40]. In a later study, the extremely randomized tree algorithm was applied, and the performance of the model was improved to 85.0%. However, this model relies on 51 feature descriptors, which makes it complicated [41]. In the present study, we utilized an LSTM deep learning model to investigate the relationship between peptide structure and bioactivity. As a special type of RNN, LSTM has the advantage of capturing historical information from prior inputs and allowing it to influence the current input and output, which has led to applications in speech recognition, natural language processing and time series prediction [42]. In real-life data analyses, when the time interval is long, a conventional RNN cannot memorize the previous information well because of the vanishing gradient problem. To overcome this disadvantage, LSTM was proposed, which combines short-term memory with long-term memory through gate control [43]. Importantly, our results demonstrate that the LSTM model achieved a correlation coefficient of 0.85 on the validation dataset. In addition, the LSTM model's superiority over other models may stem from its ability to capture the sequential nature of peptide data, which allows it to detect subtle structural patterns that influence bioactivity. However, it is important to note that LSTM models have some potential drawbacks, including high computational costs due to their complex architecture and the possibility of overfitting if the dataset is not diverse enough. Therefore, future studies should explore ways to optimize LSTM model performance while controlling these factors.
Since the research on antihypertensive peptides originated from ACE inhibitory peptides, the database available for deep learning training was constructed based on the IC50 values of the peptides in the in vitro ACE inhibitory assay. Thus, despite the satisfactory performance of the model developed in this study, the biological significance of the model is yet to be determined. Furthermore, it is suggested that the current peptide database should be expanded by adding the results from biologically relevant assays, such as cellular experiments and animal studies, if available. The information from biologically relevant assays could be incorporated into the machine learning model in the future. On the other hand, studies in recent years have shown that there may be targets other than ACE for antihypertensive peptides in the context of reducing blood pressure [44]. Hence, it is also recommended that a database based on the other activity parameters of the peptides be constructed.
The LSTM-based model developed in this study also demonstrated high efficiency in predicting the IC50 values of randomly generated peptides. Therefore, this model could potentially be applied in peptide design, which may create novel opportunities for the screening of antihypertensive peptides. In addition, a recent study developed a machine-learning-empowered model capable of performing in silico gastrointestinal digestion of food proteins [45], which could be incorporated into our model to create a more comprehensive activity prediction system. However, only peptides composed of fewer than six amino acids were randomly generated, and the ability of the model to predict the activity of longer peptides is yet to be determined.

Benchmark Dataset
The peptide sequences used for data training in this study were obtained from a number of databases, including BIOPEP-UWM [46], FeptideDB [47] and BioPepDB [48]. In addition, we manually searched the literature to identify the peptides that were not included in the above databases. All of the peptides in the present study were manually curated, merged and cross-checked in order to construct a non-redundant dataset. Furthermore, only peptides with an IC50 value less than 2000 µM were included in this study. Following data collection, the data were randomly divided into a training set and a validation set for the model in a ratio of 9:1.

Literature Searching Strategy
PubMed and Web of Science were searched in order to identify studies, published up to April 2023, that investigated the IC50 in in vitro ACE inhibitory assays as well as the in vivo blood-pressure-lowering effect of food protein-derived bioactive peptides. The search was performed using the following strings: "Bioactive peptides" AND "ACE inhibition" AND "Blood pressure reduction". For model validations, peptides with known sequences that have been previously reported to exhibit in vitro ACE inhibitory IC50 values and significant in vivo blood-pressure-lowering effects were used in this study.

Representation of the Peptide Sequence
The 19 amino acids that appeared in the peptides were each mapped to a distinct integer, as shown in Table 3. Each peptide sequence was then converted into an integer sequence, and the sequences were packaged into a PyTorch dataset with a batch size of 32.
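The encoding step can be illustrated with a minimal, library-free sketch. The integer codes below (alphabetical order of the 19 residues, starting at 1) are hypothetical and may differ from the assignment in Table 3, and the right-padding with 0 within each batch is an assumption that is not described in the text. In practice, the batches would be wrapped in a PyTorch dataset and data loader.

```python
# Hypothetical integer codes: the 19 residues present in the dataset
# (methionine is absent), in alphabetical order, starting at 1.
# Table 3 may use a different assignment.
AMINO_ACIDS = "ACDEFGHIKLNPQRSTVWY"
AA_TO_INT = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}
PAD = 0  # assumed padding value, reserved so it never clashes with a residue


def encode(peptide):
    """Convert a one-letter peptide sequence into a list of integers."""
    return [AA_TO_INT[aa] for aa in peptide.upper()]


def make_batches(peptides, batch_size=32):
    """Yield batches of encoded peptides, right-padded to equal length."""
    for start in range(0, len(peptides), batch_size):
        chunk = [encode(p) for p in peptides[start:start + batch_size]]
        width = max(len(seq) for seq in chunk)
        yield [seq + [PAD] * (width - len(seq)) for seq in chunk]
```

For example, `encode("IPP")` maps the tripeptide to three integers, and `make_batches` groups a list of 40 peptides into one batch of 32 and one batch of 8.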

Machine Learning Algorithms
As shown in Figure 4, the LSTM network consisted of one input layer, one output layer and a series of recurrently connected hidden layers. The hidden layers were memory blocks, each with an input gate, an output gate, a forget gate and some self-recurrent memory cells. The input, output and forget gates provided write, read and reset operations for the memory cells, respectively. Figure 4 gives an example of an LSTM memory block with a single cell. At the core of each memory block is a recurrently self-connected linear unit, the constant error carousel (CEC). The self-recurrent memory cell blocked outside interference, so that the state was preserved from one time point to the next. This is why the LSTM can solve the vanishing gradient problem. Assuming that the model input at time $t$ was $X_t = (X_{t1}, \ldots, X_{tn})^\top$, where $n$ is the number of input dimensions, the input gate selected the information of input $X_t$ to be saved into cell $C_t$. The forget gate selectively forgot the state of the cell at the previous moment, $C_{t-1}$. The forget gate learnt to reset memory blocks once their status was out of date. Furthermore, the forget gate prevented the cell status from growing without bound and saturating the squashing function. The components of the output $h_t$ were controlled by the output gate; that is, the output gate controlled the ability of the cell state to influence other neurons.
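To show the details, the training process of the LSTM model can be formulated with the following equations. The input gate $i_t$ and the forget gate $f_t$ are given by:
$$i_t = \delta\left(W_i [h_{t-1}, X_t] + b_i\right)$$
$$f_t = \delta\left(W_f [h_{t-1}, X_t] + b_f\right)$$
where $h_{t-1}$ is the output of the previous cell, $X_t$ is the input, and $b$ and $W$ denote the bias vectors and the weight matrices, respectively. Then, the cell state $C_t$ is updated using the following formula:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tanh\left(W_c [h_{t-1}, X_t] + b_c\right)$$
where $C_{t-1}$ is the state of the previous cell, and $b_c$ and $W_c$ denote the bias vector and weight matrix, respectively. Finally, the output gate $o_t$ and the output $h_t$ can be defined as:
$$o_t = \delta\left(W_o [h_{t-1}, X_t] + b_o\right)$$
$$h_t = o_t \odot \tanh(C_t)$$
where $b_o$ and $W_o$ denote the bias vector and weight matrix, respectively. $\delta(\cdot)$ and $\tanh(\cdot)$ are the sigmoid and the tanh functions, defined as follows:
$$\delta(a) = \frac{1}{1 + e^{-a}}, \qquad \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}.$$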
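A single time step of the gate computations can be sketched in plain Python. This is an illustration only: it assumes the standard LSTM formulation in which each gate applies its weight matrix to the concatenation of the previous output and the current input, and the `params` dictionary layout is a hypothetical convenience. A real implementation would rely on a library layer such as `torch.nn.LSTM` rather than hand-written loops.

```python
import math


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def affine(W, b, v):
    """Return W v + b for a list-of-rows matrix W and vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps gate name -> (W, b) acting on [h_prev, x_t]."""
    z = h_prev + x_t  # concatenation of previous output and current input
    i_t = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate
    f_t = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate
    g_t = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate cell state
    o_t = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    # Output: the squashed cell state, scaled by the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate state to 0, so the cell state is simply halved at each step. This makes the role of the forget gate easy to verify by hand.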
The number of training sessions was set to 100. In each session, the program shuffled the entire dataset and then reprocessed, repackaged and re-split it into training and validation sets in a 9:1 ratio. The training set was then used to adjust the parameters of the model, and the validation set was used to calculate the current error of the model. When the calculated error was less than the previous minimum error, the current model parameters and the output of the model for the validation set were retained.
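The session loop described above can be sketched as follows. The callables `fit` and `error` are hypothetical placeholders for the actual parameter update and validation error computation, which are not specified in detail here. Whether the model parameters are carried over from one session to the next is also left open by the description, so this sketch leaves that decision to `fit`.

```python
import random


def split_9_to_1(records, rng):
    """Shuffle a copy of the records and split it 9:1 into train and validation."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]


def run_sessions(records, fit, error, n_sessions=100, seed=0):
    """Repeat shuffle/split/fit and keep the state with the lowest validation error.

    `fit(train)` returns a model state and `error(state, valid)` returns a float;
    both are hypothetical callables standing in for the real training code.
    """
    rng = random.Random(seed)
    best_error, best_state = float("inf"), None
    for _ in range(n_sessions):
        train, valid = split_9_to_1(records, rng)
        state = fit(train)
        current = error(state, valid)
        if current < best_error:  # retain only an improved model
            best_error, best_state = current, state
    return best_state, best_error
```

For a dataset of 3429 peptides, `split_9_to_1` yields 3086 training records and 343 validation records, which matches the test set size reported in the results.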

Model Evaluations
The model was evaluated in two dimensions. Firstly, the accuracy of the model in predicting the IC50 value of a specific peptide was assessed using the ratio of the predicted value to the reported or experimental IC50 value. The prediction was defined as "accurate" when the ratio was within the range of 0.75 to 1.25; otherwise, it was considered "inaccurate". In this way, the regression task was converted into a classification task, which was further used for the five-fold cross-validation.
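The ratio-based labelling can be expressed in a few lines of Python. This is a minimal sketch: the function names are illustrative, and treating both bounds as inclusive is an assumption, since the text does not state how values exactly at 0.75 or 1.25 are handled.

```python
def ratio_label(predicted, reference, low=0.75, high=1.25):
    """Label a prediction by the ratio of predicted to reference IC50."""
    ratio = predicted / reference
    return "accurate" if low <= ratio <= high else "inaccurate"


def fraction_accurate(pairs):
    """Share of (predicted, reference) pairs labelled accurate."""
    labels = [ratio_label(p, r) for p, r in pairs]
    return labels.count("accurate") / len(labels)
```

For instance, a predicted IC50 of 9 µM against a reported value of 10 µM gives a ratio of 0.9 and is labelled accurate, whereas a prediction of 30 µM for the same peptide gives a ratio of 3 and is labelled inaccurate.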
To assess the overall reliability of the model, a five-fold cross-validation was executed according to the literature, in which the original dataset was randomly separated into five equally sized sub-samples. Each sub-sample was used once as the test data, whereas the remaining four sub-samples were used as the training set, so that the process was repeated five times. The average of the five-fold cross-validation yielded the accuracy of the algorithm [49,50]. The results of the five-fold cross-validation were presented as the mean accuracy, average sensitivity and average specificity. In addition, the root mean square error (RMSE) of the model was calculated according to the following formula:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2}$$
where $m$ is the sample size, $y_i$ is the reported value and $\hat{y}_i$ is the predicted value.
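The fold construction and the RMSE can be sketched with the standard library alone. The helper names and the fixed random seed are illustrative assumptions. When the sample size is not divisible by five, the folds produced here differ in size by at most one.

```python
import math
import random


def rmse(reported, predicted):
    """Root mean square error between reported and predicted values."""
    m = len(reported)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(reported, predicted)) / m)


def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for five near-equal random folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]  # every fifth shuffled index
    for k in range(5):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]
```

Each index appears in exactly one test fold, so averaging a metric over the five folds uses every sample once for testing.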

The In Vitro ACE Inhibitory Assay
An online tool (https://www.genscript.com/sms2/random_protein.html, accessed on 1 March 2023) was used to randomly generate peptides in order to test the efficiency of the model. The top peptides with the smallest IC50 values were selected for synthesis. In addition, since peptides composed of fewer than six amino acid residues are stable in the gastrointestinal tract [51], the maximum length of the generated peptides was set to five amino acids. The peptides used for validation were synthesized by GenScript with a purity > 97%. The ACE inhibitory assay was performed according to a previous study [52] with modifications. ACE, N-hippuryl-His-Leu tetrahydrate (HHL, Sigma-Aldrich, St. Louis, MO, USA) and the peptide samples were dissolved in 100 mM boric acid containing 300 mM NaCl (pH 8.3). Firstly, 10 µL of the peptide solution was preincubated with 50 µL of 6.5 mM HHL at 37 °C for 5 min. Then, 5 µL of 0.1 U/mL ACE (preincubated at 37 °C) was added to the reaction system and incubated at 37 °C for another 30 min. The reaction was terminated by adding 85 µL of 1 M HCl. The concentration of hippuric acid (Hip, the reaction product) was measured by HPLC with a C18 column (5 µm, 250 mm × 4.6 mm). The sample (20 µL) was eluted by a gradient of solvent A (H2O with 0.05% TFA) and solvent B (acetonitrile with 0.05% TFA) at a flow rate of 1.2 mL/min. The absorbance at 228 nm was monitored. The concentration of Hip was calculated based on its standard curve. The area under each peak was calculated, in which A = the area under the peak of the blank group (without the peptide) and B = the area under the peak of the peptide group. The ACE inhibitory ratio = (A − B)/A. The ACE inhibitory ratio of each peptide at different concentrations was measured. The IC50 value was defined as the peptide concentration inhibiting 50% of the ACE activity.
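The inhibitory ratio calculation, and one possible way of reading an IC50 off a series of measurements, can be sketched as follows. The interpolation in log10 concentration between the two points that bracket 50% inhibition is an assumption made for illustration. The text defines the IC50 but does not state how it was derived from the measured ratios, and curve fitting is a common alternative.

```python
import math


def inhibition_ratio(area_blank, area_peptide):
    """ACE inhibitory ratio (A - B) / A from hippuric acid peak areas."""
    return (area_blank - area_peptide) / area_blank


def ic50_by_interpolation(concentrations, ratios):
    """Estimate the concentration giving an inhibitory ratio of 0.5.

    Assumes concentrations are ascending and interpolates linearly in
    log10(concentration) between the two points that bracket 0.5.
    """
    points = list(zip(concentrations, ratios))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        if r_lo <= 0.5 <= r_hi and r_hi > r_lo:
            t = (0.5 - r_lo) / (r_hi - r_lo)
            log_c = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    raise ValueError("inhibition ratios do not bracket 0.5")
```

The function raises an error when no pair of adjacent measurements brackets 50% inhibition, which signals that additional concentrations would need to be tested.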

Conclusions
In this study, a novel model utilizing the LSTM-based deep learning network was constructed to predict the activity of food protein-derived antihypertensive peptides. The model achieved excellent performance in activity prediction, which was validated by both the test set of the benchmark dataset and the in vitro ACE inhibitory assay for randomly generated peptides. Therefore, this model could be used to screen antihypertensive peptides from various food proteins. In addition, this research provides a novel aspect for the QSAR study of antihypertensive peptides.

Conflicts of Interest:
The authors declare no conflict of interest.