MPMABP: A CNN and Bi-LSTM-Based Method for Predicting Multi-Activities of Bioactive Peptides

Bioactive peptides are typically small functional peptides with 2–20 amino acid residues that play versatile roles in metabolic and biological processes. Many bioactive peptides are multi-functional, making it highly challenging to detect all of their activities simultaneously. We proposed a convolutional neural network (CNN) and bi-directional long short-term memory (Bi-LSTM)-based deep learning method (called MPMABP) for recognizing multi-activities of bioactive peptides. The MPMABP stacked five CNNs at different scales and used a residual network to prevent information loss. The empirical results showed that the MPMABP is superior to the state-of-the-art methods. Analysis of the distribution of amino acids indicated that lysine appears preferentially in anti-cancer peptides, leucine in anti-diabetic peptides, and proline in anti-hypertensive peptides. The method and analysis are beneficial for recognizing multi-activities of bioactive peptides.


Introduction
Bioactive peptides are small protein fragments that generally contain 2–20 amino acid residues [1,2]. Bioactive peptides remain inactive while encrypted in their precursor proteins and become active once released from them. Bioactive peptides are not only distributed widely in foods, plants, and animals [3], but also play versatile roles in metabolic and biological processes. For example, some bioactive peptides were reported to resist the action of digestive peptidases [4], some were shown to possess anti-bacterial and anti-oxidant activity [1], while some had immunomodulatory and anti-cancer activities [5]. Therefore, it is of great importance to accurately identify the activities of bioactive peptides, in at least two respects: (1) it helps promote understanding of the mechanisms of bioactive peptides; and (2) it is fundamental to developing new natural foods and drugs that meet the demands for safety and health.
Bioactive peptides are organic substances comprising amino acids joined by covalent bonds. According to their mode of action, bioactive peptides are classified as anti-microbial peptides (AMPs), anti-diabetic peptides (ADPs), anti-hypertensive peptides (AHPs), anti-inflammatory peptides (AIPs), anti-cancer peptides (ACPs), anti-oxidant peptides, immunomodulatory peptides, and so on [3,6]. Since AMPs have anti-bacterial, anti-fungal, or anti-viral properties, they are also called host defense peptides (HDPs), which are distributed widely in the innate immune response system. When a host is invaded by foreign organisms such as viruses or bacteria, AMPs are induced to destroy or kill the invaders through a membrane-damage mechanism [1,7]. Many computational methods have been developed to identify specific activities of bioactive peptides; for example, methods for distinguishing therapeutic peptides from non-therapeutic peptides include PEPred-Suite [81], PTPD [82], PPTPP [83], and PreTP-EL [84]. Most bioinformatics approaches suffer from the small number of available bioactive peptide samples. He et al. [85] pioneered mutual-information-based meta-learning to address the small-sample problem in bioactive peptide prediction, while Zhang et al. [48] employed the pre-trained natural language model BERT [86] to predict AMPs.
All the previous methods are only suitable for differentiating a specific activity of bioactive peptides. In practice, a bioactive peptide might simultaneously possess multiple activities. Computationally identifying the activities of bioactive peptides is therefore a multi-label, multi-class problem. Recently, Tang et al. [87] presented a convolutional neural network (CNN) and gated recurrent unit (GRU)-based deep learning method (called MLBP [87]) for predicting multi-activities of bioactive peptides. This is a promising avenue for identifying the actual activities of bioactive peptides. For a deep learning method, the ability to learn a representation depends on which components it adopts and how the components are combined. The MLBP [87] is a deep learning architecture with three parallel CNNs at different scales followed by the GRU [88]. The CNN is the most widely used neural network architecture, especially in the field of image processing, and is capable of characterizing local properties [89,90], while the long short-term memory (LSTM) is a popular architecture for capturing semantics in the context of text sequences [91]. A structure in which the CNN is followed directly by the LSTM would absorb the merits of both components. However, the MLBP attached the GRU jointly to the three-scale CNNs, which causes multi-scale information loss. In addition, as the depth of a deep neural network increases, the original sequence information degrades seriously. On the basis of the analysis above, we improved the MLBP [87] in two respects. First, we used multi-branch CNNs, each followed directly by a semantic architecture, to improve the representation of peptides. Second, we used the residual network architecture to prevent loss of sequence information in the forward process. In addition, we replaced the GRU with the Bi-LSTM. The proposed method is abbreviated as MPMABP. Empirical experiments showed that the MPMABP outperformed the MLBP [87].

Optimization of Parameters
In the MPMABP, there are many user-defined hyper-parameters, such as the embedding dimension, the learning rate, the dropout rate, and the pooling size, which influence its predictive performance. We separated 20 percent of the training set as a validation set to investigate their influence. We tested four embedding dimensions (50, 100, 150, and 200), five learning rates (0.1, 0.01, 0.005, 0.003, and 0.001), four dropout rates (0.1, 0.2, 0.3, and 0.5), and four pooling sizes (2, 3, 4, and 5). As shown in Figure 1, the accuracies are relatively stable across the values of each hyper-parameter. Based on general experience, we set the embedding dimension to 100, the learning rate to 0.001, the dropout rate to 0.5, and the pooling size to 3. The details of the other hyper-parameters in the MPMABP are listed in Table 1.
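To make the tuning protocol concrete, below is a minimal sketch of how such a search could be run. It is illustrative only: `train_and_score` is a hypothetical callable that trains MPMABP with a given configuration and returns validation accuracy, and the full grid shown here is for simplicity, whereas the study varies one hyper-parameter at a time.

```python
import itertools
from sklearn.model_selection import train_test_split

def grid_search(X, y, train_and_score):
    """Evaluate hyper-parameter combinations on a 20% validation split.

    `train_and_score(X_tr, y_tr, X_val, y_val, **params)` is assumed to
    train the model and return validation accuracy; it stands in for the
    actual MPMABP training routine, which is not shown here.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    grid = {
        "embed_dim": [50, 100, 150, 200],
        "lr": [0.1, 0.01, 0.005, 0.003, 0.001],
        "dropout": [0.1, 0.2, 0.3, 0.5],
        "pool_size": [2, 3, 4, 5],
    }
    keys = list(grid)
    best = None
    for values in itertools.product(*grid.values()):
        params = dict(zip(keys, values))
        acc = train_and_score(X_tr, y_tr, X_val, y_val, **params)
        if best is None or acc > best[0]:
            best = (acc, params)
    return best  # (best accuracy, best hyper-parameters)
```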

Comparison with State-of-the-Art Methods
To the best of our knowledge, the MLBP [87] is the most recent method for classifying multi-functional, multi-label bioactive peptides. There are also general multi-label algorithms applicable to predicting bioactive peptides, such as calibrated label ranking (CLR) [92], random k-label sets (RAKEL) [93], ranking support vector machine and binary relevance with robust low-rank learning (RBRL) [94], and multi-label learning with deep forest (MLDF) [95]. We conducted the same experiments as the MLBP [87] for comparison. As shown in Table 2, the MPMABP outperformed the MLBP in terms of Precision, Coverage, Accuracy, and Absolute true, improving the Precision by about 0.034, the Coverage by 0.037, the Accuracy by 0.027, and the Absolute true by 0.011. The lower the Absolute false, the better the predictive performance; the MPMABP decreased the Absolute false by 0.010. We also compared the five methods on the independent test. As shown in Table 3, the MPMABP achieved the best Precision, Coverage, Accuracy, and Absolute true and the lowest Absolute false, implying that it is comprehensively superior to the state-of-the-art methods.
We further compared the predictive performances of the five methods on single-functional bioactive peptides. SN and SP were computed by Equations (7) and (8), where the investigated category of bioactive peptides is viewed as positive and the others as negative. For example, when computing the SN of the AIP, all AIP bioactive peptides were viewed as positive and the others as negative. As shown in Figure 2, predictive performance differs greatly across categories. The MPMABP reached the best SN for the AIP and the AHP, and the best SP for the ACP and the ADP. However, the MPMABP is inferior to the MLBP in terms of the AMP, the ACP, and the ADP. Since the predictive performances of the MPMABP on the AHP are far better than those of the MLBP, the MPMABP is, as a whole, superior to the MLBP.
Recently, many methods have been developed to identify a single activity of bioactive peptides. To validate the effectiveness and efficiency of the MPMABP in classifying activities of bioactive peptides, we compared it with some state-of-the-art methods that provide web applications, i.e., IAMP-RAAC [96], mAHTPred [97], AHPPred [98], and AIPpred [99]. The IAMP-RAAC [96] is a reduced amino acid cluster-based method for distinguishing AMPs from ACPs, the mAHTPred [97] is a meta-predictor for AHPs, the AHPPred [98] is a CNN- and LSTM-based method for AHP prediction, and the AIPpred [99] is a random-forest-based predictor for AIPs. Except for the IAMP-RAAC [96], each method can only be applied to predict one specific activity of bioactive peptides. For a fair comparison, we removed bioactive peptides overlapping with the training samples and evaluated each method on the samples it shares with these independent tests. Table 4 lists the performances (SN). Except for the mAHTPred, the MPMABP outperformed the other three state-of-the-art methods.

Case Study
To further demonstrate the predictive ability of the MPMABP, we randomly chose 10 bioactive peptides for prediction. Table 5 lists the predictions of three methods on these 10 bioactive peptides. The MPMABP correctly predicted the multi-activities of all 10 bioactive peptides. The MLBP [87] correctly predicted 7 of the 10 bioactive peptides, partly correctly predicted 2, and mispredicted 1. MultiPep [100] is another method able to predict up to 12 types of bioactive peptides. We used the MultiPep webserver (https://agbg.shinyapps.io/MultiPep/, accessed on 5 February 2022) to perform the prediction. The MultiPep predicted fewer classes than the true classes for ADP-463 and AIP-1050 and predicted more classes for the other seven bioactive peptides. These 10 cases illustrate the superior predictive performance of the MPMABP over the MLBP [87] and the MultiPep [100].

Discussion
The MPMABP is a CNN and Bi-LSTM-based deep learning method for predicting multi-label bioactive peptides. The MPMABP stacked five CNN and Bi-LSTM modules in a parallel manner and utilized the ResNet to preserve necessary information in the forward process. We investigated the predictive performances of the MPMABP without the ResNet (called MPMABPwr) and of a variant with the modules connected in series (called MPMABPsc). Table 6 shows their predictive performances over the 5-fold cross-validation and the independent test. Contrasting Table 6 with Tables 2 and 3, we found that the inclusion of the ResNet and the parallel arrangement each remarkably improved the predictive performance.
The CNN and the LSTM are two dominant components in deep learning, each with its own advantages. The CNN is good at characterizing local properties, while the LSTM does well in capturing the semantics of words in the context of sequences. We combined the two architectures to make full use of their merits. We also experimented with several simpler deep neural network architectures: the MPMABP without the CNN, the MPMABP without the LSTM, and the MPMABP with only one branch. Tables 7 and 8 show the predictive performances by five-fold cross-validation and by the independent test. Excluding the CNN or the LSTM from the original MPMABP led to a decrease in predictive performance, and the degenerated variants also reduced the ability to accurately classify bioactive peptides.
We investigated the distribution of amino acids over the five categories of bioactive peptides. As shown in Figure 3, some distributions are common to all classes, but some differ remarkably across types of bioactive peptides. The amino acid K appears more frequently in the ACP, P more frequently in the AHP, and L more frequently in the ADP.
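For reference, the per-class residue distributions underlying Figure 3 can be computed with a routine like the following sketch; `peptides_by_class`, mentioned in the comment, is a hypothetical mapping from class names to sequence lists, not part of the published code.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(peptides):
    """Relative frequency of each residue across a list of peptide strings."""
    counts = Counter()
    for seq in peptides:
        counts.update(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}

# Hypothetical usage: `peptides_by_class` maps a class name (e.g., "ACP")
# to its list of sequences; comparing the returned distributions reproduces
# the kind of analysis behind Figure 3.
# freqs = {cls: aa_frequencies(seqs) for cls, seqs in peptides_by_class.items()}
```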

Datasets
We used the same experimental dataset as in [87]. The dataset was retrieved by searching the Google Scholar engine with the keyword 'bioactive peptide' in 2020 [87]. The initial dataset included 18 types of bioactive peptides. Since too small a number of training samples cannot favorably train a deep neural network, the types with fewer than 500 peptides were dropped. Consequently, five types of functional peptides (AMP, ACP, ADP, AHP, and AIP) were preserved. The clustering tool CD-HIT [101] was used to remove or reduce redundancy and homology, with the sequence identity threshold set to 0.9. The final numbers of the ACP, the ADP, the AHP, the AIP, and the AMP are 646, 514, 868, 1678, and 2409, respectively, as shown in Figure 4. Most bioactive peptides have only one type of activity, a small number simultaneously have two types, and none belong to more than two types. This is therefore a multi-class and multi-label issue. In total, 80 percent of all the peptides were randomly sampled as the training set and the remaining 20 percent were used as the testing set.
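A minimal sketch of how the multi-label targets and the 80/20 split could be prepared is shown below; the class ordering and helper name are illustrative assumptions, not the published code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

CLASSES = ["AMP", "ACP", "ADP", "AHP", "AIP"]

def to_label_vector(activity_set):
    """Encode a set of activities, e.g. {"ACP", "AMP"}, as a 0/1 vector."""
    return np.array([1 if c in activity_set else 0 for c in CLASSES],
                    dtype=np.float32)

# Hypothetical usage: `sequences` and `activities` are parallel lists.
# y = np.stack([to_label_vector(a) for a in activities])
# X_train, X_test, y_train, y_test = train_test_split(
#     sequences, y, test_size=0.2, random_state=42)
```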

Methodology
As shown in Figure 5, the proposed MPMABP is an end-to-end deep learning model made up of 1D CNNs, LSTMs, an embedding layer, batch normalization, and fully connected layers. The input to the MPMABP consists of amino acid sequences, which are transformed into continuous vectors by the embedding layer. Five parallel modules follow the batch-normalization layer to extract deep and abstract representations, each of which is constructed by linking a 1D CNN, a Bi-LSTM, and max pooling in order. To preserve information, the ResNet structure is used. All the representations are concatenated and fed into the classification module, which consists of three fully connected layers and dropout layers. The final fully connected layer has five neurons with the sigmoid function; the output of each neuron represents the probability that the input belongs to the corresponding type of peptide.
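The following PyTorch sketch assembles the pipeline described above. It is illustrative rather than the published implementation: the intermediate kernel sizes (only 3 and 12 are stated in the text), the channel, hidden, and head widths are assumptions, and the skip path carrying the original embedding is concatenated with the branch outputs rather than summed, for shape compatibility.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One parallel module: 1D CNN -> Bi-LSTM -> max pooling.
    Channel/hidden sizes are illustrative assumptions."""
    def __init__(self, embed_dim, kernel_size, channels=64, hidden=32, pool=3):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, channels, kernel_size, padding="same")
        self.lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):                      # x: (batch, embed_dim, length)
        h = torch.relu(self.conv(x))           # (batch, channels, length)
        h, _ = self.lstm(h.transpose(1, 2))    # (batch, length, 2*hidden)
        return self.pool(h.transpose(1, 2))    # (batch, 2*hidden, length // pool)

class MPMABPSketch(nn.Module):
    def __init__(self, vocab_size=21, embed_dim=100, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.norm = nn.BatchNorm1d(embed_dim)
        # Only the endpoint kernel sizes (3 and 12) are stated in the text;
        # the intermediate sizes are assumptions.
        self.branches = nn.ModuleList(
            [Branch(embed_dim, k) for k in (3, 5, 7, 9, 12)])
        self.skip_pool = nn.MaxPool1d(3)       # skip path with the original embedding
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, n_classes), nn.Sigmoid(),
        )

    def forward(self, tokens):                 # tokens: (batch, length) integer codes
        x = self.norm(self.embed(tokens).transpose(1, 2))  # (batch, embed_dim, length)
        feats = [b(x) for b in self.branches] + [self.skip_pool(x)]
        return self.head(torch.cat(feats, dim=1))
```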

Embedding Layer
The embedding layer converts text sequences into continuous numerical vectors. Before embedding, we pre-processed the peptide sequences. Since the sequence lengths of the bioactive peptides are not identical, ranging from 5 to 517, we padded peptides shorter than 517 residues with the special character 'X'. All characters of the peptides were then converted into integers, and these integer sequences are the actual input to the embedding layer.
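A minimal sketch of this pre-processing, assuming residues are mapped to the integers 1–20 with 0 reserved for the padding character 'X' (the exact mapping is an assumption), and using the embedding dimension of 100 chosen above:

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
# 0 is reserved for the padding character 'X'; real residues map to 1..20.
CODE = {"X": 0, **{a: i + 1 for i, a in enumerate(AA)}}
MAX_LEN = 517

def encode(seq):
    """Pad a peptide to MAX_LEN with 'X' and convert it to integer codes."""
    padded = seq + "X" * (MAX_LEN - len(seq))
    return torch.tensor([CODE[c] for c in padded], dtype=torch.long)

embedding = nn.Embedding(num_embeddings=21, embedding_dim=100, padding_idx=0)
# An illustrative peptide string; output shape is (1, 517, 100).
vectors = embedding(encode("GLFDIVKKVVGALGSL").unsqueeze(0))
```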

Multi-Scale CNN
The CNN is one of the most important components for constructing deep, complex neural networks. It was initially created by Fukushima et al. [102,103], its training was put on a theoretical footing by backpropagation [104], and it was later developed dramatically through integration with deep neural networks [89,90,105,106]. At the heart of the CNN is the convolution operation, which multiplies the receptive field with the convolution kernel element-wise and then sums all the products. The convolution kernel plays the role of a filter in signal processing and is thus also called a filter. The size of the convolution kernel is influential for the representation of the original features: a larger kernel can capture global structures, while a smaller one can characterize local structures. To extract representations at different scales from the sequences, we used five convolution kernels with different sizes. As shown in Figure 5, the smallest size is 3 and the largest size is 12. We thereby obtained multi-scale representations of the primary sequences.
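As a sketch, the five different-scale convolutions can be applied in parallel as follows; the intermediate kernel sizes and the channel count are assumptions, since only the endpoints 3 and 12 are stated in the text.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 100, 517)  # (batch, embed_dim, sequence length)

# One convolution per scale; larger kernels see wider sequence context.
convs = [nn.Conv1d(100, 64, k, padding="same") for k in (3, 5, 7, 9, 12)]
scales = [torch.relu(c(x)) for c in convs]  # five (1, 64, 517) feature maps
```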

Bi-LSTM
The LSTM proposed by Hochreiter et al. [107] is an improved recurrent neural network (RNN) [108–110]. The LSTM [107] introduces gate mechanisms, namely the forget gate, the output gate, and the input gate, and thus alleviates the vanishing and exploding gradient issues that occur in long-sequence analysis. Compared with the traditional RNN, the LSTM is capable of capturing long-distance dependencies. Therefore, the LSTM has been used in a wide range of fields, including action recognition [111], succinylation prediction [28], and N4-acetylcytidine prediction [21]. A single LSTM is unidirectional and generally uncovers relationships only with preceding words; therefore, the Bi-LSTM is used in practice. The Bi-LSTM [91,112] is composed of two LSTMs running in opposite directions, one from front to back and the other from back to front. The two LSTMs receive identical inputs but have separate learnable parameters, and their outputs are concatenated as the output of the Bi-LSTM.
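A small example of this concatenation behavior, using PyTorch's bidirectional LSTM (the input and hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

# Forward and backward LSTMs share the input but learn separate parameters;
# their outputs are concatenated along the feature dimension.
bilstm = nn.LSTM(input_size=64, hidden_size=32,
                 batch_first=True, bidirectional=True)
x = torch.randn(1, 517, 64)   # (batch, length, features)
out, _ = bilstm(x)            # (1, 517, 64): 32 forward + 32 backward units
```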

Pooling
Pooling is a popular operation in the CNN which serves as non-linear down-sampling. Pooling has two roles: it decreases the dimensionality of representations, saving storage space and accelerating computation, and it helps avoid over-fitting. Common pooling operations include max pooling and average pooling; we used max pooling herein.
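For example, max pooling with size 3, the value chosen above, reduces the sequence dimension roughly threefold:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 517)                 # (batch, channels, length)
pooled = nn.MaxPool1d(kernel_size=3)(x)     # (1, 64, 172): length cut ~3x
```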

ResNet
The ResNet [113] is an improved version of the CNN that is very simple but effective. As shown in Figure 5, the ResNet consists mainly of two branches: an identity branch that links directly to the next layer and a convolutional branch. The sum of the input and the output of the convolutional branch is the output of the ResNet. The ResNet enables the construction of deeper neural networks without loss of information, and residual connections have since been widely adopted in popular architectures such as the Transformer. Here, we used the ResNet to fuse the multi-scale representations with the original information.
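A minimal residual block matching this description, with an identity branch summed with a convolutional branch (the kernel size here is an assumption):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Identity skip connection around a 1D convolution: the block output is
    the sum of its input and the convolution's output, so the original
    information is carried forward unchanged."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding="same")

    def forward(self, x):
        return x + torch.relu(self.conv(x))
```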

Fully Connected Layer
The fully connected layer is identical to a hidden layer in the multilayer perceptron and computes a linear transformation of its input; it is therefore essential for classification and embedding representations in deep learning. We used three fully connected layers, the last of which has five neurons, each representing a class of functional peptide. Because this is a multi-label, multi-class issue, we used the sigmoid activation function in the last fully connected layer. A neuron output greater than 0.5 indicates that the input belongs to the corresponding functional class. We also applied dropout after the first and the second fully connected layers to reduce over-fitting.
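A sketch of the classification head with the 0.5 decision rule; the input and hidden-layer widths are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),  # widths are assumptions
    nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 5), nn.Sigmoid(),                   # one probability per activity
)
probs = head(torch.randn(1, 512))
predicted_labels = probs > 0.5   # independent 0.5 threshold per class
```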

Validation and Evaluation Metrics
We employed both hold-out validation and 5-fold cross-validation to examine the proposed method. In the hold-out validation, 80 percent of all the experimental peptides are sampled randomly as the training set, and the remaining 20 percent serve as the validation set; the model is trained on the training set and then validated on the validation set. In the 5-fold cross-validation, the training set is split evenly into five parts: four parts are used to train the model and the remaining part is used to test it, and the process is repeated five times.
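Both protocols can be reproduced with scikit-learn, as in the following sketch with stand-in data:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(100)                          # stand-in for encoded peptides
y = np.random.randint(0, 2, size=(100, 5))  # stand-in multi-label targets

# Hold-out: 80% train / 20% validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation over the training set.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X_tr):
    fold_train, fold_test = X_tr[train_idx], X_tr[test_idx]
    # train on fold_train, evaluate on fold_test
```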
For convenient comparison with the state-of-the-art methods, we used the same evaluation metrics as the MLBP [87], the CLR [92], the RAKEL [93], the MLDF [95], and the RBRL [94]. These metrics are defined below:
$$\text{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cap L_i^*\|}{\|L_i^*\|}$$

$$\text{Coverage} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cap L_i^*\|}{\|L_i\|}$$

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cap L_i^*\|}{\|L_i \cup L_i^*\|}$$

$$\text{Absolute true} = \frac{1}{N}\sum_{i=1}^{N}\text{ID}(L_i, L_i^*)$$

$$\text{Absolute false} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cup L_i^*\| - \|L_i \cap L_i^*\|}{M}$$

where $L_i$ and $L_i^*$ denote the sets of actual and predicted labels for the sample $i$, respectively, $N$ is the total number of testing samples, $\cup$ and $\cap$ denote the union and intersection of sets, respectively, $\|A\|$ is the number of elements of the set $A$, $M$ is the total number of label classes, and $\text{ID}$ is defined as:

$$\text{ID}(L_i, L_i^*) = \begin{cases} 1, & L_i = L_i^* \\ 0, & \text{otherwise} \end{cases}$$

For Precision, Coverage, Accuracy, and Absolute true, a greater value means better predictive performance; on the contrary, a smaller Absolute false indicates better predictive performance.
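These metrics translate directly into code; below is a sketch over lists of label sets, following the definitions above.

```python
def multilabel_metrics(true_sets, pred_sets, n_labels=5):
    """Precision (aiming), Coverage, Accuracy, Absolute true and Absolute
    false as defined above; `true_sets`/`pred_sets` are lists of label sets."""
    N = len(true_sets)
    prec = cov = acc = abs_true = abs_false = 0.0
    for L, Lp in zip(true_sets, pred_sets):
        inter, union = len(L & Lp), len(L | Lp)
        prec += inter / len(Lp) if Lp else 0.0
        cov += inter / len(L) if L else 0.0
        acc += inter / union if union else 0.0
        abs_true += 1.0 if L == Lp else 0.0
        abs_false += (union - inter) / n_labels
    return {k: v / N for k, v in
            {"Precision": prec, "Coverage": cov, "Accuracy": acc,
             "Absolute true": abs_true, "Absolute false": abs_false}.items()}
```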
We also employed the sensitivity (SN) and specificity (SP), which are frequently used evaluation metrics in binary classification, defined as:

$$\text{SN} = \frac{TP}{TP + FN} \tag{7}$$

$$\text{SP} = \frac{TN}{TN + FP} \tag{8}$$

where $TP$ and $TN$ are the numbers of true positive and true negative samples, respectively, and $FP$ and $FN$ are the numbers of false positive and false negative samples, respectively. Since this is a multi-label, multi-class issue rather than a binary classification, we viewed it as five binary classifications: for a given class, all the samples with that class are positive and the others are negative. For example, when computing SN and SP for the AMP, all peptides of the AMP are positive, and peptides of other classes are negative.
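A per-class computation of SN and SP under this one-vs-rest view might look like:

```python
def sn_sp(y_true, y_pred):
    """Sensitivity and specificity for one class treated as a binary task;
    `y_true`/`y_pred` are iterables of 0/1 labels for that class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```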

Conclusions
Most bioactive peptides play therapeutic roles, such as resisting microbes and cancer, and are promising safe and natural organic substances. We presented a CNN and Bi-LSTM deep learning method for classifying multi-label bioactive peptides from primary protein sequences. Compared with the latest state-of-the-art method (MLBP), the presented method made two remarkable improvements: stacking the CNN and Bi-LSTM modules in parallel and utilizing the ResNet. The former extracts multi-scale information from sequences, while the latter reduces information loss in the forward process. Both improvements enhance the predictive performance. We also found that the distribution of amino acids varies with the category of bioactive peptide: the amino acid P is enriched in the AHP, L in the ADP, and K in the ACP. This finding is helpful for determining the activities of bioactive peptides.