Boosted Prediction of Antihypertensive Peptides Using Deep Learning

Abstract: Heart attack and other heart-related diseases are among the main causes of fatalities in the world. These diseases and some other severe problems like kidney failure and paralysis are mainly caused by hypertension. Since bioactive peptides extracted from naturally existing food substances possess antihypertensive activity, these antihypertensive peptides (AHTPs) can function as prospective replacements for existing pharmacological drugs with no or fewer side effects. Such naturally existing peptides can be identified using in-silico approaches, which have been proven to save huge amounts of time and money in the identification of effective peptides. The proposed methodology is a deep learning-based in-silico approach for the identification of AHTPs. An ensemble method is proposed that combines convolutional neural network (CNN) and support vector machine (SVM) classifiers. Amino acid composition (AAC) and g-gap dipeptide composition (DPC) techniques are used for feature extraction. The proposed methodology has been evaluated on two standard antihypertensive peptide sequence datasets. The model yields 95% accuracy on the benchmarking dataset and 88.9% accuracy on the independent dataset. A comparative analysis is provided to demonstrate that the proposed method outperforms existing state-of-the-art methods on both the benchmarking and independent datasets.


Introduction
Hypertension (HT) is a common medical condition that affects about twenty-five percent of the population, and the chances of a person becoming hypertensive increase with age [1]. Hypertension results from persistently high blood pressure and is known as a silent killer. Unlike diseases such as fever and asthma, HT typically has no apparent symptoms, so it might take a while for a person to be diagnosed as hypertensive. Delayed diagnosis may lead to severe medical issues such as stroke, cardiovascular diseases, renal failure, multi-infarct dementia, and damage to the brain [2,3].
The high prevalence and dangerous effects of hypertension signify the need to discover novel treatments and drugs to lessen or eliminate its consequences. Currently, many drugs for HT are available on the market, such as angiotensin-converting enzyme (ACE) inhibitors, beta-blockers, calcium channel blockers, K+-sparing diuretics, loop diuretics, and thiazide diuretics [4]. Although these drugs have been proven to be beneficial for the treatment of HT, they may cause notable side effects, such as hypotension, metabolic alkalosis, depression, hallucinations, vivid dreams, hyperglycemia, angioneurotic edema, cough, ankle edema, tachycardia, headache, urinary urgency, functional renal insufficiency, hyponatremia, impotence, and insomnia [5]. Therefore, the discovery of safe medicines to treat and/or reduce the harmful effects of hypertension is indispensable. A medicine is considered safe if it has no or minimal side effects.
The angiotensin-converting enzyme (ACE) is an essential element of the renin-angiotensin system (RAS). It is responsible for balancing the fluid volumes in the body and thus for controlling the blood pressure. ACE catalyzes the conversion of angiotensin-I (a decapeptide) into the active angiotensin-II (an octapeptide), which constricts the blood vessels. Angiotensin-II is a powerful vasoconstrictor hormone and a mineralocorticoid-stimulating peptide that controls blood pressure.
Thus, ACE is indirectly responsible for high blood pressure by causing blood vessels to constrict. Many biological peptides have the potential to inhibit ACE in the renin-angiotensin system, and are thus useful for preventing and treating HT [6]. A large number of such peptides are found in the proteins of foods derived from animals and plants, such as fish, cheese, milk, egg, corn, algae, microorganisms, insects, fungi, wakame, amaranth, soybean, wheat, chicken, snake, and bovine [7]. Such peptides are called antihypertensive peptides (AHTPs), and their identification can lead to the development of more beneficial drugs (with fewer side effects) for the treatment of HT. However, identifying a peptide that can function as an antihypertensive peptide is expensive in terms of both time and cost. In-silico approaches can be of great help in developing systems that filter out the non-antihypertensive peptides (non-AHTPs) and provide a set of peptides with the potential to inhibit hypertensive conditions. There is a great need for such systems to produce highly accurate predictions. Recently, a limited number of studies have demonstrated the power of machine learning (ML)-based methods for developing AHTP prediction systems.
First, Wang et al. quantitatively defined the relationship between target molecular structures and biological activities, and created a quantitative structure-activity relationship (QSAR) model relating the structure of ACE-inhibitor peptides to their biological activities [8]. To create this model, they utilized g-scale features and partial least squares (PLS) regression. The model was built on very small peptides (e.g., peptides of length two and three) and can only predict the inhibitory activity of these tiny peptides, which is the main disadvantage of this study.
Another approach was proposed by Kumar et al. in 2015 [9]. The dataset was divided into four categories for feature extraction: (1) tiny (dipeptides and tripeptides), (2) small (tetrapeptides, pentapeptides, and hexapeptides), (3) medium (seven to twelve amino acids), and (4) large (more than twelve amino acids). For tiny peptides, chemical descriptors were extracted to generate support vector machine (SVM)-based regression models, which achieved correlations of 0.701 for dipeptides and 0.543 for tripeptides. For small peptides, SVM-based classification models were developed, and the maximum accuracies obtained were 76.67% for tetrapeptides, 72.04% for pentapeptides, and 77.39% for hexapeptides. For medium and large peptides, amino acid composition features were extracted to develop SVM-based classification models, which attained accuracies of 82.61% for medium peptides and 84.21% for large peptides. Moreover, a web-based platform called AHTpin was created to screen, predict, and design AHTPs.
Win et al. developed an automated AHTP prediction system [10] that employs random forest (RF) models. The models are trained using combinations of the amino acid composition, dipeptide composition, and pseudo amino acid composition feature encoding techniques. The system demonstrates a marginal improvement over AHTpin, with an accuracy of 84.73% on an independent test dataset. Furthermore, the feature importance analysis highlighted a preference for proline and non-polar amino acids at the C-terminal, as well as the capacity of short peptides for robust activity. The model is also available for public use through an online web platform called PAAP.
The mAHTPred meta-predictor is another approach to classifying AHTPs [11]. Manavalan et al. developed mAHTPred by using eight feature encoding schemes to construct 51 feature vectors from two different datasets (a benchmarking dataset and an independent dataset). Extremely randomized tree (ERT)-based models were built on these 51 feature vectors, and the AHTP probabilities predicted by these models were assembled into a new feature vector. This new feature vector was then used as the input for four different ML algorithms: SVM, gradient boosting (GB), RF, and ERT. The final prediction was made by an ensemble of the SVM, GB, RF, and ERT models.
AHTP prediction has also been performed in [12] using recursive feature elimination. This work selects optimal features and feeds them into an ensemble of four classification algorithms (SVM, C4.5 decision tree, random forest (RF), and extreme gradient boosting (XGBoost)) to obtain the final prediction. Deep-AmPEP30 is a recent deep learning-based model devised for the prediction of short antimicrobial peptides (AMPs), another important class of bioactive peptides [13]. Deep-AmPEP30 applies a CNN to a reduced set of amino acid composition (AAC) features. On a benchmark dataset with balanced classes, it achieved an accuracy of 77% and a score of 85% for both the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR).
In this work, a deep learning-based antihypertensive peptide predictor is presented. The proposed approach uses the standard datasets used in [9] and [11]. Features are generated by using two feature-encoding techniques: amino acid composition (AAC) and g-gap dipeptide composition (g-gap DPC). The dipeptide composition features are further represented as red, green, and blue (RGB) images, which are created for each dataset. In the literature, ensemble-based methods have been successfully applied to classification tasks [11,12]. In such methods, multiple classifiers are combined in parallel or in sequence to boost the classification performance. The proposed method likewise uses an ensemble of classifiers to make a boosted prediction, employing a convolutional neural network (CNN) followed by a support vector machine (SVM). Four different CNN models are trained on the generated image datasets. The predicted outputs of these four CNN models are combined with the AAC features of every sequence, resulting in new feature vectors. These new feature vectors are then used to train an SVM model for the final classification of the peptide sequences as either antihypertensive or not. The proposed predictor is evaluated using 10-fold cross-validation, and achieves an accuracy of 95% on the benchmarking dataset and 88.9% on the independent dataset.

Materials and Methods
This work is carried out in the following six steps: (1) data acquisition and analysis, (2) feature extraction, (3) RGB image generation, (4) training the CNN models, (5) generating new feature vectors from the CNN models' outputs and the AAC features, and (6) training the SVM classifier. Details of each step are provided in this section. The overall methodology is outlined in Figure 1.

Data Acquisition and Analysis
Two datasets are used in this work. One is the benchmarking dataset [9] and the other is an independent dataset [11]. For model construction, we used the same benchmarking dataset as previous works [9,11] so that our approach can be compared with them. The independent dataset was used as test data during the evaluation.

Both datasets consist of antihypertensive peptides as positive class samples and non-antihypertensive peptides as negative class samples. All peptide sequences [...]. The benchmarking dataset contains 913 samples [...] the negative class. Similarly, the independent dataset [...]. Data analysis of both datasets, and the total number [...] amino acid residues (A, R, N, D, C, Q, E, G, [...]) [...] datasets, is given in Table 1. Statistics for the peptide sequences are presented in Table 2, which shows the minimum, maximum, and average lengths of the peptide sequences in both datasets.

Feature Extraction
Features are extracted from the peptide sequence dataset by applying two feature encoding schemes (i.e., AAC and g-gap DPC). AAC yields 20 features, and DPC yields 400 features (a 20-by-20 matrix) for each gap value of zero, one, two, and three. Overall, there are 1620 features (20 + 4 × 400) for each sample of the dataset. The feature encoding schemes are described in detail below.

Amino Acid Composition (AAC)
Protein sequences are built from twenty standard amino acids. Amino acid composition is a feature extraction technique that represents a peptide by the percentage of each amino acid in the given peptide sequence. Figure 2 shows an example of calculating the AAC features for a sample peptide sequence. In this way, we get a feature vector of size twenty.
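The AAC computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the alphabetical residue ordering, the function name `aac`, and the example peptide are assumptions made here.

```python
from collections import Counter

# The twenty standard amino acids in one-letter code. Alphabetical ordering is
# an assumption; the ordering used in the paper's figures may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a peptide.

    Each entry is the percentage of residues in the sequence that are the
    corresponding amino acid.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return [100.0 * counts[aa] / len(sequence) for aa in AMINO_ACIDS]


features = aac("IPPA")  # hypothetical example peptide
print(len(features))                     # 20
print(features[AMINO_ACIDS.index("P")])  # 50.0 (two of four residues are proline)
```

Residues outside the twenty standard letters are simply ignored by this sketch, so the percentages sum to 100 only for sequences made of standard amino acids.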

G-Gap Dipeptide Composition (DPC)
A dipeptide is a pair of two amino acid residues, with or without a gap between them. DPC is computed as the ratio of the number of occurrences of a dipeptide to the entire length of the sequence. It can also be computed with a gap of g amino acids between the two residues, where the value of g ranges from 0 to 3. Such a DPC feature is called g-gap dipeptide composition. DPC generates a 20-by-20 matrix that contains 400 features for each value of the g-gap, as shown in Figure 3 (for g-gap = 0).
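As a hedged sketch of the g-gap DPC described above, the following Python function counts residue pairs separated by exactly g positions and divides by the sequence length, as the text states. The residue ordering, the function name `g_gap_dpc`, and the example peptide are assumptions; some implementations normalize by the number of pairs instead of the sequence length.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def g_gap_dpc(sequence, g=0):
    """Return a 20x20 matrix of g-gap dipeptide compositions.

    Entry [i][j] is the number of positions where amino acid i is followed by
    amino acid j with exactly g residues in between, divided by the sequence
    length (the normalization stated in the text).
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    matrix = [[0.0] * 20 for _ in range(20)]
    for k in range(len(sequence) - g - 1):
        first, second = sequence[k], sequence[k + g + 1]
        if first in INDEX and second in INDEX:  # skip non-standard residues
            matrix[INDEX[first]][INDEX[second]] += 1.0
    n = len(sequence)
    return [[count / n for count in row] for row in matrix]


# One matrix per gap value, 4 x 400 = 1600 DPC features in total.
dpc_by_gap = {g: g_gap_dpc("IPPA", g) for g in range(4)}  # hypothetical peptide
print(dpc_by_gap[0][INDEX["P"]][INDEX["P"]])  # 0.25 (the pair "PP" occurs once)
print(dpc_by_gap[1][INDEX["I"]][INDEX["P"]])  # 0.25 ("I?P" occurs once)
```

For sequences shorter than g + 2 residues, the loop does not run and the matrix stays all zeros.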

RGB Image Generation
DPC-generated 20-by-20 matrices are used to generate RGB images. Each image consists of three (20 × 20) matrices. The values of each matrix are taken as pixels and the three matrices represent the three colors: red, green, and blue (RGB). Different combinations of g-gap matrices are used to generate distinct images, as given in Table 3. For example, the first image is generated by taking the DPC with g-gap = 0 as the red color matrix, the DPC with g-gap = 1 as the green color matrix, and the DPC with g-gap = 2 as the blue color matrix. Figure 3 shows that the values in all matrices are less than 1, and thus they could not be properly visualized in the image. To visualize the image properly, we replaced all the non-zero values of the DPC features with 255 (i.e., the largest pixel value). Figure 4 shows the procedure for converting a peptide to an image, using the first combination on the sample peptide. Using this procedure, we generated four image datasets, as shown in Figure 5.
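The image construction above can be sketched as follows. Only the first g-gap combination (0, 1, 2) is stated in the text; the other three triples in `GAP_COMBINATIONS` are a guess at the contents of Table 3, and all function names are hypothetical.

```python
# Assumed (red, green, blue) g-gap triples for the four images. Only the first
# triple is given in the text; the remaining three are an assumption.
GAP_COMBINATIONS = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]


def binarize(matrix):
    """Map every non-zero DPC value to 255 and keep zeros as 0."""
    return [[255 if value != 0 else 0 for value in row] for row in matrix]


def to_rgb_image(red, green, blue):
    """Stack three 20x20 DPC matrices into a 20x20 grid of (R, G, B) pixels."""
    r, g, b = binarize(red), binarize(green), binarize(blue)
    return [
        [(r[i][j], g[i][j], b[i][j]) for j in range(len(r[i]))]
        for i in range(len(r))
    ]


def images_for_peptide(dpc_by_gap):
    """Build one image per g-gap combination.

    dpc_by_gap maps each gap value (0..3) to that gap's 20x20 DPC matrix.
    """
    return [
        to_rgb_image(*(dpc_by_gap[gap] for gap in combo))
        for combo in GAP_COMBINATIONS
    ]


# Toy input: all-zero matrices with two arbitrary non-zero entries.
demo = {gap: [[0.0] * 20 for _ in range(20)] for gap in range(4)}
demo[0][7][12] = 0.25
demo[3][7][12] = 0.10
images = images_for_peptide(demo)
print(images[0][7][12])  # (255, 0, 0): gap 0 is the red channel of image 1
print(images[3][7][12])  # (0, 0, 255): gap 3 is the blue channel of image 4
```

Note that replacing every non-zero value with 255 discards the dipeptide frequencies and keeps only the presence or absence of each dipeptide.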

Training CNN Models
In this step, the convolutional neural network (CNN) is used, which is one of the most widely used approaches for image analysis [14]. It first extracts features from an image by applying convolutional layers, i.e., by sliding a number of kernels (filters) over it. The resulting feature maps are considered as features of the image, and pooling layers are then applied to reduce the dimensionality of each feature map without losing useful information. This is followed by a classification (dense) layer, which is a conventional artificial neural network [14].
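To make the two operations concrete, here is a small pure-Python illustration of one convolution (stride 1, no padding) followed by non-overlapping max pooling on a single-channel toy image. It is not the architecture of Table 4; the image, the kernel values, and the function names are assumptions chosen for illustration, and in a trained CNN the kernel values are learned.

```python
def conv2d_valid(image, kernel):
    """Slide a kernel over a 2D image (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


image = [
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
kernel = [[1, 0], [0, 1]]  # responds to diagonal neighbours

feature_map = conv2d_valid(image, kernel)  # 4x4
pooled = max_pool(feature_map)             # 2x2
print(feature_map)  # [[2, 0, 0, 2], [0, 2, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
print(pooled)       # [[2, 2], [0, 1]]
```

Pooling shrinks the 4-by-4 feature map to 2-by-2 while keeping the strongest response in each window, which is the dimensionality reduction the paragraph refers to.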
In the proposed approach, each image generated in the previous step (RGB image generation) is used as an input for the CNN. Four CNN models are generated for the classification of each sample peptide. The CNN architecture and parameters are defined in Table 4, and the SVM parameters are defined in Table 5. We take the prediction value of each of the four CNN models, which is 0 if the predicted class is negative and 1 if the predicted class is positive. The performance evaluation of the CNN models is presented in Table 6. Although the CNN models demonstrate reasonable accuracy, their performance was inferior to that of the benchmark approaches in [9] and [11]. Thus, we apply a boosted prediction approach that combines the CNN models with another promising classifier (SVM); the combined predictor outperforms the benchmark approaches. SVM has gained prominent success in bioinformatics, including in the classification of protein sequences [15,16]. The proposed boosting step is described in the following section.

Parameters      Values
Kernel          Radial Basis Function
Gamma           0.5
Constant (C)    2

Boosting Prediction by Ensembling an SVM Classifier Using a Combination of CNN Models' Outputs and AAC Features as Input
A new feature vector is generated by combining the outputs of the four CNN models with the 20 AAC features. In this way, we get a new feature vector of length 24 for each sample.
These 24 features are presented to the SVM for final classification. The parameters used for the SVM classifier are presented in Table 5. The performance evaluation is discussed in Section 3.1 and Table 7. The obtained results show that the proposed method using boosting outperforms the CNN models as well as the existing approaches.
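A minimal sketch of how the 24-dimensional input of the SVM could be assembled is given below. The function name and the ordering (CNN outputs first, then AAC) are assumptions; the text only states that the four binary CNN predictions and the 20 AAC features are combined.

```python
def boosted_feature_vector(cnn_predictions, aac_features):
    """Concatenate four binary CNN predictions with 20 AAC values.

    The ordering (CNN outputs first, AAC second) is an assumption.
    """
    if len(cnn_predictions) != 4:
        raise ValueError("expected one prediction from each of the four CNN models")
    if len(aac_features) != 20:
        raise ValueError("expected 20 AAC features")
    if any(p not in (0, 1) for p in cnn_predictions):
        raise ValueError("CNN predictions must be 0 (negative) or 1 (positive)")
    return [float(p) for p in cnn_predictions] + [float(v) for v in aac_features]


# Hypothetical values: three of four CNNs vote positive; AAC is uniform at 5%.
vector = boosted_feature_vector([1, 1, 0, 1], [5.0] * 20)
print(len(vector))  # 24
```

The text does not name the SVM library. If scikit-learn were used, vectors like this one could be passed to `sklearn.svm.SVC(kernel="rbf", gamma=0.5, C=2)`, which matches the kernel, gamma, and constant listed in the SVM parameter table.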

Computational Environments
The proposed approach was developed on Google Colaboratory (a cloud-based online Jupyter Notebook platform) using the Python programming language. We chose Google Colaboratory because we did not have to install Python and its libraries manually, and because it has a diverse range of Python libraries already installed on the cloud. It also provides a Graphics Processing Unit (GPU), which makes the training process relatively fast. For deep learning, we used the Python library Keras with the TensorFlow backend.

Results and Discussion
The experimental results of the proposed approach are discussed in this section. A comparative analysis with existing methods is also provided to demonstrate that the proposed approach outperforms the existing approaches.

Model Evaluation Results on Benchmarking and Independent Dataset
The proposed model is evaluated on both the benchmarking and independent datasets by using 10-fold cross-validation. Commonly used performance metrics are calculated: accuracy (ACC), sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC), and area under the curve (AUC). Four CNN models are trained on both datasets, and the results of the performance metrics are shown in Table 6. The outputs of the CNN models and the AAC features are then used as inputs to train the SVM model, which is our final predictor. Table 7 shows the performance of the SVM model on both datasets.
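For reference, the threshold-based metrics listed above can be computed from a confusion matrix using their standard definitions, as in the sketch below. The confusion-matrix counts are made-up illustration values, not results from the paper, and AUC is omitted because it requires ranked prediction scores rather than counts.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Sn, Sp, and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    acc = (tp + tn) / total
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

In a 10-fold cross-validation, such metrics are typically computed on each held-out fold and then averaged, or computed once on the pooled predictions of all folds; the text does not say which variant was used.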

Comparative Analysis and Discussion
The performance of this work is compared with three existing approaches, including AHTpin [9,10] and mAHTPred [11]. A detailed description of the existing approaches is provided in the Introduction. AHTpin consists of two prediction models: one is based on AAC features and the other is based on atomic composition features. Our results are compared with both of these models. mAHTPred achieved the highest accuracy among the previous techniques [11]. We compared our results with those of mAHTPred and the other two techniques by running them on the same datasets.
The comparison results show that the proposed boosted predictor outperforms the existing techniques, as shown in Table 8. The boosted predictor for AHTPs achieved a better performance on both datasets in terms of ACC, Sp, Sn, MCC, and AUC. The ACC and MCC results are approximately 9-21% higher than those of the previous approaches. The independent dataset was created by Manavalan et al. and we used it to check the robustness of mAHTPred [11]. Comparison results for the independent dataset are also provided in Table 8.

Conclusions
Hypertension is connected to numerous diseases such as cancer, heart attack, renal failure, and paralysis. Bioactive peptides are derived naturally, have antihypertensive activity, and are promising substitutes for pharmacological medicines. However, determining whether a peptide has antihypertensive characteristics is an expensive process. Bioinformatics can be used to build automated systems that identify antihypertensive peptides, and some solutions have already been proposed [9,11]. However, there is still room for performance improvement in such tools. In this paper, we present an automated antihypertensive peptide prediction system. It takes a peptide sequence as the input and predicts whether the given peptide is antihypertensive or not. The experimental and comparative results demonstrate that the proposed approach yields higher accuracies on standard datasets than the existing state-of-the-art approaches.
In the future, the proposed approach can be applied to other types of biological datasets. Exploration of other deep learning techniques, performance optimization, and the launching of a web service are also promising future directions.
Author Contributions: The idea conceptualization, data curation and methodology design were done by M.T.H. and S.M. Original draft was prepared by A.R. and this was proofread by A.K. Implementation was performed by A.R. and validation was done by A.K. Project administration was done by G.M. The review and editing were performed by G.M. and M.J. Funding was acquired by M.J. All authors have read and agreed to the published version of the manuscript.
Funding: This work was supported by GIST Research Project grant funded by the GIST in 2021. Support of UMT is also acknowledged.