In silico tool for predicting and designing Blood-Brain Barrier Penetrating Peptides

The blood-brain barrier is a major obstacle in treating brain-related disorders because it restricts the delivery of drugs to the brain. To facilitate drug delivery to the brain, we developed a method for predicting blood-brain barrier penetrating peptides (B3PPs). These peptides can act as therapeutic agents as well as drug delivery agents. We trained, tested, and evaluated our models on blood-brain barrier penetrating peptides obtained from the B3Pdb database. First, we computed a wide range of peptide features and then selected the relevant ones. Finally, we developed numerous machine learning-based models for predicting B3PPs using the selected features. Our random forest-based model performed best on the top 80 selected features and achieved a maximum accuracy of 85.08% with an AUROC of 0.93. We also developed a web server, B3Pred, that implements our best models. It has three major modules that allow users to i) predict B3PPs, ii) scan for B3PPs in a protein sequence, and iii) design B3PPs using analogs. Our web server and standalone software are freely available at https://webs.iiitd.edu.in/raghava/b3pred/.


Introduction
The blood-brain barrier (BBB) is the primary barrier between the brain's interstitial fluid and the blood. It connects the central nervous system (CNS) and the peripheral nervous system (PNS) [1][2][3][4]. The neurovascular unit (NVU) is the structural and functional unit of the BBB, formed by neurons, macrophages, endothelial cells, astrocytes, and pericytes [5] (Fig. 1). The NVU regulates the biochemical environment between the blood and the brain, which is essential for neural functions. The endothelial cells of the NVU allow the entry or exit of molecules such as glucose, amino acids, and proteins/peptides in the CNS [6][7][8].
In the last few decades, researchers have made many attempts to develop drug delivery systems that can deliver drugs in the brain. Despite advances made by the scientific community in developing drug delivery systems, it is still challenging to penetrate the BBB [9].
In the past, researchers have attempted to develop peptide/protein-based drug delivery vehicles. In this approach, a major challenge is to identify peptides that can penetrate the BBB [10]. In addition, researchers are also exploring peptide-based therapeutics to treat CNS-associated diseases, such as the neurodegenerative disorders Parkinson's disease and Alzheimer's disease [11,12], and glioblastoma [13]. This means that peptides can be used as therapeutic agents as well as drug delivery vehicles. In recent studies, numerous peptides such as shuttle peptides [14], self-assembled peptides [15], and peptide-decorated nanoparticles [16] have been used for efficient drug delivery to the brain. Some neuropeptides are utilized as potential therapeutic targets against many neurological diseases such as epilepsy [17,18], depression [19,20], and neuroimmune disorders [21]. Due to their low toxicity, these peptides may act as potential peptide-based drug candidates against neurological diseases. The major limitations of these peptide-based drugs are their low bioavailability, short half-life [22], and limited penetration of the BBB [23].
For example, tumor homing peptides (THPs) [24] and cell-penetrating peptides (CPPs) [25] can be used as drug delivery vehicles [26,27]. The tumor homing peptides need a carrier to cross the BBB, while selected CPPs can directly pass through the BBB [28].
Insert Fig. 1 here

Fig. 1 Representation of the Blood-Brain Barrier and B3PPs to cross into CNS
CPPs are short peptides that act as molecular delivery vehicles and are able to deliver various therapeutic molecules inside a cell [29][30]. CPPs that can also cross the blood-brain barrier are called blood-brain barrier penetrating peptides (B3PPs). These B3PPs can be used to deliver several cargo molecules (e.g., peptides/proteins, siRNA, plasmid DNA) to the brain [31][32][33][34]. These peptides are mainly obtained from naturally occurring proteins/peptides such as signal peptides, RNA/DNA-binding proteins, viral proteins, and antimicrobial peptides [35]. Several studies have shown that B3PPs may be synthesized chemically or designed with rDNA technology [36][37][38] to enhance their stability and half-life [39]. In the past, several methods have been developed for predicting cell-penetrating peptides, such as cellPPD [40], SkipCPP-Pred [41], CPPred-RF [42], KELM-CPPpred [43], CellPPDMod [44], and CPPred-FL [45]. In addition, various methods have been developed for predicting chemical-based drug delivery vehicles that cross the blood-brain barrier [46][47][48]. In contrast, only limited attempts have been made to develop methods to predict B3PPs.
In this study, we have developed a computational tool named "B3Pred" for predicting B3PPs with high reliability and precision. This method is able to classify B3PPs/non-B3PPs and CPPs/B3PPs and uses a large dataset for training and validation. We used three datasets, i.e., Dataset_1 (269 B3PPs and 269 CPPs), Dataset_2 (269 B3PPs and 269 non-B3PPs), and Dataset_3 (269 B3PPs and 2690 non-B3PPs) for training and validation. We have used more than 9000 descriptors/features to generate prediction models using several machine learning techniques such as RF, DT, LR, XGB, SVM, and GBM. Further, in order to serve the scientific community working in this area, we have provided a web server and a standalone package, which are freely available at https://webs.iiitd.edu.in/raghava/b3pred/.

Dataset Collection
In this study, we collected 465 blood-brain barrier penetrating peptides (B3PPs) from the B3Pdb database (https://webs.iiitd.edu.in/raghava/b3pdb/). We considered B3PPs with a length of more than five amino acid (AA) residues and less than or equal to 30 AA residues.
For the positive dataset, we obtained 269 unique B3PPs. The major challenge in this type of study is to generate an authenticated negative dataset. We used three negative datasets in this study. Firstly, we collected 269 unique cell-penetrating peptides (CPPs) [50] other than B3PPs and called them non-B3PPs, or negative dataset-1. For negative dataset-2, we randomly generated 269 non-B3PPs from the Swiss-Prot database [51]. Our third negative dataset is ten times the size of the positive dataset, i.e., 2690 unique non-B3PPs randomly generated using the Swiss-Prot database. Finally, we obtained three datasets, i.e., Dataset_1 (269 B3PPs and 269 CPPs), Dataset_2 (269 B3PPs and 269 non-B3PPs), and Dataset_3 (269 B3PPs and 2690 non-B3PPs).
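The length filtering, de-duplication, and random negative sampling described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names, the uniform choice of fragment length and position, and the retry limit are all hypothetical, and the input lists stand in for peptides from B3Pdb and protein sequences from Swiss-Prot.

```python
import random


def filter_positives(peptides, min_len=6, max_len=30):
    """Keep unique peptides longer than five and at most 30 residues."""
    seen = set()
    kept = []
    for pep in peptides:
        pep = pep.strip().upper()
        if min_len <= len(pep) <= max_len and pep not in seen:
            seen.add(pep)
            kept.append(pep)
    return kept


def sample_negatives(proteins, positives, n, rng, min_len=6, max_len=30):
    """Draw n unique random protein fragments that are not known positives.

    Assumption: fragment length and start position are chosen uniformly.
    """
    excluded = set(positives)
    usable = [p for p in proteins if len(p) >= min_len]
    if not usable:
        raise ValueError("no protein is long enough to sample from")
    negatives = set()
    attempts = 0
    while len(negatives) < n:
        attempts += 1
        if attempts > 100 * n + 1000:
            raise RuntimeError("could not find enough distinct fragments")
        protein = rng.choice(usable)
        length = rng.randint(min_len, min(max_len, len(protein)))
        start = rng.randint(0, len(protein) - length)
        fragment = protein[start:start + length]
        if fragment not in excluded:
            negatives.add(fragment)
    return sorted(negatives)
```

Under these assumptions, a balanced set such as Dataset_2 would use `n = len(positives)`, and a 1:10 set such as Dataset_3 would use `n = 10 * len(positives)`.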

Amino Acid Composition
Amino acid composition (AAC) analysis of peptides helps us find out whether there are compositional similarities or differences between datasets. We compared the amino acid composition of B3PPs, CPPs, and the randomly generated peptides of the negative datasets. AAC is calculated with the following equation:
$$\mathrm{AAC}(i) = \frac{\mathrm{AAR}_i}{\mathrm{TNR}} \times 100$$
where AAC(i) is the percentage composition of amino acid i, AAR_i is the number of residues of type i, and TNR is the total number of residues in the peptide [52].
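As a small worked illustration of the AAC definition, the sketch below computes the percentage of each of the 20 standard amino acids in a single peptide. The function name and the restriction to the 20 standard residues are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(peptide):
    """Return AAC(i) = 100 * AAR_i / TNR for each standard amino acid."""
    peptide = peptide.upper()
    total = len(peptide)  # TNR: total number of residues
    if total == 0:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)  # AAR_i: number of residues of type i
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}
```

For the peptide `RRGY`, this gives 50% arginine and 25% each for glycine and tyrosine. Averaging these per-peptide vectors over a dataset would give the kind of dataset-level profile that is compared across B3PPs, CPPs, and random peptides.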

Two Sample Logo
The Two Sample Logo (TSL) tool [53] was used in this study to identify the amino acid preference at specific positions in the peptide sequences. This tool needs input amino acid sequences of fixed length. Since the minimum size of peptides in all datasets is five residues, we selected five residues from the N-terminus and five residues from the C-terminus of each peptide sequence. To create a fixed-length input, the N-terminal and C-terminal residues were joined together to generate a sequence of 10 amino acid residues. We used the 10-residue sequences generated from our dataset peptides to develop the TSL. To build the two sample logos, we used all B3PPs and the non-B3PPs of the three different negative datasets.
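The construction of the fixed-length 10-residue input can be sketched in a few lines. The function name is hypothetical; the only behaviour assumed is the one described above, namely joining the first five and the last five residues.

```python
def terminal_window(peptide, n=5):
    """Join the first n and last n residues into one fixed-length string.

    For peptides shorter than 2 * n the two halves overlap, so a
    five-residue peptide contributes the same residues twice.
    """
    if len(peptide) < n:
        raise ValueError(f"peptide must have at least {n} residues")
    return peptide[:n] + peptide[-n:]
```

Every output has length `2 * n`, which is what a position-wise comparison between positive and negative sets requires.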

Generation of Peptide Features
In order to calculate a wide range of features from the protein or peptide sequences, we used the Pfeature package [54], which can generate thousands of features/descriptors.

Feature Selection
In this study, we used the SVC-L1 feature selection technique to extract an essential set of features from all the datasets. We chose the SVC-L1 method because it is much faster than other feature selection methods [55]. This method applies the L1 penalty and selects the relevant set of features as those with non-zero coefficients. SVC-L1 mainly considers regularization and the loss function. During the optimization process, the L1 regularization sets the coefficients of less informative features to zero, so only the features with non-zero coefficients are retained.
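A minimal sketch of L1-penalised linear SVC feature selection with scikit-learn is given below. It is not the exact configuration used in the study: the value of `C`, the standardisation step, and the function name are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def select_features_l1(X, y, C=0.05):
    """Return the column indices whose L1-penalised SVC coefficient is non-zero."""
    # Assumption: features are standardised so the penalty treats them equally.
    X_scaled = StandardScaler().fit_transform(X)
    # penalty="l1" requires the primal formulation, hence dual=False.
    svc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=5000, random_state=0)
    svc.fit(X_scaled, y)
    return np.flatnonzero(svc.coef_.ravel() != 0)
```

Smaller values of `C` strengthen the penalty and keep fewer features, so `C` is the knob that would control how many of the descriptors survive this step.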

Feature Ranking
After selecting an important set of features, we ranked them according to their importance for classification using the Feature-selector method. Feature-selector is based on a decision tree-like algorithm and uses the Light Gradient Boosting Machine (LightGBM) [57]. It ranks each feature on the basis of the number of times the feature is used to split the data across all the trees. The top-ranked features for each dataset were then used in different machine learning techniques for the classification of B3PPs and non-B3PPs.

Machine Learning Techniques
In order to classify B3PPs and non-B3PPs, we used several machine learning algorithms. In this study, we mainly implemented Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), k-nearest neighbors (KNN), Gaussian Naive Bayes (GNB), XGBoost (XGB), and Support Vector Classifier (SVC) machine learning classifiers. The different classification methods were implemented with the help of a Python-based library known as Scikit-learn [58].
DT algorithms are non-parametric supervised learning models. The aim of the classifier is to predict the output for an instance by learning decision rules from the input data [59]. The GNB method is a probabilistic classifier based on Bayes' theorem. It assumes that the continuous variables of every class follow a Gaussian (normal) distribution [60]. Random forest is an ensemble-based classifier that trains a number of decision trees and combines their outputs into a single prediction. It also controls overfitting of the models [61]. The LR technique fits a logistic (logit) model, which gives the likelihood of an event happening. It applies a logistic function to predict the response variable or the occurrence of a class [62]. The KNN method is an instance-based classifier. It stores the instances of the training dataset, and its prediction is based on the majority vote among the nearest neighbors of a data point [63]. The XGB classifier uses a scalable tree boosting algorithm, in which an iterative approach is used to predict the final output [64]. SVC is built on the library for support vector machines (libsvm). It fits the data points provided as input features and finds the best-fitting hyperplane that separates the data into two classes [65].

Cross-validation Techniques
In order to train, test, and evaluate our classification models, we used 5-fold cross-validation and external validation techniques. Over the last few decades, several classification and prediction methods have used an 80:20 split of the complete dataset for training and validation purposes [66,67]. In the current study, we implemented a similar strategy to evaluate our classification models. For each dataset, 80% of the data is used for training, and the remaining 20% is used for external validation. Further, we apply the 5-fold cross-validation technique on the training dataset. The training data is divided equally into five sets/folds, of which four folds are used for training and the fifth fold is used for testing the model. This process is repeated five times so that each fold is used once for testing the model. The final performance is computed by averaging the performance over the five folds.
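The splitting scheme described above can be sketched with the standard library alone. The helper names and the fixed seed are assumptions, and the sketch does not stratify by class, which the text neither states nor rules out.

```python
import random


def train_validation_split(items, validation_fraction=0.2, seed=0):
    """Shuffle once, then hold out a fraction for external validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(train_items, evaluate, k=5):
    """Average a caller-supplied score, evaluate(fold_train, fold_test), over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(train_items), k):
        fold_train = [train_items[i] for i in train_idx]
        fold_test = [train_items[i] for i in test_idx]
        scores.append(evaluate(fold_train, fold_test))
    return sum(scores) / len(scores)
```

The external validation set is never passed to `cross_validate`, which keeps it independent of model selection on the training folds.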

Performance Evaluation Parameters
We used standard evaluation parameters to compute the performance of classification models. Threshold-dependent and independent parameters were used in this study. The performance of the models is calculated by threshold-dependent parameters, such as sensitivity (Sens), accuracy (Acc), and specificity (Spec).
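The threshold-dependent parameters named above follow directly from the confusion matrix. The sketch below uses their standard definitions, expressed as percentages; the function name, the 0.5 default threshold, and the convention that label 1 marks a B3PP are assumptions for this example.

```python
def threshold_metrics(y_true, scores, threshold=0.5):
    """Compute sensitivity, specificity, and accuracy (in percent)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "Sens": 100.0 * tp / (tp + fn),  # true positive rate
        "Spec": 100.0 * tn / (tn + fp),  # true negative rate
        "Acc": 100.0 * (tp + tn) / (tp + tn + fp + fn),
    }
```

On an imbalanced set such as a 1:10 positive-to-negative ratio, accuracy alone can look high while sensitivity is poor, which is one reason to report all three together.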

Web server Implementation
We have developed a web server named "B3Pred" (https://webs.iiitd.edu.in/raghava/b3pred/) to distinguish blood-brain barrier penetrating peptides from non-B3PPs. We used HTML5, JAVA, CSS3, and PHP scripts to develop the front-end and back-end of the web server. The B3Pred server is compatible with modern devices such as mobile, tablet, iMac, and desktop. It mainly incorporates predict, design, and protein scan modules.

Amino Acid Composition Analysis
Amino acid composition analysis shows the percentage composition of each amino acid in the datasets mentioned above. The compositional differences are clearly visible in the graph of percentage AAC for each dataset (Fig. 2). Arginine content is highest in CPPs and B3PPs, which suggests that it plays a role in the penetration of peptides into cells. Tyrosine, an aromatic amino acid, is more abundant in B3PPs than in the other datasets. Notably, proline and glycine are also more prevalent in B3PPs than in the other datasets.
Insert Fig. 2 here

Amino Acid Position Analysis
The positional preference of amino acids is shown in Fig. 3, which was generated with the Two Sample Logo software. The preferred positions of amino acids can be seen in the figure, which helps in understanding and designing B3PPs of research interest. Tyrosine, glycine, and arginine are the most highly preferred amino acids in the first three positions in B3PPs. The two sample logo suggests that tyrosine, glycine, arginine, and lysine are preferred throughout the 10-residue sequences of B3PPs. Hence, these amino acids appear to be important for both the composition and the positional preferences of B3PPs.
Insert Fig. 3 here

B3PPs Prediction Methods on different datasets
B3PPs prediction methods were prepared using various machine learning techniques. After selecting features on all the datasets, we applied the different machine learning methods, analysed the output on each dataset, and interpreted the results (Table 3).
Insert Table 3 here

Results of various machine learning methods on Dataset_3
We also plotted the AUROC curves for the final dataset, i.e., Dataset_3. The best-performing methods were selected for the AUROC plot. It is clear from the AUROC plot that all the methods performed well on the training and validation datasets except GNB and DT (Fig. 4).
Insert Fig. 4 here
One of the major objectives of this study is to help the scientific community discover B3PP-based drug delivery vehicles that can deliver cargo to brain tissues. Thus, we developed standalone software as well as a web-based service to assist researchers in finding new B3PPs or designing efficient B3PPs. Our web server B3Pred has three major modules, namely Predict, Design, and Scan. The Predict module of B3Pred allows users to predict B3PPs in a set of sequences submitted by the user. It allows users to select models developed on any dataset used in this study. The Design module of B3Pred was developed to discover the most promising B3PPs for a given peptide. This module first generates all possible analogs of a peptide and then predicts the score for each analog. It also allows users to sort analogs by score to select the best analog of a peptide. The Protein Scan module makes it possible to identify B3PP regions in the user's query protein. It allows the user to select the length of the peptide segment to be scanned in the submitted protein sequence. In addition to the web-based service, we also developed standalone software for searching B3PPs at a large scale, such as searching for B3PPs at the genome level.
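The logic of the Design and Scan modules can be sketched as follows. This is an illustration only, not the B3Pred implementation: it assumes that "analogs" are single-residue substitutions, the function names are hypothetical, and `toy_score` is a placeholder that stands in for the trained model.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitution_analogs(peptide):
    """All peptides that differ from the input at exactly one position."""
    analogs = []
    for pos, original in enumerate(peptide):
        for aa in AMINO_ACIDS:
            if aa != original:
                analogs.append(peptide[:pos] + aa + peptide[pos + 1:])
    return analogs


def rank_analogs(peptide, score_fn):
    """Score every analog and sort from highest to lowest score."""
    scored = [(a, score_fn(a)) for a in single_substitution_analogs(peptide)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def scan_protein(protein, window, score_fn):
    """Score every overlapping segment; positions are reported 1-based."""
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the protein length")
    return [
        (start + 1, protein[start:start + window], score_fn(protein[start:start + window]))
        for start in range(len(protein) - window + 1)
    ]


def toy_score(peptide):
    """Placeholder score (fraction of R, K, Y, G); NOT the B3Pred model."""
    return sum(peptide.count(aa) for aa in "RKYG") / len(peptide)
```

A peptide of length L built from standard residues yields 19 × L single-substitution analogs, so the design step stays cheap even for 30-residue peptides, while a protein of length N scanned with window w yields N − w + 1 segments.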
Insert Fig. 5 here

Comparison with the existing method
In order to understand the benefits and drawbacks of a new method, it is crucial to compare it with existing methods. Many methods have been developed in the past to predict the BBB-penetrating potential of chemical compounds. To the best of our knowledge, [...] and scan facility. In addition, B3Pred is also available as standalone software so that users can run it on their local machines at a large scale.

Discussion & Conclusion
The blood-brain barrier (BBB) is the natural guard of the brain, which prevents unwanted molecules from entering brain tissue [68]. Unfortunately, brain-related or neurological diseases have increased tremendously in the last few decades. There is a need for drugs to treat neurological disorders such as Alzheimer's disease, Parkinson's disease, and neuroinflammation. Due to advancements in technology, researchers are able to discover drugs that treat these disorders in vitro. One of the major hurdles in treating brain-associated diseases is delivering drugs to brain tissue, as the blood-brain barrier prevents these drug molecules from reaching it [69]. The transport of therapeutic molecules across the barriers of the brain is the bottleneck in treating brain tumors and CNS diseases [70].
Several in silico methods have been developed to predict and improve the delivery of therapeutic molecules across the BBB. Like other therapeutic molecules, blood-brain barrier penetrating peptides (B3PPs) have a significant role in neurological disorders. A study has shown that D-Ala-Peptide T-amide (DAPTA), or peptide T, is an antiviral peptide that can cross the blood-brain barrier. Intranasal peptide T is obtained from the envelope protein of the human immunodeficiency virus (HIV). The peptide shows antiviral properties, usually inhibits the chemokine (CCR5) receptors, and also acts as a B3PP [71,72]. Researchers have reported that AH-D is an amphipathic α-helical BBB-penetrating peptide that acts as a therapeutic agent against deadly viruses. It is used as a direct antiviral agent (DAA) to inhibit specific viral proteins. A recent study suggests that AH-D is a potential antiviral against deadly viruses such as chikungunya, Zika, dengue, and yellow fever viruses, with different inhibitory and cytotoxic concentrations [73][74][75][76]. These studies indicate that such peptides can be helpful in viral infections, including when neurological complications arise due to viruses. These peptides can be used as therapeutic substitutes for antiviral drugs that are unable to cross the BBB. This may help in controlling the neurological complications that arise due to COVID-19 [77].
In the present scenario, there is a pressing need for an efficient prediction tool that can accurately predict peptides capable of penetrating the blood-brain barrier. To facilitate researchers working in this area, we proposed a method named B3Pred for predicting B3PPs. We used more than 9000 descriptors to build the prediction models. The RF-based model achieved maximum AUROCs of 0.93 and 0.90 on the training and validation datasets, respectively. We have also developed a free web server named B3Pred, which incorporates modules for prediction, design, and scanning to analyze and design the desired B3PPs. We believe that our method will help in the accurate prediction of B3PPs and aid the scientific community working in this area.
There is no conflict of interest.