RoFDT: Identification of Drug–Target Interactions from Protein Sequence and Drug Molecular Structure Using Rotation Forest

Simple Summary: Determining drug–target relationships is key to modern drug development and plays a crucial role in research on drug side effects and individualized treatment. However, traditional identification of drug targets by biological experiments is often difficult to carry out because of limitations in precision, throughput and cost. With the rapid development of bioinformatics and computational biology, computer-assisted prediction of drug–target interactions (DTIs) has attracted great attention from researchers as an accurate and quick means of drug target recognition. In this study, a machine learning method for predicting DTIs is developed that combines protein sequence information with drug molecular structure information, with the aim of narrowing down targets and saving costs in new drug research.

Abstract: As the basis for screening drug candidates, the identification of drug–target interactions (DTIs) plays a crucial role in innovative drug research. However, because wet-lab experiments are inherently small in scale and time-consuming, DTI identification is usually difficult to carry out. In the present study, we developed a computational approach called RoFDT that predicts DTIs by combining the feature-weighted Rotation Forest (FwRF) with protein sequence information. In particular, we first encode protein sequences as numerical matrices using the Position-Specific Score Matrix (PSSM), then extract their features using the Pseudo Position-Specific Score Matrix (PsePSSM) and combine them with drug structure information in the form of molecular fingerprints, and finally feed them into the FwRF classifier. We validated the performance of RoFDT on the Enzyme, GPCR, Ion Channel and Nuclear Receptor datasets. On these datasets, RoFDT achieved 91.68%, 84.72%, 88.11% and 78.33% accuracy, respectively. RoFDT compares favorably with support vector machine models and previously proposed approaches. Furthermore, 7 of the 10 DTIs with the highest RoFDT prediction scores were confirmed by a relevant database. These results demonstrate that RoFDT can be employed as a powerful predictive approach for DTIs that provides theoretical support for innovative drug discovery.


Introduction
A critical step in innovative drug development is determining the interactions between drugs and targets, which is the forerunner of drug design [1,2]. Drugs act in the human body by interacting with their targets, among which proteins are an essential class. By inhibiting or enhancing the function of the target protein, the drug achieves the goal of treating the disease. Although the advent of high-throughput sequencing methods has provided technical support for determining DTIs and extensive efforts have been made by drug developers, only a few new drugs are approved by the Food and Drug Administration (FDA) for marketing each year [3][4][5][6]. The main reason is that identifying DTIs by wet-lab experiments alone consumes a lot of time and money, and the scale of identification is small. With the development of computational biology, this situation can be greatly alleviated. Computer-aided prediction of DTIs can be executed rapidly at scale, providing reliable candidate drug targets for biological experiments and theoretical support for new drug development [2,4,7,8,9].
To date, numerous computer-aided models for predicting DTIs have been devised, and they can be roughly classified into two groups: network-based approaches and machine learning-based approaches [10][11][12]. Network-based approaches typically characterize the associations between targets and drugs as a heterogeneous network and predict DTIs by evaluating the similarity of nodes in the network topology. For instance, the SDTNBI model designed by Wu et al. [13] predicts DTIs in unknown network space using DTI networks and drug and entity substructure linkages. Chu et al. [14] proposed a DTI prediction method called DTI-CDF. This method extracts similarity features not only between drugs but also between target proteins from a heterogeneous graph, which greatly improves the prediction performance. Zhang et al. [15] designed a DTI prediction method based on LPLNI, which makes use of neighborhood reconstruction data. Chu et al. [16] proposed DTI-MLCD, which introduces community detection to facilitate multilabel classification for DTI prediction. The method performs significantly better than other machine learning methods and other existing methods on the updated gold standard dataset. Zong et al. [17] used DeepWalk combined with target–target and drug–drug similarities to predict DTIs with the support of a Linked Tripartite Network (LTN) and related biomedical data. Machine learning-based approaches mainly extract data features computationally and combine them with a classifier to predict DTIs [18,19]. For example, Peng et al. [20] used a semi-supervised inference method to predict DTIs by combining a PCA-based convex optimization algorithm with information about drug targets.
On the basis of the hypothesis that chemically similar drugs have similar bioactivity, DTI prediction methods that combine target protein information with drug structure information have achieved excellent results. Therefore, this paper designs a machine learning approach for predicting DTIs according to this hypothesis. Specifically, we first extracted protein sequence features using the Pseudo Position-Specific Score Matrix (PsePSSM) method, then fused them with drug molecular fingerprint descriptors and finally predicted interacting drug–target pairs with the feature-weighted rotation forest (FwRF) classifier. We tested the performance of RoFDT on the Enzyme, GPCR, Ion Channel and Nuclear Receptor datasets and compared it with other feature extraction approaches, other classifiers and previous methods. The results demonstrate that the proposed model has an excellent ability to identify DTIs. The framework of RoFDT is shown in Figure 1.

Gold Standard Datasets
In the present study, we validated the performance of RoFDT on four gold standard datasets: Enzyme, Ion Channel, GPCR and Nuclear Receptor. These data were collected from the SuperTarget & Matador [21], KEGG BRITE [22], BRENDA [23] and DrugBank [24] databases by Yamanishi et al. [25]. The numbers of (drugs, targets) in the Enzyme, Ion Channel, GPCR and Nuclear Receptor datasets are (445, 664), (210, 204), (233, 95) and (54, 26), respectively, and the numbers of interacting drug–target pairs (positive samples) are 2926, 1476, 635 and 90, respectively [26]. We describe the DTI network as a bipartite graph in which targets and drugs are represented by nodes and their associations by edges. To construct a balanced dataset, we randomly select the same number of negative samples as positive samples from the drug–target pairs with no known interaction.
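The balanced sampling described above can be sketched as follows. This is a minimal illustration, assuming that every drug–target pair without a known interaction is an eligible negative; the function name `sample_negatives`, the integer indexing of drugs and targets, and the toy data are hypothetical and not taken from the original implementation.

```python
import random


def sample_negatives(n_drugs, n_targets, positives, seed=0):
    """Randomly draw as many non-interacting (drug, target) pairs as there are positives."""
    positive_set = set(positives)
    # Assumption: any pair that is not a known interaction is a candidate negative.
    candidates = [
        (d, t)
        for d in range(n_drugs)
        for t in range(n_targets)
        if (d, t) not in positive_set
    ]
    rng = random.Random(seed)
    # Sampling without replacement keeps the negative pairs distinct.
    return rng.sample(candidates, len(positive_set))


# Toy example: 4 drugs, 3 targets and 3 known interactions (hypothetical data).
positives = [(0, 0), (1, 2), (3, 1)]
negatives = sample_negatives(4, 3, positives)
```

Because unlabeled pairs are treated as negatives, some sampled "negatives" may in fact be undiscovered interactions; this is a known caveat of this kind of sampling rather than something specific to the sketch.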

Drug Molecular Descriptor
Drug molecular fingerprinting is widely used to characterize drug compounds because it directly represents the association between molecular properties and structure and does not require three-dimensional structural information. A molecular fingerprint organizes molecular substructures in a dictionary. For a particular molecule, the position corresponding to a substructure in the dictionary is set to 1 when the molecule contains that substructure and to 0 otherwise. In this way, the fingerprint descriptor of a given drug molecule can be constructed. We used molecular fingerprints from PubChem in this study, and the fingerprint property is "PUBCHEM_CACTVS_SUBGRAPHKEYS". This fingerprint covers 881 substructures, so the fingerprint feature descriptor of each compound is 881-dimensional.
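The dictionary strategy can be illustrated with a small sketch. The five-entry substructure dictionary below is purely hypothetical and stands in for the 881 PubChem substructure keys; the function name `encode_fingerprint` is likewise an assumption for illustration.

```python
def encode_fingerprint(present_substructures, dictionary):
    """Return a binary vector with a 1 at each dictionary position the molecule contains."""
    position = {name: i for i, name in enumerate(dictionary)}
    bits = [0] * len(dictionary)
    for name in present_substructures:
        bits[position[name]] = 1
    return bits


# Hypothetical 5-entry dictionary; the PubChem fingerprint uses 881 entries.
SUBSTRUCTURES = ["C=O", "aromatic ring", "N-H", "O-H", "C#N"]
fingerprint = encode_fingerprint({"C=O", "O-H"}, SUBSTRUCTURES)
```

With the full PubChem dictionary, the same procedure would yield the 881-dimensional descriptor used as the drug feature vector.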

Target Protein Descriptor
In this study, the Position-Specific Score Matrix (PSSM) [27] was used to generate descriptors of protein sequences. The PSSM of a protein can be characterized as $S = \{\sigma_{i,j} : i = 1, \ldots, L \text{ and } j = 1, \ldots, 20\}$, which is an $L \times 20$ matrix in which $L$ is the length of the sequence and 20 is the number of amino acid types. Therefore, $S$ is written as shown below:

$$S = \begin{bmatrix} \sigma_{1,1} & \sigma_{1,2} & \cdots & \sigma_{1,20} \\ \sigma_{2,1} & \sigma_{2,2} & \cdots & \sigma_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{L,1} & \sigma_{L,2} & \cdots & \sigma_{L,20} \end{bmatrix} \quad (1)$$

where $\sigma_{i,j}$ indicates the probability that the $i$th residue of the protein is mutated into the $j$th amino acid during evolution. We obtained the PSSM with Position-Specific Iterated BLAST (PSI-BLAST) against the SwissProt dataset [28,29]. For each residue, PSI-BLAST calculates a vector describing the mutational conservation of the 20 amino acids. To obtain broadly and highly homologous proteins, the e-value and the number of iterations were set to 0.001 and 3, respectively.
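As a rough sketch of how such a PSSM might be generated, the helper below assembles a command line for the `psiblast` program of NCBI BLAST+ with the settings quoted above. The file names and the database name are hypothetical. It is also an assumption that the reported e-value corresponds to the `-evalue` option; PSI-BLAST additionally has a separate inclusion threshold (`-inclusion_ethresh`), and the text does not say which one was meant.

```python
def build_psiblast_command(query_fasta, database, pssm_out, evalue=0.001, iterations=3):
    """Assemble (but do not run) a PSI-BLAST call that writes an ASCII PSSM file."""
    return [
        "psiblast",
        "-query", query_fasta,          # single protein sequence in FASTA format
        "-db", database,                # name of a formatted protein database
        "-evalue", str(evalue),         # assumed to be the e-value quoted in the text
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,    # human-readable PSSM of the last iteration
    ]


# Hypothetical file and database names, for illustration only.
command = build_psiblast_command("target.fasta", "swissprot", "target.pssm")
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where BLAST+ and a formatted SwissProt database are installed.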

Protein Feature Extraction
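For better compatibility with the PSSM matrix, we extracted the latent features of proteins using the PsePSSM designed by Chou et al. [30]. The PSSM of a protein $P$ of length $L$ can be denoted as below:

$$P_{\mathrm{PSSM}} = \begin{bmatrix} e^{0}_{1,1} & e^{0}_{1,2} & \cdots & e^{0}_{1,20} \\ e^{0}_{2,1} & e^{0}_{2,2} & \cdots & e^{0}_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ e^{0}_{L,1} & e^{0}_{L,2} & \cdots & e^{0}_{L,20} \end{bmatrix} \quad (2)$$

where $e^{0}_{i,j}$ is the score calculated by PSI-BLAST, which can be positive or negative. A positive score indicates that the corresponding mutation occurs in the protein sequence more frequently than expected by chance, and a negative score indicates that it occurs less frequently. However, since protein sequences of different lengths yield matrices with different numbers of rows, we need to convert them to a uniform pattern using the following equation:

$$\bar{P}_{\mathrm{PSSM}} = \begin{bmatrix} \bar{e}_{1} & \bar{e}_{2} & \cdots & \bar{e}_{20} \end{bmatrix}^{T} \quad (3)$$

and:

$$\bar{e}_{j} = \frac{1}{L} \sum_{i=1}^{L} e^{0}_{i,j}, \quad j = 1, 2, \ldots, 20 \quad (4)$$

where $\bar{e}_{j}$ represents the average score with which the residues of protein $P$ evolve into a $j$-type amino acid. To prevent the sequence-order information of protein $P$ from being lost in this averaging, we extended the representation by constructing pseudo-amino acid components, which are described as follows:

$$\theta_{j}^{\xi} = \frac{1}{L - \xi} \sum_{i=1}^{L - \xi} \left( e^{0}_{i,j} - e^{0}_{i+\xi,j} \right)^{2}, \quad j = 1, 2, \ldots, 20; \ \xi < L \quad (5)$$

where $\theta_{j}^{\xi}$ is the correlation factor of $j$-type amino acids between residues separated by $\xi$ positions along the sequence.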
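The following NumPy sketch shows one way to compute such a fixed-length PsePSSM descriptor from an L × 20 score matrix. It follows the commonly used PsePSSM formulation (column averages plus squared differences between residues separated by a lag); the function name `pse_pssm`, the lag name `xi` and the choice to concatenate both parts into a 40-dimensional vector are assumptions for illustration, not a description of the authors' code.

```python
import numpy as np


def pse_pssm(pssm, xi):
    """Map an L x 20 PSSM to a 40-dimensional vector that is independent of L."""
    pssm = np.asarray(pssm, dtype=float)
    length = pssm.shape[0]
    if not 0 < xi < length:
        raise ValueError("the lag xi must satisfy 0 < xi < L")
    # Average score of each of the 20 amino acid columns over the whole sequence.
    mean_scores = pssm.mean(axis=0)
    # Correlation factors: mean squared difference between rows i and i + xi.
    differences = pssm[:-xi] - pssm[xi:]
    theta = (differences ** 2).mean(axis=0)
    return np.concatenate([mean_scores, theta])


# Toy PSSM of length 5 in which every column holds the scores 0, 1, 2, 3, 4.
toy_pssm = np.arange(5, dtype=float)[:, None] * np.ones((1, 20))
descriptor = pse_pssm(toy_pssm, xi=1)
```

Proteins of different lengths produce descriptors of the same size, which is what allows them to be fused with the fixed-length drug fingerprint.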

Classification Prediction
In our study, DTI feature descriptors are classified by FwRF. This classifier increases the weight of informative features and removes noisy ones, which can effectively improve the prediction accuracy. FwRF uses the χ² statistic to obtain the weights of different features. For a feature $F$ and the class variable $C$, the formula is as follows:

$$\chi^{2} = \sum_{i} \sum_{j} \frac{\left( Y_{ij} - \beta_{ij} \right)^{2}}{\beta_{ij}} \quad (6)$$

where the sums run over the values $v_i$ of feature $F$ and the classes $f_j$ of $C$, and $Y_{ij}$ is the number of samples of class $f_j$ whose feature value is $v_i$, counted as follows:

$$Y_{ij} = \mathrm{count}(F = v_i \ \text{and} \ C = f_j) \quad (7)$$

$\beta_{ij}$ is the expected count for $v_i$ and $f_j$, and it can be denoted as below:

$$\beta_{ij} = \frac{\mathrm{count}(F = v_i) \times \mathrm{count}(C = f_j)}{N} \quad (8)$$

where $N$ is the total sample size, $\mathrm{count}(F = v_i)$ is the number of samples whose value of feature $F$ is $v_i$, and $\mathrm{count}(C = f_j)$ is the number of samples whose class is $f_j$. FwRF first calculates the weights of the features using the χ² statistic, then sorts them in descending order and removes the low-weight features according to a selection parameter, and finally uses the resulting feature set for classification.
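The feature weighting step can be sketched as below. The code computes the χ² statistic of each feature against the class label from observed and expected counts and keeps the highest-weighted fraction of features. It assumes discrete feature values (continuous features such as PsePSSM scores would first need to be binned), and the names `chi_square_weight`, `select_features` and `keep_ratio` are hypothetical.

```python
from collections import Counter


def chi_square_weight(feature_values, labels):
    """Chi-square statistic between one discrete feature and the class labels."""
    n = len(labels)
    value_counts = Counter(feature_values)                 # count(F = v_i)
    class_counts = Counter(labels)                         # count(C = f_j)
    joint_counts = Counter(zip(feature_values, labels))    # observed counts
    chi2 = 0.0
    for value, n_value in value_counts.items():
        for cls, n_class in class_counts.items():
            expected = n_value * n_class / n
            observed = joint_counts[(value, cls)]
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def select_features(X, y, keep_ratio):
    """Rank features by chi-square weight and keep the top fraction."""
    n_features = len(X[0])
    weights = [chi_square_weight([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: weights[j], reverse=True)
    n_keep = max(1, round(keep_ratio * n_features))
    return ranked[:n_keep], weights


# Toy data: feature 0 determines the class, feature 1 is independent of it.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
kept, weights = select_features(X, y, keep_ratio=0.5)
```

In the toy data the informative feature receives weight 4 and the uninformative one weight 0, so only the former is kept.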
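Rotation forest (RF) [31,32] is a widely used ensemble classifier. Given a dataset $\{x_i, y_i\}$ containing $S$ training samples, where $x_i$ is the data and $y_i$ is the label, and each $x_i$ consists of $n$ features, the training data form a matrix $X$ of size $S \times n$. The decision trees in RF are denoted $D_1, D_2, \ldots, D_N$, with $N$ trees in total. For each tree $D_i$, RF executes the following steps.
a. The feature set $F$ is randomly divided into $K$ disjoint subsets, each containing $n/K$ features, where $K$ is a user-specified parameter.
b. For the $j$th feature subset, a sub-matrix $X_{i,j}$ of the training set $X$ is formed from the corresponding feature columns, and a bootstrap sample of 3/4 of its rows forms the matrix $X'_{i,j}$.
c. The coefficient matrix $M_{i,j}$ is obtained by applying a feature transformation to $X'_{i,j}$, and the coefficient matrices are assembled into the rotation matrix $R_i$, whose columns are arranged to match the original feature order, which is described as follows:

$$R_i = \begin{bmatrix} M_{i,1} & 0 & \cdots & 0 \\ 0 & M_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & M_{i,K} \end{bmatrix} \quad (9)$$

In the classification prediction stage, the confidence $\lambda_j(x)$ that the test sample $x$ belongs to class $j$ is calculated from the outputs of the classifiers $D_i$ using the following formula:

$$\lambda_j(x) = \frac{1}{N} \sum_{i=1}^{N} d_{i,j}(x R_i) \quad (10)$$

where $d_{i,j}(x R_i)$ is the probability, assigned by classifier $D_i$, that the rotated sample $x R_i$ belongs to class $j$. The test sample is assigned to the class with the highest confidence value.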
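Steps a to c can be sketched with NumPy as follows. The sketch builds the rotation matrix for a single tree and averages per-tree class probabilities; it assumes principal component analysis (computed here through a singular value decomposition) as the feature transformation, as in the original Rotation Forest method, and it omits the training of the decision trees themselves. The function names and the toy data are hypothetical.

```python
import numpy as np


def build_rotation_matrix(X, n_subsets, sample_ratio=0.75, seed=0):
    """Build the rotation matrix of one tree from disjoint feature subsets."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    # Step a: randomly split the feature indices into disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    rotation = np.zeros((n_features, n_features))
    n_boot = max(2, int(sample_ratio * n_samples))
    for idx in subsets:
        # Step b: bootstrap sample of the rows, restricted to this feature subset.
        rows = rng.choice(n_samples, size=n_boot, replace=True)
        block = X[np.ix_(rows, idx)]
        block = block - block.mean(axis=0)
        # Step c: principal axes of the block (assumed transformation: PCA).
        _, _, vt = np.linalg.svd(block, full_matrices=True)
        # Writing each block at its own feature indices keeps the original column order.
        rotation[np.ix_(idx, idx)] = vt.T
    return rotation


def average_confidence(tree_probabilities):
    """Average the class probabilities of all trees; the largest entry wins."""
    return np.mean(tree_probabilities, axis=0)


X_toy = np.random.default_rng(1).normal(size=(40, 8))
R = build_rotation_matrix(X_toy, n_subsets=3)
X_rotated = X_toy @ R  # each tree would be trained on its own rotated data

# Hypothetical probabilities for 2 classes from 3 trees.
confidence = average_confidence(np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]))
predicted_class = int(np.argmax(confidence))
```

Because every block is an orthogonal matrix and the feature subsets are disjoint, the assembled matrix is itself orthogonal, so the rotation changes the axes seen by each tree without distorting distances between samples.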

Evaluation Indicator
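To evaluate the performance of RoFDT, we used standard machine learning evaluation metrics in this study, namely accuracy (Acc.), sensitivity (Sen.), precision (Prec.) and the Matthews correlation coefficient (MCC), which can be denoted as below:

$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Sen.} = \frac{TP}{TP + FN}$$

$$\mathrm{Prec.} = \frac{TP}{TP + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN represent True Positive, True Negative, False Positive and False Negative, respectively. Moreover, the receiver operating characteristic (ROC) curve [33][34][35] and the area under the ROC curve (AUC) were also calculated to reflect the performance of RoFDT.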
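These indicators can be computed from the confusion-matrix counts as in the sketch below, which uses the standard definitions of accuracy, sensitivity, precision and MCC. The AUC is computed with the rank-based formulation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which is one common way to obtain it; the function names and the example counts are hypothetical.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision and MCC from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
        # MCC is set to 0 when a row or column of the confusion matrix is empty.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


def auc_from_scores(scores, labels):
    """Rank-based AUC; ties between a positive and a negative count as one half."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=8, tn=7, fp=3, fn=2)
auc = auc_from_scores([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```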

Parameter Evaluation
To maximize the performance of RoFDT, a grid search was employed to tune the FwRF and PsePSSM parameters. When features are extracted using the PsePSSM algorithm, the information content can be adjusted by changing the parameters in Equation (5) to obtain different feature values. We therefore investigated the effect of different PsePSSM parameters on the subsequent classification in order to select the best combination of parameters. The effect of different classifier parameters on prediction accuracy in the Enzyme dataset is shown in Figure 2. The figure shows that the RF classifier achieves the highest accuracy when the number of feature subsets K, the feature selection ratio r and the number of decision trees L (denoted N in the description of RF above) are set to 16, 0.8 and 21, respectively. Therefore, we adopted these values as the optimal parameters in the model.
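A grid search of this kind can be sketched as below. The helper tries every combination of candidate values and keeps the best-scoring one. The scoring function `toy_accuracy` is a hypothetical stand-in that was deliberately constructed to peak at K = 16, r = 0.8 and L = 21; it is not a measured accuracy surface, and the candidate values in `grid` are likewise assumptions. In practice the score would be the cross-validated accuracy of the classifier.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_accuracy(K, r, L):
    # Hypothetical stand-in for cross-validated accuracy, NOT real results.
    return 1.0 - 0.001 * abs(K - 16) - 0.1 * abs(r - 0.8) - 0.001 * abs(L - 21)


# Hypothetical candidate values for the three classifier parameters.
grid = {"K": [4, 8, 16, 32], "r": [0.6, 0.7, 0.8, 0.9], "L": [11, 21, 31]}
best_params, best_score = grid_search(toy_accuracy, grid)
```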

Comparison of Different Feature Models
To estimate the influence of the PsePSSM algorithm on the RoFDT model, we compared it with a model based on the Local Phase Quantization (LPQ) algorithm on the four gold standard datasets in this part of the experiment. The LPQ algorithm was originally proposed for texture description by Ojansivu and Heikkilä [36] and is based on the blur invariance property of the Fourier phase spectrum [37][38][39]. Table 5 lists the five-fold cross-validation (5FCV) outcomes produced by LPQ combined with FwRF on the gold standard datasets. As observed in Table 5, RoFDT obtained the best outcomes in all evaluation indicators. Detailed 5FCV outcomes on the four gold standard datasets are given in Tables S1-S4 of the Supplementary Materials. For a fair comparison, FwRF was set with the same hyperparameters in the experiment.
From the experimental outcomes, it can be seen that combining PsePSSM with FwRF effectively improves the model performance.
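For readers unfamiliar with 5FCV, the splitting scheme can be sketched as follows: the samples are shuffled once and divided into five folds, and each fold serves as the test set exactly once while the remaining four are used for training. The function name `five_fold_indices` and the sample count are hypothetical, and whether the original experiments stratified the folds by class is not stated, so no stratification is assumed here.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into folds of (almost) equal size.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


splits = list(five_fold_indices(10))
```

The reported score of a model is then typically the average of the metric over the five test folds.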

Classifier Model Comparison
To further investigate the influence of the classifier on the performance of RoFDT, we compared it with a support vector machine (SVM) classifier model. The parameters of the SVM were tuned, and its hyperparameters g and c were optimized to 0.6 and 0.5, respectively. The outcomes of the SVM parameter optimization are shown in detail in Table S9 of the Supplementary Materials. As can be seen in Table 6, RoFDT outperformed the SVM model on all four gold standard datasets in accuracy, MCC, sensitivity and AUC; only its precision was slightly lower than that of the SVM model on the Ion Channel and Enzyme datasets. Detailed 5FCV experimental outcomes on the gold standard datasets are shown in Tables S5-S8 of the Supplementary Materials. This comparison shows that the FwRF classifier is better suited to the features used by RoFDT, which helps to increase the prediction accuracy of the model.

Comparison with Previous Models
Using the computing power of modern computers to predict DTIs on a large scale has become increasingly important in new drug research and development, and numerous researchers have constructed different computational models to solve this problem. To further evaluate the capabilities of RoFDT, we compared it with these models. Among them, we chose those that were also run on the same four datasets and evaluated using 5FCV. The AUCs generated by these models are listed in Table 7. As seen in Table 7, RoFDT performed well overall, achieving the best result on Enzyme and the second-highest results on Ion Channel and GPCR. However, constrained by the small sample size of Nuclear Receptor, RoFDT was not sufficiently trained on this dataset and performed only moderately on it.

Case Studies
To verify the ability of RoFDT to predict unknown DTIs, all known DTI pairs were used to train RoFDT, which was then used to predict interactions in the unknown space. We checked the 10 DTIs with the highest prediction scores against SuperTarget [21], a drug target database with a collection of 332,828 DTIs; the drug–target pairs validated in this way were not among the data used for training. The outcomes of the case studies are listed in Table 8, where 7 of the 10 pairs with the highest prediction scores were validated by this database. The case study shows that RoFDT is capable of predicting unknown DTIs. It is interesting to note that although the remaining three pairs are not substantiated by the current database, their interaction may yet be confirmed as research progresses.
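The ranking step of such a case study can be sketched as below: after training on all known pairs, candidate pairs are scored, known pairs are excluded, and the highest-scoring unknown pairs are kept for checking against an external database. The identifiers, the scores and the function name `top_k_predictions` are hypothetical.

```python
import heapq


def top_k_predictions(scores, known_pairs, k=10):
    """Return the k highest-scoring (pair, score) items among pairs not used in training."""
    unknown = ((pair, score) for pair, score in scores.items() if pair not in known_pairs)
    return heapq.nlargest(k, unknown, key=lambda item: item[1])


# Hypothetical prediction scores for four drug-target pairs.
scores = {
    ("D1", "T1"): 0.90,
    ("D1", "T2"): 0.95,
    ("D2", "T1"): 0.40,
    ("D2", "T2"): 0.70,
}
known_pairs = {("D1", "T1")}
top_two = top_k_predictions(scores, known_pairs, k=2)
```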

Discussion
In the present study, we propose RoFDT, a reliable DTI prediction approach that combines protein sequence and drug molecular structure. We first converted the protein sequence information into numerical form using PSSM and extracted its hidden features using PsePSSM. The drug structure was then encoded as a numerical descriptor based on molecular fingerprinting. Finally, the performance of RoFDT was verified using FwRF on four benchmark datasets, and its prediction results were confirmed by an authoritative database. These outcomes show that RoFDT is a valid approach for predicting DTIs and can provide new insights for potential drug discovery.
RoFDT exhibits competitive advantages over previous DTI prediction models. The reason is that protein sequences provide rich information for DTI prediction, and their PSSM descriptors are well suited to the PsePSSM feature extraction method, which extracts their latent features to the greatest possible extent. In addition, the molecular fingerprint descriptors of drug structures faithfully represent the substructure properties of different drugs and thus have a high characterization capability. As a result, RoFDT was able to predict DTIs more accurately and provide a more reliable theoretical basis for drug development.
However, RoFDT still has some limitations. First, the utilization of protein sequence information by RoFDT relies mainly on PSSM, and richer descriptions need to be explored further. Second, although the feature extraction method used by RoFDT has achieved good results, it still relies on manual experience, and the degree of automation needs to be improved. Finally, RoFDT requires a large amount of training data and is not very sensitive to newly discovered drug targets. In future research, we intend to explore more intelligent feature characterization methods to overcome these shortcomings and further enhance the performance of RoFDT.

Conclusions
As a pioneering step in drug development, the reliable prediction and identification of DTIs plays an essential role in innovative drug research. In the present study, we combined protein sequence and drug molecular structure to design a computational model for DTI prediction. The proposed model achieves excellent results on the gold standard datasets, including Enzyme, GPCR, Ion Channel and Nuclear Receptor. The model also performs strongly in comparison with other feature extraction models, other classifier models and previous methods. In addition, 7 of the top 10 DTIs predicted by the proposed model have been verified by a relevant database. These outcomes suggest that the RoFDT model can be employed as a stable and dependable tool to provide valuable target candidates for innovative drug research.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biology11050741/s1, Table S1: 5FCV results of FwRF combined with LPQ on Enzyme dataset, Table S2: 5FCV results of FwRF combined with LPQ on Ion Channel dataset, Table S3: 5FCV results of FwRF combined with LPQ on GPCR dataset, Table S4: 5FCV results of FwRF combined with LPQ on Nuclear Receptor dataset, Table S5: 5FCV results of SVM classifier model on Enzyme dataset, Table S6: 5FCV results of SVM classifier model on Ion Channel dataset, Table S7: 5FCV results of SVM classifier model on GPCR dataset, Table S8: 5FCV results of SVM classifier model on Nuclear Receptor dataset, Table S9: The results of SVM parameter optimization using grid search method on Enzyme dataset, Figure S1: The effect of different PsePSSM parameters on classifier performance, Figure S2: The effect of different feature selection ratios on classifier performance.