Prediction of Drug–Target Interactions by Combining Dual-Tree Complex Wavelet Transform with Ensemble Learning Method

Identification of drug–target interactions (DTIs) is vital for drug discovery. However, traditional experimental approaches are time consuming and expensive. There is therefore an urgent need for effective computational methods that predict DTIs and shorten the development cycles of new drugs. In this study, we present a novel computational approach to identify DTIs, which uses protein sequence information and the dual-tree complex wavelet transform (DTCWT). More specifically, a position-specific scoring matrix (PSSM) was computed for each target protein sequence to capture its evolutionary information. Then, DTCWT was used to extract representative features from the PSSM, which were combined with the drug fingerprint features to form the feature descriptors. Finally, these descriptors were fed to the Rotation Forest (RoF) model for classification. A 5-fold cross validation (CV) was adopted on four datasets (Enzyme, Ion Channel, GPCRs (G-protein-coupled receptors), and NRs (Nuclear Receptors)) to validate the proposed model; our method yielded average accuracies of 89.21%, 85.49%, 81.02%, and 74.44%, respectively. To further verify the performance of our model, we compared the RoF classifier with two state-of-the-art algorithms: the support vector machine (SVM) and the k-nearest neighbor (KNN) classifier. We also compared it with some other published methods. Moreover, the prediction results for the independent dataset further indicated that our method is effective for predicting potential DTIs. Thus, we believe that our method is suitable for facilitating drug discovery and development.


Introduction
Detecting the interactions between compounds (drugs, molecules, ligands) and proteins (targets) is one of the most active areas of genomic drug development, as it plays a critical role in the discovery of novel drug candidates [1]. According to statistics from the US Food and Drug Administration (FDA), developing a new drug costs billions of dollars [2]. However, only a few drug candidates are allowed to enter the market, as most of them fail in clinical trials or show uncertain side effects [3]. Furthermore, some studies have reported that the interactions between target proteins and drugs have a significant impact on the toxic side effects of drug candidates [4]. This makes the study of drug-target interactions (DTIs) very useful for detecting the toxicity of candidate drugs. Over the past few years, numerous experimental approaches have been introduced to identify DTIs, but only a small number of drug-target pairs have been tested and confirmed as interacting [5,6]. In addition, these traditional experimental methods suffer from high false-positive and false-negative rates [7]. For these reasons, there is a strong demand for novel computational approaches that shorten the drug development cycle and reduce the time taken to detect drug-target pairs [8].
With the rapid increase in publicly available chemical and biological data, various types of related databases based on the relationships between drugs and proteins (targets) have been established, such as DrugBank [9], KEGG [10], TTD [11], and SuperTarget & Matador [12]. These public databases store a large amount of DTI information, and it is essential for researchers to develop novel and robust computational methods for detecting potential DTIs on a genome-wide scale.
To date, many computational approaches combining biological information and descriptor information have been used to identify drug-target interactions, such as docking simulation [13,14], ligand-based methods [15], and literature text mining methods [16]. However, these methods have some inevitable limitations. Docking simulation is an effective molecular modeling technique that uses dynamic simulation to predict interactions between drug molecules and target proteins. It usually requires the 3D structures of the targets, a requirement that is difficult to meet because this information is only available for a small fraction of all proteins. Text mining is used in molecular biology to reveal associations between proteins or genes and their functional relationships from text documents. It relies on keywords to detect potential drug-target interactions, but keyword matches are hard to exploit reliably. Given these drawbacks, it is more practical to develop novel computational models that identify DTIs without requiring information about ligands or 3D target structures.
Recently, various approaches have been reported for detecting novel DTIs. Yamanishi et al. [17] developed a new statistical method that uses genomic sequence information and chemical structure to predict unknown DTI networks. Wang et al. [18] reported a computational model, which utilized a stacked auto encoder based on deep learning that can effectively extract raw data information to identify drug-target interactions. Hao et al. [19] introduced a useful algorithm, called dual-network integrated logistic matrix factorization (DNILMF), which consists of four steps to detect potential drug-target interactions. Wen et al. [20] suggested a deep-learning based algorithm called DeepDTIs. DeepDTIs utilized unsupervised pretraining to abstract representations from raw input descriptors. It can be applied to detect whether a new target interacts with some existing drugs. Ezzat et al. [21] presented a framework that combined feature dimensionality reduction and the ensemble learning model for predicting DTIs. Huang et al. [22] developed a method called MolTrans (Molecular Interaction Transformer) to predict DTIs that combined the interaction modeling module and sub-structural pattern mining algorithm. Zhang et al. [23] developed a method called SPVes that combined SMILES2Vec and ProtVec to convert SMILES strings of drug compounds and sequences of target proteins as feature vectors to predict DTIs. Wang et al. [24] built a heterogeneous drug-target graph to detect DTIs. It used known DTIs, drug-drug, and target-target similarities. Redkar et al. [25] used dipeptide composition and drugs with a molecular descriptor to encode the target protein sequence, and then a machine learning method that combined wrapper feature extraction and the synthetic minority oversampling technique (SMOTE) was adopted to predict DTIs. Although these methods have accelerated discoveries concerning drug-target interactions, there is still room for improvement.
In this study, we present a computational approach to identify potential DTIs based on chemical fingerprints and target protein sequences. The prediction process is divided into three stages. Firstly, the target protein sequences were transformed into position-specific scoring matrices (PSSMs) to obtain their evolutionary information. Secondly, an effective feature extraction method, the dual-tree complex wavelet transform (DTCWT), was applied to extract feature vectors from the PSSMs. Finally, we combined the drug molecule fingerprint information with these vectors to construct feature descriptors and fed them into the Rotation Forest (RoF) classifier. The votes of the decision trees in the forest then indicate whether a drug and a target protein are likely to interact. To verify the predictive ability of the proposed method, we applied 5-fold cross-validation (CV) on four benchmark datasets: Enzyme, Ion Channel, GPCRs (G-protein-coupled receptors), and NRs (Nuclear Receptors). Furthermore, we compared the predictive performance of the proposed model with state-of-the-art SVM and KNN classifiers and applied our method to an independent dataset. The comprehensive results demonstrated that our approach is efficient and reliable for predicting potential DTIs.

Evaluation Metrics
In this work, to assess the predictive capacity of the proposed approach, we employed four evaluation metrics: accuracy (ACC.), precision (PR.), sensitivity (Sen.), and the Matthews correlation coefficient (MCC). These conventional evaluation indicators are defined as follows:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{PR} = \frac{TP}{TP + FP}$$
$$\mathrm{Sen} = \frac{TP}{TP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where true positive (TP) represents the number of interacting drug-target pairs predicted correctly, false positive (FP) represents the number of non-interacting pairs predicted to be interacting, true negative (TN) represents the number of non-interacting pairs predicted correctly, and false negative (FN) represents the number of interacting pairs predicted to be non-interacting. Receiver Operating Characteristic (ROC) curves [26] were plotted based on these parameters, and the area under the ROC curve (AUC) was calculated to summarize each ROC curve numerically. In this way, we were able to provide a more comprehensive measure than other evaluation metrics. The flowchart of the proposed approach for identifying potential DTIs is shown in Figure 1.
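As a rough illustration, the four metrics can be computed from the confusion-matrix counts using their standard definitions. The function name and the example counts below are hypothetical and are not taken from the experiments in this paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return ACC, PR, Sen and MCC computed from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    # Guard against empty denominators (e.g. no positive predictions).
    pr = tp / (tp + fp) if (tp + fp) else 0.0
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PR": pr, "Sen": sen, "MCC": mcc}


# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=80, fp=20, tn=70, fn=30)
print(example)
```

Returning 0.0 when a denominator vanishes is one possible convention; other tools may report such cases as undefined.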

Parameter Discussion
To achieve better prediction results, it is important to select suitable values of the parameters K and L for the Rotation Forest (RoF) model. Here, K represents the number of feature subsets and L represents the total number of decision trees in the RoF classifier. In this part, we used a grid search to find the optimal RoF parameters. Figure 2 shows the accuracy surface of the RoF model as a function of K and L. The model obtained the best predictive performance at K = 30 and L = 17, so we set K and L to 30 and 17, respectively, in this work.
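A grid search of this kind can be sketched as an exhaustive loop over candidate (K, L) pairs. In the sketch below, `toy_accuracy` is a made-up placeholder surface standing in for the cross-validated accuracy of an RoF model; it is not the surface of Figure 2, and the candidate ranges are assumptions.

```python
from itertools import product


def grid_search(evaluate, k_values, l_values):
    """Return (best_score, best_k, best_l) over all (K, L) combinations."""
    best = None
    for k, l in product(k_values, l_values):
        score = evaluate(k, l)
        if best is None or score > best[0]:
            best = (score, k, l)
    return best


def toy_accuracy(k, l):
    # Hypothetical stand-in for "train RoF with (K, L) and return CV accuracy".
    return 1.0 - 0.0001 * ((k - 30) ** 2 + (l - 17) ** 2)


best_score, best_k, best_l = grid_search(toy_accuracy, range(5, 51, 5), range(1, 31))
print(best_k, best_l)
```

In practice `evaluate` would train and cross-validate the classifier for each pair, so the cost grows with the product of the two grid sizes.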

Performance Evaluations on Four Golden Standard Datasets
To further test the reliability of our method and avoid over-fitting, we performed 5-fold cross-validation (CV) on four datasets (Enzyme, Ion Channel, GPCRs, and NRs). More specifically, each DTI dataset was split into five parts; four of them were used as the training set and the remaining one was used to test the model. The CV process was repeated for five rounds to generate five models. For the sake of consistency, all the parameters were kept identical across these experiments. Tables 1-4 present the results of the proposed model under 5-fold CV on the four datasets. On the Enzyme and Ion Channel datasets, the model obtained average accuracies of 89.21% and 85.49%, respectively. When predicting DTIs for the GPCRs dataset, we obtained average ACC., PR., Sen., MCC, and AUC values of 81.02%, 81.51%, 80.38%, 69.42% and 0.8775, with corresponding standard deviations of 3.77%, 2.90%, 4.47%, 4.76% and 0.0332, respectively. When predicting DTIs for the NRs dataset, we obtained average ACC., PR., Sen., MCC, and AUC values of 74.44%, 72.31%, 78.17%, 61.23%, and 0.7755, with corresponding standard deviations of 5.34%, 7.29%, 10.64%, 5.75%, and 0.0271, respectively. The ROC curves of the RoF classifier obtained for the four datasets are shown in Figures 3-6.
To show that the predictive performance of our model does not depend on the selection of negative samples, we applied our method to five different sets of GPCRs negative samples, each randomly selected from the non-interacting drug-target pairs. The predictive results for the five negative sample sets are listed in Table 5. It can be observed that the experimental results of these five samples were not significantly different. The average ACC., PR., Sen., MCC, and AUC values are higher than 81%, 82%, 79%, 69% and 0.88, respectively. These results further indicate that our method for constructing the negative samples in this work is effective for predicting potential DTIs. The prediction performance can be attributed to the robust feature descriptors and the powerful RoF classifier. The application of DTCWT to extract feature vectors is novel and effective. As a sequence encoding method, PSSM can retain the useful information of amino acid sequences. These results suggest that the RoF algorithm is suitable for detecting potential drug-target interactions.
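The 5-fold splitting used in this section can be sketched as follows. This is a minimal illustration, assuming a simple shuffled split without stratification; the function name and the seed are hypothetical. The sample count of 2952 corresponds to the balanced Ion Channel set (1476 positive plus 1476 negative pairs).

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of a shuffled split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th index goes to the same fold, so folds are disjoint.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(five_fold_indices(2952))
for train, test in splits:
    print(len(train), len(test))
```

Each sample appears in exactly one test fold, so averaging the five fold scores uses every sample once for testing.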

Comparison Results between LPQ-Based Model and the Proposed Method
Many descriptors have been introduced for detecting DTIs, with local phase quantization (LPQ) [27] being one of the most popular. To verify the performance of the DTCWT descriptor, we compared it with the LPQ method. The cross-validation results of the LPQ descriptor combined with the RoF classifier are summarized in Table 6. It can be observed that the proposed approach generated the best results in terms of ACC., PR., MCC, and AUC values. For the sake of consistency, the same parameters were used in the comparison experiment. From the comparison results, we can observe that the DTCWT descriptor combined with the RoF classifier improves the prediction performance of the model. The detailed 5-fold CV results obtained by the LPQ algorithm on the four datasets are summarized in the Supplementary Materials, Tables S1-S4.

Comparison with SVM and KNN Classifier
Various machine learning algorithms have previously been used to identify DTIs [28,29]. To further evaluate the predictive capacity of our method, we used the same feature descriptors in the SVM and KNN classifiers and compared the predictive performance on the same four datasets. SVM can handle both linear and non-linear classification problems. KNN is a supervised machine learning technique for classification. The LIBSVM tool [30] was used in this paper to train the SVM model. Two SVM parameters need to be optimized: c (the penalty parameter) and g (the kernel function parameter). Both were optimized by a grid search, with c values from 1 to 25 and g values from 0.1 to 5. In the experiments for the Enzyme and Ion Channel datasets, we set c = 7, g = 0.2 and c = 3, g = 4, respectively. For the GPCRs and NRs datasets, we set c = 7, g = 1.3 and c = 23, g = 0.1, respectively. The KNN model requires a number of neighbors K and a distance function. Here, we searched K from 1 to 10 to train the KNN model; in this paper, K was set to 5 and the L1 distance was used. Table 7 lists all the experimental results of the RoF, SVM, and KNN models on the four DTI datasets. From these results, we can see that our method achieves better prediction results than the SVM- and KNN-based methods. For example, the AUC gaps between SVM and RoF on the four datasets were 0.1486, 0.1587, 0.2123, and 0.1535, respectively. Similarly, the ACC gaps between KNN and RoF were 8.68%, 6.47%, 17.16%, and 26.11%, respectively. The ROC curves and comparison results yielded by the SVM and KNN models are shown in the Supplementary Materials, Figures S1-S5.
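For reference, a KNN classifier with K = 5 and the L1 (Manhattan) distance can be sketched in a few lines. This is a plain-Python illustration on toy two-dimensional points, not the implementation used in the experiments; the function names and data are hypothetical.

```python
from collections import Counter


def l1_distance(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=5):
    """Predict the majority label among the k nearest training samples."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: l1_distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: three samples per class.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, [5, 4]))
print(knn_predict(train_x, train_y, [0, 0]))
```

With two classes, an odd K such as 5 avoids tied votes.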

Comparison with Different Methods on the Same Dataset
In recent years, many computational approaches have been put forward to predict DTIs. To further confirm the effectiveness of our method, we compared it with previous studies that used the same benchmark datasets, including Yamanishi et al. [31], KBMF2K [32], MLCLE [33], AM-PSSM [34], SIMCOMP [35], DBSI [36], and NETCBP [37]. The average AUC values of these approaches are summarized in Table 8. It can be observed that our method performed better than the other methods on the Enzyme, Ion Channel, and GPCRs datasets. However, it did not work as well on the NRs dataset, perhaps because the NRs dataset was too small to train the RoF model optimally.

Performance on the Independent Dataset
To demonstrate the generalizability of our model, we evaluated it on an independent dataset. The Enzyme dataset was used as the training set and the Drugbank-approved dataset was used as the testing set. For fairness, we used the same RoF parameters (K = 30, L = 17). On the Drugbank-approved dataset, our model yielded an accuracy of 72.37%, PR. of 69.23%, Sen. of 74.46%, MCC of 59.94%, and AUC of 0.7833. The predictive results on the independent dataset further indicate that our method is useful for predicting unknown DTI pairs.

Data Collection
In this work, we selected four DTI datasets: Enzyme, Ion Channel, GPCRs and Nuclear Receptors (NRs). These data can be collected from BRENDA [38], KEGG [39], the SuperTarget database [40], and DrugBank [41]. The numbers of drug compounds, target proteins, and known interactions are summarized in Table 9. We constructed a bipartite graph to represent the relations between drugs and proteins, where the nodes represent the target proteins or drug compounds, and the links represent the interactions between them. Taking the Ion Channel dataset as an example, the total number of possible drug-target pairs in the corresponding bipartite graph is 42840 (204 × 210). However, only 1476 pairs are known to interact. Thus, the number of possible negative Ion Channel DTI pairs is 41364 (42840 − 1476), which is far larger than the number of positive samples. To deal with this imbalance, we randomly selected 1476 non-interacting DTI pairs as the negative samples. The negative samples obtained in this way may contain some truly interacting pairs. However, given the size of the DTI datasets, the probability of this is very small. We also used a dataset called Drugbank-approved [42] as the independent dataset to further verify the predictive ability of our model. The drugs and proteins in this dataset are all approved by the FDA and the DrugBank database [41]. After removing the non-existing drugs and proteins, we obtained 1555 drugs, 1591 target proteins, and 5831 interactions.
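The balanced negative sampling described here can be sketched as follows. This is a minimal illustration on a toy bipartite graph; the identifiers, the seed, and the function name are hypothetical, and the paper does not specify how the random selection was implemented.

```python
import random


def sample_negatives(drugs, targets, positives, seed=0):
    """Randomly pick as many non-interacting pairs as there are known positives."""
    positive_set = set(positives)
    # All drug-target pairs in the bipartite graph that are not known interactions.
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in positive_set]
    return random.Random(seed).sample(candidates, len(positive_set))


# Toy example: 6 drugs x 5 targets = 30 possible pairs, 4 known interactions.
drugs = ["D%d" % i for i in range(6)]
targets = ["T%d" % i for i in range(5)]
positives = [("D0", "T0"), ("D1", "T3"), ("D4", "T2"), ("D5", "T4")]
negatives = sample_negatives(drugs, targets, positives)
print(negatives)

# Same arithmetic as for the Ion Channel dataset in the text.
print(204 * 210, 204 * 210 - 1476)
```

Changing the seed gives a different negative set, which is how the robustness to negative-sample selection can be checked.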

Characterization of Drug Molecules
In previous research, many descriptors have been proposed to represent the properties of drug molecules, such as topological, geometric, constitutional, and quantum chemical descriptors. Recently, some studies found that molecular substructure fingerprints can be used to represent drug compound structures [43]. By encoding the drugs as Boolean substructure vectors, the substructure fingerprints directly indicate whether a compound contains a specific chemical substructure. In a binary fingerprint vector, each bit position corresponds to a specific substructure. If the corresponding substructure is present in a given drug molecule, the bit is set to 1; otherwise, it is set to 0. In this way, the complex structures of drug molecules can be represented by the substructure fingerprints. Although the fingerprint splits the whole molecule into many fragments, the substructure information is retained, so it still provides structural information about the drug molecule. Moreover, substructure fingerprints do not need 3D structural data for the target, so they avoid the associated error accumulation.
The substructure fingerprint sets employed in this work were downloaded from the PubChem System (available at https://pubchem.ncbi.nlm.nih.gov/, accessed on 4 June 2009). It defines 881 chemical substructures, which have each been assigned to a specific site. Therefore, each drug molecule feature has been transformed into a binary vector of 881 dimensions.
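The encoding of a drug as an 881-dimensional binary vector can be illustrated with the sketch below. The function name and the example bit positions are hypothetical; the actual mapping from bit positions to substructures is defined by the PubChem fingerprint specification.

```python
def fingerprint_vector(present_bits, n_bits=881):
    """Return a binary vector with 1 at each position whose substructure is present."""
    vector = [0] * n_bits
    for bit in present_bits:
        if not 0 <= bit < n_bits:
            raise ValueError("bit position out of range: %d" % bit)
        vector[bit] = 1
    return vector


# Hypothetical drug containing the substructures at positions 0, 12 and 880.
drug_fp = fingerprint_vector([0, 12, 880])
print(len(drug_fp), sum(drug_fp))
```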

Representation of Target Proteins
The position-specific scoring matrix (PSSM) [44] was proposed for detecting distantly related proteins. In recent years, PSSM has been widely used for mining the evolutionary information of protein sequences [45]. The PSSM is a P × 20 matrix, where P is the number of amino acids in the protein and the 20 columns correspond to the 20 native amino acids. The matrix can be written as follows:
$$\mathrm{PSSM} = \begin{bmatrix} \phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,20} \\ \phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{P,1} & \phi_{P,2} & \cdots & \phi_{P,20} \end{bmatrix}$$
where $\phi_{i,j}$ in the i-th row of the PSSM indicates the probability of the i-th residue being mutated into the j-th native amino acid.
In this article, we used the position-specific iterated BLAST (PSI-BLAST) [46] tool, run against the SwissProt database, to generate the PSSM for the purpose of extracting evolutionary information. To obtain highly homologous sequences, the expectation value (e-value) was set to 0.001, the number of iterations was set to 3, and the other parameters were kept at their default values [47]. The SwissProt database and PSI-BLAST can be freely obtained from http://blast.ncbi.nlm.nih.gov/Blast.cgi (accessed on 1 January 2001).
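As an illustration of these settings, the sketch below assembles a command line for the `psiblast` program from the NCBI BLAST+ suite. The paper does not state which PSI-BLAST release or exact options were used, so mapping the e-value of 0.001 to `-evalue` is an assumption, and the file and database names are hypothetical. The command is only built here, not executed.

```python
def psiblast_command(query_fasta, database, pssm_out, evalue=0.001, iterations=3):
    """Build (but do not run) a BLAST+ psiblast argument list that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,              # protein sequence in FASTA format
        "-db", database,                    # name of a local, formatted protein database
        "-evalue", str(evalue),             # assumed mapping of the e-value setting
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_out,        # text file that will hold the PSSM
    ]


cmd = psiblast_command("target.fasta", "swissprot", "target.pssm")
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a local SwissProt database are installed.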

Feature Extraction Method
The dual-tree complex wavelet transform (DTCWT) [48] is an enhanced version of the discrete wavelet transform (DWT) [49]. It was developed to help improve the directional selectivity impaired by DWT. In addition, it compensates for the fact that DWT has a large computation volume and high complexity. Unlike conventional DWT, DTCWT is constructed by two real DWTs [50]. The first DWT is used to generate the real part of the transform, while the second DWT generates the imaginary part.
The DTCWT addresses the disadvantages of DWT regarding shift invariance and directional selectivity in two or more dimensions. The directional selectivity of DTCWT is provided by wavelets that are approximately analytic. It produces six directionally selective sub-bands (±15°, ±45° and ±75°), with (R) and (I) denoting the real and imaginary parts, respectively. The flowchart of the DTCWT algorithm is shown in Figure 7. In the first stage, the filters are denoted as $h_i(n)$ and $g_i(n)$. A 2D image $F(a, b)$ is represented by the 2D DTCWT through a complex scaling function and a series of dilations and translations of six complex wavelet functions $\alpha^{\theta}_{j,l}$. The directionality of the complex wavelet function is given by $\theta \in \Theta = \{\pm 15^{\circ}, \pm 45^{\circ}, \pm 75^{\circ}\}$. That is, at each decomposition level, $F(a, b)$ is decomposed by DTCWT into a low-pass sub-band and six complex-valued high-pass sub-bands, and each high-pass sub-band corresponds to a specific direction $\theta$. In this experiment, after DTCWT was applied to the PSSM, each target protein sequence was represented as a 256-dimensional feature vector.
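To make the dimensions concrete, the sketch below combines an 881-bit drug fingerprint with a 256-dimensional protein feature vector into one descriptor of 881 + 256 = 1137 values. Simple concatenation, the ordering of the two parts, and the function name are assumptions for illustration; the text only states that the two kinds of features are combined.

```python
def build_descriptor(drug_fingerprint, protein_features):
    """Concatenate an 881-bit drug fingerprint and a 256-dim protein feature vector."""
    if len(drug_fingerprint) != 881:
        raise ValueError("expected an 881-dimensional fingerprint")
    if len(protein_features) != 256:
        raise ValueError("expected a 256-dimensional protein feature vector")
    return [float(b) for b in drug_fingerprint] + [float(x) for x in protein_features]


# Placeholder inputs: an all-zero fingerprint and an all-zero protein vector.
descriptor = build_descriptor([0] * 881, [0.0] * 256)
print(len(descriptor))
```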

Rotation Forest Classifier (RoF)
Rotation Forest (RoF) is an effective and powerful ensemble learning method which was first proposed by Rodriguez [51]. The main contribution of RoF is to establish the ensemble classifiers that can obtain a balance between diversity and accuracy. In this algorithm, the attribute set of samples is first randomly divided, and each subset is transformed by a linear transformation to increase the diversity of samples. Then, the transformed subsets are fed into different decision trees, and the final classification results can be aggregated from the votes of all trees in the forest.
Suppose that $\{q_i, p_i\}$ contains T samples, of which $q_i = (q_{i1}, q_{i2}, q_{i3}, \cdots, q_{iL})$ is an L-dimensional feature vector. Let Z represent the training sample set containing T training samples, forming a T × L matrix. Let U represent the feature set and M the label set. Assume the number of decision trees is S; then the decision trees can be denoted as $D_1, D_2, D_3, \cdots, D_S$. The rotation forest algorithm is implemented as follows.
(1) Choose a suitable parameter M for which U can be randomly split into M disjoint subsets, with each feature subset containing L/M features.
(2) Let $U_{i,j}$ represent the j-th feature subset used to train the classifier $D_i$. The sample subset $Z_{i,j}$ is a non-empty subset of the training samples, randomly selected in a certain proportion.
(3) Apply PCA [52] on $Z_{i,j}$ and store the resulting coefficients in the matrix $\lambda_{i,j}$.
(4) The coefficients obtained from the matrices $\lambda_{i,j}$ are used to construct a sparse rotation matrix $\phi_i$, which can be defined as follows:
$$\phi_i = \begin{bmatrix} \lambda_{i,1} & 0 & \cdots & 0 \\ 0 & \lambda_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{i,M} \end{bmatrix}$$
During the prediction process, given a test sample g, let $R_{i,j}(g\phi_i^a)$ denote the probability, produced by the classifier $D_i$, that g belongs to class j, where $\phi_i^a$ is $\phi_i$ with its columns rearranged to match the original feature order. Then, the confidence of each class is calculated via the average combination:
$$V_j(g) = \frac{1}{S}\sum_{i=1}^{S} R_{i,j}(g\phi_i^a)$$
Finally, g is assigned to the class with the largest $V_j(g)$ value.
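The final voting step of the ensemble, averaging the per-class outputs of all trees and choosing the class with the highest average, can be sketched as follows. The per-tree probabilities are hypothetical numbers; the rotation and tree-training steps are omitted here.

```python
def average_vote(tree_probabilities):
    """Average per-class probabilities over trees and return (best_class, confidences)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    confidence = [
        sum(probs[j] for probs in tree_probabilities) / n_trees
        for j in range(n_classes)
    ]
    best_class = max(range(n_classes), key=confidence.__getitem__)
    return best_class, confidence


# Hypothetical outputs of three trees for classes (non-interacting, interacting).
tree_outputs = [[0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]
predicted_class, confidence = average_vote(tree_outputs)
print(predicted_class, confidence)
```

Averaging probabilities rather than counting hard votes lets a confident tree outweigh a marginal one, which is one reason average combination is a common choice for such ensembles.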

Case Study
To further demonstrate the generality of the proposed method, we applied it to two real-life drug-target pairs; the drug was Flurbiprofen and the two target proteins were prostaglandin-endoperoxide synthase 1-type and inhibitor of nuclear factor kappa-B kinase subunit epsilon. The lengths of the two proteins are 599 and 716, respectively. Our method predicted that Flurbiprofen would interact with prostaglandin-endoperoxide synthase 1-type with a probability score of 0.96, and would not interact with inhibitor of nuclear factor kappa-B kinase subunit epsilon, for which the probability score was 0.08. The interacting drug-target pair has been confirmed by the KEGG database. The results for these two real-life drug-target pairs further indicate that the proposed model is effective for predicting potential DTIs.

Conclusions
In this article, we developed a novel computational method for identifying DTIs using information regarding target protein sequences and the substructure fingerprints of drug molecules. It combines the position-specific scoring matrix (PSSM), dual-tree complex wavelet transform (DTCWT), and Rotation Forest (RoF). In order to evaluate the prediction performance of the proposed method, we performed tests on four datasets (Enzyme, Ion Channel, GPCRs, and Nuclear Receptors) by adopting a 5-fold cross validation (CV). The proposed approach obtained high average accuracies of 89.21%, 85.49%, 81.02%, and 74.44%, respectively. To verify the predictive capacity of our method, we compared it with the SVM and KNN algorithms and some existing studies. The experimental results on the independent dataset further demonstrated that our model can be used as a valuable tool to predict potential drug-target interactions. In the future, we need to find more efficient feature extraction methods and reduce the computational complexity for DTI prediction.
Supplementary Materials: The following are available online, Table S1: 5-fold CV results achieved by LPQ-based method on Enzyme dataset, Table S2: 5-fold CV results achieved by LPQ-based method on Ion Channel dataset, Table S3: 5-fold CV results achieved by LPQ-based method on GPCRs dataset, Table S4: 5-fold CV results achieved by LPQ-based method on Nuclear Receptor dataset, Figures S1-S5: ROC curves and comparison results yielded by the SVM and KNN models.
Data Availability Statement: All the data are available at https://github.com/jie-pan111/prediction_of_DTIs (accessed on 1 September 2021) and protein sequence data are available at http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/ (accessed on 1 September 2021).