Article

MSPEDTI: Prediction of Drug–Target Interactions via Molecular Structure with Protein Evolutionary Information

1
Big Data and Intelligent Computing Research Center, Guangxi Academy of Sciences, Nanning 530007, China
2
College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277160, China
3
Computer Science and Technology, Tongji University, Shanghai 200092, China
4
School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
5
School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
*
Authors to whom correspondence should be addressed.
Biology 2022, 11(5), 740; https://doi.org/10.3390/biology11050740
Submission received: 22 April 2022 / Revised: 3 May 2022 / Accepted: 4 May 2022 / Published: 13 May 2022
(This article belongs to the Special Issue Intelligent Computing in Biology and Medicine)


Simple Summary

Drug discovery is the process of identifying potential new compounds through biological, chemical, and pharmacological means. Billions of dollars are spent each year on research aimed at discovering, designing, and developing new drugs for a wide range of diseases. However, the research and development of new drugs remain time-consuming and sometimes difficult to complete. With the development of new experimental techniques, huge amounts of data are generated at different stages of drug development. Biomedical research, especially in the field of drug discovery, is currently undergoing a major shift towards “big data” applications of artificial intelligence technologies. Therefore, a key challenge for future drug discovery research is the development of robust artificial-intelligence-based predictive tools for drug–target interactions (DTIs) that can study biomedical problems from multiple perspectives. In this study, a deep-learning-based prediction model for DTIs was designed by combining information on drug structure and protein evolution to provide theoretical support for drug research.

Abstract

The key to new drug discovery and development is, first and foremost, the identification of the molecular targets of drugs, which advances both drug discovery and drug repositioning. However, the traditional experimental identification of drug–target interactions (DTIs) is a costly, lengthy, high-risk, and low-success-rate undertaking. Therefore, more and more pharmaceutical companies are using computational technologies to screen existing drug molecules and mine new drugs, thereby accelerating new drug development. In the current study, we designed a deep learning computational model, MSPEDTI, based on Molecular Structure and Protein Evolutionary information to predict potential DTIs. The model first fuses protein evolutionary information and drug structure information, then uses a deep learning convolutional neural network (CNN) to mine their hidden features, and finally predicts the associated DTIs with an extreme learning machine (ELM). In five-fold cross-validation experiments, MSPEDTI achieved 94.19%, 90.95%, 87.95%, and 86.11% prediction accuracy on the gold-standard datasets enzymes, ion channels, G-protein-coupled receptors (GPCRs), and nuclear receptors, respectively. MSPEDTI remained competitive in ablation experiments and in comparison with previous methods. Additionally, 7 of the 10 highest-scoring DTIs predicted by MSPEDTI were confirmed in the SuperTarget database. These outcomes suggest that MSPEDTI can provide reliable drug candidate targets and facilitate drug repositioning and drug development.

1. Introduction

Drug research is a global development problem. In the past few decades, the drug-targeted therapy strategy has achieved great success [1,2]. Finding specific drugs for targets is the focus of pharmaceutical research and development, which has made an indelible contribution to human health [3]. However, the rate of new drug development has been declining in recent years, and the cost of research and development has been rising [4]. The main reason for this is that the early screening of a large number of drug candidates in drug research still relies mainly on time-consuming and labor-intensive experimental methods, and the later discovery of unsatisfactory efficacy or toxic side effects of drugs leads to the failure of development. Therefore, efficient and high-throughput computational techniques in the early stages of drug research can play an important role in targeting and saving costs in early development [5,6,7,8].
With the rapid development of bioinformatics, much progress has been made in using computational and simulation approaches to predict DTIs. The quantitative structure–activity relationship (QSAR) approach uses the physicochemical properties or structural parameters of a molecule to study, quantitatively and by mathematical means, the interaction between small molecules and biological macromolecules. Casañola-Martin et al. proposed a QSAR model for predicting anti-tyrosinase activity and demonstrated its effectiveness in subsequent in vitro experiments, which supports faster discovery of treatments for skin disease [9]. Kar et al. proposed a QSAR-based approach to predict the carcinogenicity of drug compounds and identified key contributing factors by analyzing the contribution of molecular fragments to carcinogenicity [10]. Molecular docking (MD) is a computational simulation method that studies the optimal binding sites between drug molecules and target proteins by structural matching and energy matching, and predicts their binding patterns and affinity [11]. Wallach et al. proposed a model that normalizes docking scores using a virtually generated decoy set, which avoids the variability caused by differences in physical properties when identifying active compounds in large screening libraries and thereby extends the applicability of the model [12].
Recently, computational methods for predicting DTIs based on protein target sequences have achieved excellent results and are favored by researchers for their use of reliable, high-quality characterization information enriched by raw data to ensure the accuracy of prediction results [13,14,15,16,17,18]. For instance, Lan et al. proposed a PUDT model combining protein target sequences and drug compound structures, which greatly improved the accuracy of DTI prediction using a weighted SVM classifier [19]. Cao et al. aimed to predict DTIs by using an extended structure–activity relationship method at the genome-scale level. In subsequent experiments, this approach gained good results [20].
In the present study, we combined protein sequence evolution with drug structure information to propose a deep learning MSPEDTI model to predict hidden DTIs. Concretely, MSPEDTI first fuses protein sequence information characterized by the Position-Specific Scoring Matrix (PSSM) and drug structure information characterized by molecular fingerprinting, and then automatically extracts them into continuous, low-dimensional, information-rich features using a deep learning CNN, thus avoiding the disadvantages of manual features such as tediousness, sparsity, and high dimensionality. Finally, the ELM classifier is used to accurately determine whether drug–target pairs are associated or not. In the gold-standard dataset, we evaluated MSPEDTI using the five-fold cross-validation (5CV) approach. Compared with other previous methods, MSPEDTI was able to learn valid biological characteristics for predicting DTIs and showed better performance. The robustness of MSPEDTI is also demonstrated by the experimental results of the case study, which can provide effective candidate targets for new drug research. The supporting data used in this study can be downloaded from https://github.com/look0012/MSPEDTI (accessed on 1 April 2022).

2. Materials and Methods

2.1. Gold-Standard Datasets

In the present study, we implemented the MSPEDTI model using the gold-standard datasets enzyme, GPCR, ion channel, and nuclear receptor, which were collated by Yamanishi et al. [21] from the BRENDA [22], KEGG [23,24], SuperTarget [25], and DrugBank [26] databases. After removing the redundant information, the numbers of DTI pairs contained in these datasets are 2926, 635, 1467, and 90, respectively. All of these pairs are constructed as positive datasets. Table 1 presents the statistical information for these gold-standard datasets.
The corresponding negative dataset is constructed as follows: firstly, all drug–target interaction pairs are split into their drug and target components; secondly, these drugs and targets are recombined into all possible drug–target pairs, and the pairs already known to interact are removed; finally, pairs are randomly selected from the remainder to construct a negative dataset of the same size as the positive dataset.
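The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `build_negative_pairs`, the toy identifiers, and the fixed random seed are all assumptions made for the example.

```python
import random

def build_negative_pairs(positive_pairs, seed=0):
    """Sample as many non-interacting drug-target pairs as there are positives."""
    positives = set(positive_pairs)
    # Step 1: split the known interactions into drug and target components.
    drugs = sorted({d for d, _ in positives})
    targets = sorted({t for _, t in positives})
    # Step 2: recombine every drug with every target, dropping known interactions.
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in positives]
    if len(candidates) < len(positives):
        raise ValueError("not enough unlabeled pairs to match the positive set")
    # Step 3: draw a random subset of the same size as the positive set.
    rng = random.Random(seed)
    return rng.sample(candidates, len(positives))

# Hypothetical toy data: three known interactions among three drugs and two targets.
positive_pairs = [("D1", "T1"), ("D2", "T2"), ("D3", "T1")]
negative_pairs = build_negative_pairs(positive_pairs)
```

Pairs sampled this way are unlabeled rather than proven non-interacting, so some of them may be true interactions that have not yet been observed.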

2.2. Drug Structure Characterization

We employed molecular fingerprints in this study to characterize the drug structures for the purpose of numerical conversion. The design idea of fingerprints is to characterize the molecular structure using the form of a dictionary collection of molecular fragments, which converts a drug molecule into a binary vector of values by determining whether certain fragments, i.e., molecular substructures, are present in the molecule. It first divides the molecular structure to obtain the structural fragments, and then encodes the fragments of these molecular structures into numbers according to certain rules and corresponds to each bit of the binary string, thus combining them as a whole (binary string) as a characterization of the molecular structure.
At present, the commonly used molecular fingerprints are the FP4, MACCS, Estate, and PubChem fingerprints, which contain 307, 166, 79, and 881 molecular structure fragments, respectively. In this experiment, molecular fingerprints from the PubChem database were selected to characterize the drug structure of DTIs. In this descriptor, the drug molecule is decomposed into 881 substructures. Given a drug, each bit is set to 1 or 0 depending on whether the corresponding molecular substructure is present. On the PubChem website the fingerprint is encoded in Base64, together with a text description of the binary form, and is available for download from https://pubchem.ncbi.nlm.nih.gov/ (accessed on 1 January 2018).
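The bit-per-substructure idea can be illustrated with a toy dictionary. The five fragment names below are hypothetical placeholders, not actual PubChem keys; the real descriptor follows the same pattern with 881 entries.

```python
def encode_fingerprint(present_substructures, dictionary):
    """Return a 0/1 list with one bit per dictionary entry."""
    present = set(present_substructures)
    return [1 if key in present else 0 for key in dictionary]

# Hypothetical five-entry dictionary; the PubChem descriptor has 881 entries.
toy_dictionary = ["benzene_ring", "carboxyl", "hydroxyl", "amide", "halogen"]
toy_drug = {"benzene_ring", "hydroxyl"}
bits = encode_fingerprint(toy_drug, toy_dictionary)
print(bits)  # [1, 0, 1, 0, 0]
```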

2.3. Target Protein Characterization

In the experiments, the Position-Specific Scoring Matrix (PSSM) was used to numerically characterize the target protein. The PSSM can effectively describe the evolutionary information of protein amino acids, and it is commonly used in protein secondary structure prediction [27], protein binding site prediction [28], disordered region prediction [29], and distantly related protein detection [30,31]. The PSSM is an $H \times 20$ matrix, where $H$ is the length of the protein and 20 is the number of amino acid types. The PSSM $Pssm = \{\Theta_{i,j} : i = 1, \ldots, H \text{ and } j = 1, \ldots, 20\}$ can be written as follows:
$$Pssm = \begin{bmatrix} \Theta_{1,1} & \Theta_{1,2} & \cdots & \Theta_{1,20} \\ \Theta_{2,1} & \Theta_{2,2} & \cdots & \Theta_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ \Theta_{H,1} & \Theta_{H,2} & \cdots & \Theta_{H,20} \end{bmatrix}$$
Here, the matrix element $\Theta_{i,j}$ indicates the probability that the $i$-th residue of the protein mutates to the $j$-th type of amino acid during the evolutionary process.
In the implementation, we utilized the Position-Specific Iterated BLAST (PSI-BLAST) [32] to calculate the PSSM by aligning each protein sequence against the SwissProt database. Following the previous study, we set the number of iterations and the e-value of the PSI-BLAST tool to 3 and 0.001, respectively, to obtain highly homologous sequences. The database and tool are available for download from http://blast.ncbi.nlm.nih.gov/Blast.cgi (accessed on 18 March 2002).
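As a rough illustration, the settings above could map onto a BLAST+ `psiblast` call as follows. This sketch only assembles the argument list and does not run anything. The file names and the database name are hypothetical, and treating the reported e-value as the `-evalue` option is an assumption, since the text does not say which threshold was set.

```python
def build_psiblast_command(query_fasta, database, pssm_out,
                           iterations=3, evalue=0.001):
    """Assemble (but do not run) a BLAST+ psiblast call that writes an ASCII PSSM."""
    return [
        "psiblast",
        "-query", query_fasta,             # protein sequence in FASTA format
        "-db", database,                   # a locally formatted protein database
        "-num_iterations", str(iterations),
        "-evalue", str(evalue),            # assumption: e-value used as -evalue
        "-out_ascii_pssm", pssm_out,       # text file holding the H x 20 matrix
    ]

# Hypothetical file and database names.
command = build_psiblast_command("target.fasta", "swissprot", "target.pssm")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` on a machine where BLAST+ and a formatted SwissProt database are installed.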

2.4. Feature Extraction

In the MSPEDTI model, the convolutional neural network (CNN) algorithm of deep learning is used to extract the hidden features of the protein. Deep learning can learn the intrinsic patterns and levels of representation of sample data, giving machines analytical learning capabilities that resemble those of humans. As one of the representative algorithms of deep learning, CNN can classify the input information in a translation-invariant manner through its hierarchical structure, thus mining the essential features of the data. Therefore, we introduced it into MSPEDTI to strengthen the prediction capability of the model.
CNN is a feedforward neural network whose artificial neurons respond to a portion of the surrounding units in their coverage area, including input, convolutional, pooling, sampling, fully connected, and output layers. With its special structure of local weight sharing, CNN has unique advantages in feature extraction, and its layout is closer to that of an actual biological neural network. Weight sharing reduces the complexity of the network; in particular, multidimensional input vectors can be fed directly into the network, which avoids the complexity of data reconstruction during feature extraction and classification. The structure diagram of CNN is shown in Figure 1. Assuming that $C_i$ is the feature map of the $i$-th layer, it can be described as:
$$C_i = g(C_{i-1} \cdot W_i + b_i)$$
Here, the operator $\cdot$ indicates the convolution operation, $b_i$ indicates the offset vector, $W_i$ indicates the weight matrix of the $i$-th layer convolution kernel, and $g(x)$ indicates the activation function. The subsampling layer follows the convolutional layer and samples the feature map according to specific rules. Let $C_i$ be the subsampling layer, with the following sampling rule:
$$C_i = subsampling(C_{i-1})$$
After multiple rounds of convolution and sampling, the features are classified by the fully connected layer to yield the data distribution $\Gamma$ of the original input. Fundamentally, CNN can be regarded as a mathematical model that uses multilevel dimensional transformations to transform the original data $C_0$ into a new feature representation $\Gamma$.
$$\Gamma(i) = Map(P = p_i \mid C_0; (W, b))$$
Here, $\Gamma$ represents the feature representation, $p_i$ indicates the $i$-th label class, and $C_0$ represents the original data.
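The convolution and subsampling rules can be made concrete with a small NumPy sketch. It is illustrative only: the kernel, bias, ReLU activation, and 2×2 max pooling are assumptions, since the text does not specify the architecture, and the sliding-window product is a cross-correlation, which is what most deep learning frameworks compute under the name convolution.

```python
import numpy as np

def conv2d_valid(feature_map, kernel, bias):
    """Slide the kernel over the map (no padding, stride 1) and add the bias."""
    kh, kw = kernel.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(feature_map[r:r + kh, c:c + kw] * kernel) + bias
    return out

def relu(x):
    """Activation g(x): keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

c0 = np.arange(25, dtype=float).reshape(5, 5)   # stand-in for the input C_0
w1 = np.array([[1.0, 0.0], [0.0, -1.0]])         # one hypothetical 2x2 kernel W_1
c1 = relu(conv2d_valid(c0, w1, bias=7.0))        # C_1 = g(C_0 . W_1 + b_1)
c2 = max_pool(c1)                                # C_2 = subsampling(C_1)
```

On this input every 2×2 window yields −6 before the bias is added, so `c1` is a 4×4 map of ones and `c2` is a 2×2 map of ones.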
Minimizing the loss function $H(W, b)$ is the ultimate goal of CNN training. To limit overfitting, training typically minimizes a regularized loss $L(W, b)$ that adds a norm penalty to $H(W, b)$, with the parameter $\theta$ controlling the strength of the penalty:
$$L(W, b) = H(W, b) + \frac{\theta}{2} W^T W$$
During training, CNNs normally update the network parameters $(W, b)$ layer by layer by gradient descent, with the learning rate $\varepsilon$ controlling the step size of backpropagation:
$$W_i = W_i - \varepsilon \frac{\partial E(W, b)}{\partial W_i}$$
$$b_i = b_i - \varepsilon \frac{\partial E(W, b)}{\partial b_i}$$
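A minimal sketch of one such update is given below for a toy linear model rather than a CNN. The mean squared error used for $H$, the synthetic data, and the values of `theta` and `lr` are assumptions chosen only to show how the penalty term enters the gradient.

```python
import numpy as np

def regularized_step(w, b, x, y, theta=0.1, lr=0.05):
    """One gradient-descent update for L = H + (theta / 2) * w.w, with H = mean squared error."""
    residual = x @ w + b - y
    grad_w = 2.0 * x.T @ residual / len(y) + theta * w   # dH/dw plus the penalty gradient
    grad_b = 2.0 * residual.mean()                        # the penalty does not involve b
    return w - lr * grad_w, b - lr * grad_b

def loss(w, b, x, y, theta=0.1):
    """Regularized loss L(w, b) for the toy linear model."""
    return np.mean((x @ w + b - y) ** 2) + 0.5 * theta * w @ w

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))                       # synthetic inputs
y = x @ np.array([1.0, -2.0, 0.5]) + 0.3           # synthetic targets
w, b = np.zeros(3), 0.0
before = loss(w, b, x, y)
for _ in range(200):
    w, b = regularized_step(w, b, x, y)
after = loss(w, b, x, y)
```

A larger `theta` pulls the weights more strongly toward zero, trading training fit for a smoother model.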

2.5. Classification Prediction

The extreme learning machine (ELM) [33] is employed by MSPEDTI as a classifier to predict potentially associated DTIs. The ELM is a simple and effective single-hidden layer feedforward neural network learning algorithm that does not need to adjust the input weights of the network and the bias of the hidden elements during the execution and produces a unique optimal solution, so it has the advantages of fast learning and good generalization performance.
Given $L$ labeled input samples $(X_i, P_i)$, an ELM with $N$ hidden neurons can be formulated as:
$$\sum_{i=1}^{N} V_i \, g(W_i \cdot X_j + b_i) = O_j, \quad j = 1, \ldots, L$$
where $X_i = [x_{i1}, x_{i2}, \ldots, x_{iL}]^T \in \mathbb{R}^L$, $P_i = [P_{i1}, P_{i2}, \ldots, P_{im}]^T \in \mathbb{R}^m$, $g(x)$ indicates the activation function, $V_i$ indicates the output weight, $W_i = [w_{i1}, w_{i2}, \ldots, w_{iL}]^T$ stands for the input weight, $W_i \cdot X_j$ stands for the inner product of $W_i$ and $X_j$, and $b_i$ stands for the offset of the $i$-th neuron.
To minimize the output error, i.e., to reach the training goal $\sum_{j=1}^{L} \| O_j - P_j \| = 0$, the ELM must find parameters $V_i$, $W_i$, and $b_i$ such that:
$$\sum_{i=1}^{N} V_i \, g(W_i \cdot X_j + b_i) = P_j, \quad j = 1, \ldots, L$$
The equation can be simplified as follows:
$$S V = P$$
$$S = \begin{bmatrix} g(W_1 \cdot X_1 + b_1) & \cdots & g(W_N \cdot X_1 + b_N) \\ \vdots & \ddots & \vdots \\ g(W_1 \cdot X_L + b_1) & \cdots & g(W_N \cdot X_L + b_N) \end{bmatrix}_{L \times N}, \quad V = \begin{bmatrix} V_1^T \\ \vdots \\ V_N^T \end{bmatrix}_{N \times m}, \quad P = \begin{bmatrix} P_1^T \\ \vdots \\ P_L^T \end{bmatrix}_{L \times m}$$
Here, $V$ is the output weight matrix, $P$ is the expected output, and $S$ is the output of the hidden layer neurons. To gain optimal performance, we want the ELM to acquire $\hat{W}_i$, $\hat{b}_i$, and $\hat{V}_i$ such that:
$$\| S(\hat{W}_i, \hat{b}_i) \hat{V}_i - P \| = \min_{W, b, V} \| S(W_i, b_i) V_i - P \|, \quad i = 1, 2, \ldots, N$$
This is equivalent to minimizing the loss function
$$E = \sum_{j=1}^{L} \left( \sum_{i=1}^{N} V_i \, g(W_i \cdot X_j + b_i) - P_j \right)^2$$
By the principle of the ELM algorithm, once the input weight $W_i$ and the offset $b_i$ of the hidden layer are fixed, the hidden layer output matrix $S$ is uniquely determined. Therefore, training the ELM reduces to solving the linear system $S V = P$, which has a unique minimum-norm least-squares solution.
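The procedure can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the sigmoid activation, the Gaussian initialization, the hidden layer size, and the toy data are all chosen for the example, and the output weights are obtained with the Moore–Penrose pseudoinverse, which gives the minimum-norm least-squares solution of $SV = P$.

```python
import numpy as np

class ELM:
    """Single-hidden-layer network with random, fixed input weights."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, x):
        # S = g(X W + b) with a sigmoid activation g (an assumption).
        return 1.0 / (1.0 + np.exp(-(x @ self.w + self.b)))

    def fit(self, x, p):
        # Input weights and offsets are drawn once and never trained.
        self.w = self.rng.normal(size=(x.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        s = self._hidden(x)
        # Minimum-norm least-squares solution of S V = P.
        self.v = np.linalg.pinv(s) @ p
        return self

    def predict(self, x):
        return self._hidden(x) @ self.v

rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))                    # toy inputs
labels = (x[:, 0] * x[:, 1] > 0).astype(int)     # XOR-like toy labels
p = np.eye(2)[labels]                            # one-hot targets, shape (L, m)
model = ELM(n_hidden=50).fit(x, p)
train_accuracy = np.mean(model.predict(x).argmax(axis=1) == labels)
```

Because only the output weights are solved for, training amounts to a single linear least-squares problem, which is the source of the speed advantage described above.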

3. Results

3.1. Evaluation Indicators

We measured the performance of MSPEDTI using evaluation indicators calculated by five-fold cross-validation (5CV). The 5CV approach first splits the whole dataset $D$ into five subsets $D_1, \ldots, D_5$ that are roughly equal in size and do not intersect. When subset $D_i$ is tested, the remaining subsets $D \setminus D_i$ are fed into the classifier as the training set. This operation is repeated until all subsets have been tested. The performance of MSPEDTI was evaluated by the mean and standard deviation of the five experiments. The evaluation indicators calculated through 5CV are described by the following equations.
$$Accu. = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Sen. = \frac{TP}{TP + FN}$$
$$Spec. = \frac{TN}{TN + FP}$$
$$Prec. = \frac{TP}{TP + FP}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where $TP$ means true positive, $TN$ means true negative, $FP$ means false positive, and $FN$ means false negative. Additionally, we plotted the receiver operating characteristic (ROC) curve generated by 5CV and calculated the area under the curve (AUC) [34,35].
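These indicators follow directly from the four confusion counts, as the short sketch below shows. The counts passed in at the end are hypothetical and are not taken from the paper's tables.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity, precision and MCC from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # MCC is conventionally reported as 0 when any marginal count is empty.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }

# Hypothetical counts for a balanced test fold of 200 pairs.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

For these counts the accuracy is 0.85, the sensitivity 0.90, the specificity 0.80, the precision about 0.818, and the MCC about 0.704.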
The ROC curve is an essential tool for assessing the comprehensive performance of the model, as it visualizes the trade-off between specificity and sensitivity. It is obtained by applying multiple thresholds to the continuous prediction score, computing the specificity and sensitivity at each threshold, and then plotting 1 − specificity on the abscissa against sensitivity on the ordinate.
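The threshold sweep can be written out in plain Python as a sketch. The scores and labels below are made up for illustration, and the area is computed with the trapezoidal rule, which is one common choice and not necessarily the one used in the paper.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical prediction scores with their true labels (1 = interacting pair).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

Here eight of the nine positive–negative pairs are ranked correctly, so the area is 8/9, or roughly 0.889.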

3.2. Assessment of Performance

The gold-standard datasets enzymes, ion channels, GPCRs, and nuclear receptors were used to measure the capabilities of MSPEDTI in the experiment. The detailed 5CV outcomes obtained by MSPEDTI on these datasets are listed in Table 2, Table 3, Table 4 and Table 5, respectively. These tables show that MSPEDTI achieved prediction accuracies of 94.19%, 90.95%, 87.95%, and 86.11%, with standard deviations of 0.41%, 1.10%, 1.51%, and 4.39%, respectively. On the enzyme dataset, the accuracy of all five folds was at least 93.85%, with the highest reaching 94.87%; the individual fold accuracies were 94.87%, 94.27%, 93.85%, 94.02%, and 93.94%. On MCC, which measures classification performance, MSPEDTI achieved 88.51%, 81.95%, 76.41%, and 72.46%, with standard deviations of 0.89%, 2.24%, 2.88%, and 8.97%, respectively. On the comprehensive performance index AUC, MSPEDTI reached 94.37%, 90.88%, 88.02%, and 86.63%, with standard deviations of 0.59%, 0.97%, 2.88%, and 4.77%, respectively. MSPEDTI also yielded satisfactory outcomes in terms of sensitivity and precision. The ROC curves produced by MSPEDTI for 5CV on the four gold-standard datasets are shown in Figure 2, Figure 3, Figure 4 and Figure 5.
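As a quick arithmetic check, the five per-fold values listed for the enzyme dataset reproduce the reported mean accuracy and its standard deviation. The 0.41% figure matches the sample (n − 1) estimator, which suggests, although the text does not say so, that this estimator was used.

```python
import statistics

fold_accuracies = [94.87, 94.27, 93.85, 94.02, 93.94]  # enzyme dataset, in percent
mean_accuracy = statistics.mean(fold_accuracies)
sample_std = statistics.stdev(fold_accuracies)         # divides by n - 1
print(f"{mean_accuracy:.2f} +/- {sample_std:.2f}")     # 94.19 +/- 0.41
```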

3.3. Comparison of Different Descriptor Model

To estimate the impact of feature descriptors on MSPEDTI performance, we compared it with the two-dimensional principal component analysis (2DPCA) descriptor model. 2DPCA is an extension of the principal component analysis algorithm [36] that does not need to convert the raw data into one-dimensional vectors; it operates on the matrices themselves, which is equivalent to removing the correlation among the row vectors or the column vectors of the matrix. It can therefore compute the covariance matrix directly from the training sample matrices, and it has the advantage of calculating the feature vectors quickly.
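For readers unfamiliar with 2DPCA, a minimal NumPy sketch of the standard formulation is given below. It is not the configuration used in the comparison: the matrix sizes, the number of retained components, and the random toy data are all assumptions.

```python
import numpy as np

def fit_2dpca(matrices, n_components):
    """Return the top eigenvectors of the image covariance matrix as columns."""
    stack = np.asarray(matrices, dtype=float)          # shape (n_samples, rows, cols)
    centered = stack - stack.mean(axis=0)
    # G = mean over samples of (A - mean)^T (A - mean), shape (cols, cols).
    covariance = np.einsum("nij,nik->jk", centered, centered) / len(stack)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # ascending eigenvalues
    return eigenvectors[:, ::-1][:, :n_components]

def transform_2dpca(matrix, projection):
    """Project each row of the matrix onto the retained axes: Y = A X."""
    return np.asarray(matrix, dtype=float) @ projection

rng = np.random.default_rng(0)
samples = rng.normal(size=(30, 12, 20))      # 30 toy matrices with 20 columns each
projection = fit_2dpca(samples, n_components=4)
features = transform_2dpca(samples[0], projection)
```

The covariance matrix here is only 20 × 20, whereas flattening each 12 × 20 sample for ordinary PCA would require a 240 × 240 covariance matrix, which illustrates the speed advantage mentioned above.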
To validate the representation capability of the features extracted by CNN, we compared it with the 2DPCA descriptor on the ion channel dataset. In the interest of fairness, the other modules in MSPEDTI were kept unchanged, and only the feature extraction module was replaced. The 5CV results produced by the two descriptor models on the ion channel dataset are shown in Table 6, in which it can be observed that the MSPEDTI-generated results are higher than the 2DPCA descriptor model. The experimental outcomes of the contrast indicated that the CNN algorithm extracts the features better than the 2DPCA algorithm in our model. Figure 6 shows the ROC curve plotted on the ion channel by utilizing the 2DPCA descriptor method.

3.4. Comparison with Different Classifier Model

To validate whether the classifier helps to improve the performance of MSPEDTI, we compared it with an SVM classifier model on the same dataset. The learning strategy of SVM is to maximize the margin between the classes, which converts training into a convex quadratic programming problem [37,38]. As in the ablation experiments for the descriptor model, we only replaced the ELM classifier with the SVM classifier and left the other modules unchanged.
Table 7 presents the 5CV experimental outcomes of MSPEDTI and the SVM classifier model on the ion channel dataset. The SVM classifier model performs well, with accuracy, AUC, MCC, precision, and sensitivity of 86.48%, 86.64%, 73.05%, 83.86%, and 89.05%, respectively. However, it still trails the ELM classifier: these values are lower by 4.47, 4.24, 8.90, 7.90, and 1.26 percentage points, respectively. These results indicate that the ELM classifier does help to improve the prediction performance of MSPEDTI. Figure 7 shows the ROC curve obtained on the ion channel dataset with the SVM classifier model.

3.5. Comparison with Previous Approaches

We compared MSPEDTI with previous methods on the gold-standard datasets to assess its ability to predict DTIs in a more intuitive way. We chose AUC as the evaluation criterion because it best reflects the overall capability of a model. The AUC values of the previous methods, including Yamanishi [4], DBSI [39], KBMF2K [40], Temerinac-Ott [41], NLCS [42], WNN-GIP [43], SIMCOMP [42], and NetCBP [44], are aggregated in Table 8. The table shows that MSPEDTI obtained the best AUC on all four gold-standard datasets. This suggests that the strategy of combining the CNN algorithm with the ELM classifier used by MSPEDTI can enhance the ability to predict DTIs.

3.6. Case Studies

To further verify the ability of MSPEDTI to predict new pairs, we trained it using all available data and predicted unknown DTIs with the trained model. We then searched the SuperTarget database [25] for the 10 predicted DTI pairs with the highest scores. SuperTarget is a publicly available, widely used database that stores information about DTIs, and it currently collects 332,828 DTIs. Table 9 lists the ten DTIs with the highest predictive scores, seven of which were confirmed in the SuperTarget database. These outcomes indicate that MSPEDTI is capable of predicting new DTIs. Notably, although the remaining three DTIs were not found in the current database, an interaction between them cannot be ruled out.

4. Discussion

Accurate identification of the target protein of the drug can improve the efficacy of the drug and reduce side effects, thereby improving people’s health. In the current study, we presented a model MSPEDTI to predict DTI on the basis of protein evolution and molecular structures. The model takes full advantage of the protein evolutionary information and drug molecular information and uses a deep learning algorithm to mine the deep association between them. The experimental outcomes in the four gold-standard datasets revealed that the MSPEDTI model has outstanding performance.
However, there are still some shortcomings in our method: firstly, the number of DTIs known at present is still relatively small, and the model cannot be trained adequately; secondly, the parameters of the deep learning algorithm used in the model need to be further optimized to avoid overfitting in some cases; finally, how to integrate more biological information into the model is still worth further study.

5. Conclusions

In the present work, we designed a deep learning model, MSPEDTI, for predicting DTIs on the basis of drug structure and protein evolution information. The model mines hidden features in protein evolutionary information with a CNN, combines them with drug molecular fingerprint features, and uses an ELM to efficiently predict potential DTIs. The model attained good 5CV results on the gold-standard datasets enzymes, GPCRs, ion channels, and nuclear receptors. To evaluate whether the modules used by MSPEDTI help to boost model performance, we carried out ablation experiments that compared them with other descriptor and classifier models. Furthermore, 7 of the 10 DTIs predicted by MSPEDTI were confirmed in the SuperTarget database. These results indicate that MSPEDTI predicts DTIs well and can provide reliable candidate targets for drug research. In the next step of our research, we will try to optimize the deep learning feature extraction method to mine more useful information from the raw data.

Author Contributions

Conceptualization, L.W. (Lei Wang), L.W. (Leon Wong) and Z.-H.C.; methodology, J.H., X.-F.S. and Y.L.; writing—original draft preparation, L.W. (Lei Wang), writing—review and editing, L.W. (Leon Wong) and Z.-H.Y.; investigation, L.W. (Lei Wang); funding acquisition, L.W. (Lei Wang) and Z.-H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China, under Grants 62172355 and 61702444, in part by the Tianshan Youth—Excellent Youth, under Grant 2019Q029, in part by the West Light Foundation of the Chinese Academy of Sciences, under Grant 2018-XBQNXZ-B-008, and in part by the Qingtan Scholar Talent Project of Zaozhuang University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank all anonymous reviewers for their constructive advice.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mamoshina, P.; Volosnikova, M.; Ozerov, I.V.; Putin, E.; Skibina, E.; Cortese, F.; Zhavoronkov, A. Machine learning on human muscle transcriptomic data for biomarker discovery and tissue-specific drug target identification. Front. Genet. 2018, 9, 242. [Google Scholar] [CrossRef] [PubMed]
  2. Xuan, P.; Sun, C.; Zhang, T.; Ye, Y.; Shen, T.; Dong, Y. Gradient boosting decision tree-based method for predicting interactions between target genes and drugs. Front. Genet. 2019, 10, 459. [Google Scholar] [CrossRef] [PubMed]
  3. Landry, Y.; Gies, J.-P. Drugs and their molecular targets: An updated overview. Fundam. Clin. Pharmacol. 2008, 22, 1–18. [Google Scholar] [CrossRef] [PubMed]
  4. Yamanishi, Y.; Kotera, M.; Kanehisa, M.; Goto, S. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics 2010, 26, i246–i254. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, L.; You, Z.H.; Chen, X.; Li, J.Q.; Yan, X.; Zhang, W.; Huang, Y.A. An ensemble approach for large-scale identification of protein-protein interactions using the alignments of multiple sequences. Oncotarget 2017, 8, 5149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Zhu, S.; Bing, J.; Min, X.; Lin, C.; Zeng, X. Prediction of drug–gene interaction by using Metapath2vec. Front. Genet. 2018, 9, 248. [Google Scholar] [CrossRef]
  7. Wang, L.; You, Z.-H.; Zhou, X.; Yan, X.; Li, H.-Y.; Huang, Y.-A. NMFCDA: Combining randomization-based neural network with non-negative matrix factorization for predicting CircRNA-disease association. Appl. Soft Comput. 2021, 110, 107629. [Google Scholar] [CrossRef]
  8. Wang, L.; Yan, X.; You, Z.-H.; Zhou, X.; Li, H.-Y.; Huang, Y.-A. SGANRDA: Semi-supervised generative adversarial networks for predicting circRNA–disease associations. Brief. Bioinform. 2021, 22, bbab028. [Google Scholar] [CrossRef]
  9. Casañola-Martin, G.M.; Marrero-Ponce, Y.; Khan, M.T.H.; Khan, S.B.; Torrens, F.; Pérez-Jiménez, F.; Rescigno, A.; Abad, C. Bond-Based 2D Quadratic Fingerprints in QSAR Studies: Virtual and In vitro Tyrosinase Inhibitory Activity Elucidation. Chem. Biol. Drug Des. 2010, 76, 538–545. [Google Scholar] [CrossRef]
  10. Kar, S.; Roy, K. Development and validation of a robust QSAR model for prediction of carcinogenicity of drugs. Indian J. Biochem. Biophys. 2011, 48, 111–122.
  11. Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 1996, 261, 470–489.
  12. Wallach, I.; Jaitly, N.; Nguyen, K.; Schapira, M.; Lilien, R. Normalizing molecular docking rankings using virtually generated decoys. J. Chem. Inf. Modeling 2011, 51, 1817.
  13. Wang, L.; You, Z.H.; Chen, X.; Yan, X.; Liu, G.; Zhang, W. RFDT: A Rotation Forest-based Predictor for Predicting Drug-Target Interactions Using Drug Structure and Protein Sequence Information. Curr. Protein Pept. Sci. 2018, 19, 445–454.
  14. Zhao, L.; Wang, J.; Pang, L.; Liu, Y.; Zhang, J. GANsDTA: Predicting Drug-Target Binding Affinity Using GANs. Front. Genet. 2019, 10, 1243.
  15. Yang, X.; Kui, L.; Tang, M.; Li, D.; Wei, K.; Chen, W.; Miao, J.; Dong, Y. High-throughput transcriptome profiling in drug and biomarker discovery. Front. Genet. 2020, 11, 19.
  16. Wang, L.; You, Z.-H.; Huang, D.-S.; Li, J.-Q. MGRCDA: Metagraph Recommendation Method for Predicting CircRNA-Disease Association. In IEEE Transactions on Cybernetics; IEEE: Piscataway, NJ, USA, 2021; pp. 1–9.
  17. Wang, L.; You, Z.-H.; Li, J.-Q.; Huang, Y.-A. IMS-CDA: Prediction of CircRNA-Disease Associations From the Integration of Multisource Similarity Information With Deep Stacked Autoencoder Model. In IEEE Transactions on Cybernetics; IEEE: Piscataway, NJ, USA, 2020.
  18. Li, H.-Y.; You, Z.-H.; Wang, L.; Yan, X.; Li, Z.-W. DF-MDA: An effective diffusion-based computational model for predicting miRNA-disease association. Mol. Ther. 2021, 29, 1501–1511.
  19. Lan, W.; Wang, J.; Li, M.; Wu, F.-X.; Pan, Y. Predicting drug-target interaction based on sequence and structure information. IFAC-PapersOnLine 2015, 48, 12–16.
  20. Cao, D.-S.; Liu, S.; Xu, Q.-S.; Lu, H.-M.; Huang, J.-H.; Hu, Q.-N.; Liang, Y.-Z. Large-scale prediction of drug–target interactions using protein sequences and drug topological structures. Anal. Chim. Acta 2012, 752, 1–10.
  21. Yamanishi, Y.; Araki, M.; Gutteridge, A.; Honda, W.; Kanehisa, M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 2008, 24, I232–I240.
  22. Schomburg, I.; Chang, A.; Ebeling, C.; Gremse, M.; Heldt, C.; Huhn, G.; Schomburg, D. BRENDA, the enzyme database: Updates and major new developments. Nucleic Acids Res. 2004, 32, D431–D433.
  23. Kanehisa, M.; Goto, S.; Hattori, M.; Aoki-Kinoshita, K.F.; Itoh, M.; Kawashima, S.; Katayama, T.; Araki, M.; Hirakawa, M. From genomics to chemical genomics: New developments in KEGG. Nucleic Acids Res. 2006, 34, D354–D357.
  24. Kanehisa, M.; Goto, S.; Furumichi, M.; Tanabe, M.; Hirakawa, M. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2009, 38 (Suppl. 1), D355–D360.
  25. Gunther, S.; Kuhn, M.; Dunkel, M.; Campillos, M.; Senger, C.; Petsalaki, E.; Ahmed, J.; Urdiales, E.G.; Gewiess, A.; Jensen, L.J.; et al. SuperTarget and Matador: Resources for exploring drug-target relationships. Nucleic Acids Res. 2008, 36, D919–D922.
  26. Wishart, D.S.; Knox, C.; Guo, A.C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: A knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36, D901–D906.
  27. Jones, D.T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999, 292, 195–202.
  28. Chen, X.-W.; Jeong, J.C. Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 2009, 25, 585–591.
  29. Jones, D.T.; Ward, J.J. Prediction of disordered regions in proteins from position specific score matrices. Proteins Struct. Funct. Bioinform. 2003, 53, 573–578.
  30. Gao, Z.G.; Wang, L.; Xia, S.X.; You, Z.H.; Yan, X.; Zhou, Y. Ens-PPI: A Novel Ensemble Classifier for Predicting the Interactions of Proteins Using Autocovariance Transformation from PSSM. Biomed Res. Int. 2016, 2016, 8.
  31. Wang, L.; You, Z.-H.; Xia, S.-X.; Chen, X.; Yan, X.; Zhou, Y.; Liu, F. An improved efficient rotation forest algorithm to predict the interactions among proteins. Soft Comput. 2017, 22, 3373–3381.
  32. Altschul, S.F.; Madden, T.L.; Schaffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
  33. Huang, G.B.; Wang, D.H.; Lan, Y. Extreme learning machines: A survey. Int. J. Mach. Learn. Cybern. 2011, 2, 107–122.
  34. Wang, L.; You, Z.-H.; Yan, X.; Xia, S.-X.; Liu, F.; Li, L.-P.; Zhang, W.; Zhou, Y. Using Two-dimensional Principal Component Analysis and Rotation Forest for Prediction of Protein-Protein Interactions. Sci. Rep. 2018, 8, 12874.
  35. Ghadermarzi, S.; Li, X.; Li, M.; Kurgan, L. Sequence-Derived Markers of Drug Targets and Potentially Druggable Human Proteins. Front. Genet. 2019, 10, 1075.
  36. Yang, J.; Zhang, D.; Frangi, A.F.; Yang, J.Y. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 131–137.
  37. Cao, D.-S.; Liang, Y.-Z.; Xu, Q.-S.; Hu, Q.-N.; Zhang, L.-X.; Fu, G.-H. Exploring nonlinear relationships in chemical data using kernel-based methods. Chemom. Intell. Lab. Syst. 2011, 107, 106–115.
  38. Cao, D.-S.; Xu, Q.-S.; Liang, Y.-Z.; Chen, X.; Li, H.-D. Prediction of aqueous solubility of druglike organic compounds using partial least squares, back-propagation network and support vector machine. J. Chemom. 2010, 24, 584–595.
  39. Cheng, F.; Liu, C.; Jiang, J.; Lu, W.; Li, W.; Liu, G.; Zhou, W.; Huang, J.; Tang, Y. Prediction of Drug-Target Interactions and Drug Repositioning via Network-Based Inference. PLoS Comput. Biol. 2012, 8, e1002503.
  40. Gonen, M. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics 2012, 28, 2304–2310.
  41. Temerinac-Ott, M.; Naik, A.W.; Murphy, R.F. Deciding when to stop: Efficient experimentation to learn to predict drug-target interactions. BMC Bioinform. 2015, 16, 1–10.
  42. Öztürk, H.; Ozkirimli, E.; Özgür, A. A comparative study of SMILES-based compound similarity functions for drug-target interaction prediction. BMC Bioinform. 2016, 17, 1–11.
  43. Van, L.T.; Marchiori, E. Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile. PLoS ONE 2013, 8, e66952.
  44. Chen, H.; Zhang, Z. A Semi-Supervised Method for Drug-Target Interaction Prediction with Consistency in Networks. PLoS ONE 2013, 8, e62975.
Figure 1. Schematic diagram of the structure of CNN.
Figure 2. ROC curves obtained by MSPEDTI under five-fold cross-validation (5CV) on the enzyme dataset.
Figure 3. ROC curves obtained by MSPEDTI under five-fold cross-validation (5CV) on the ion channel dataset.
Figure 4. ROC curves obtained by MSPEDTI under five-fold cross-validation (5CV) on the GPCR dataset.
Figure 5. ROC curves obtained by MSPEDTI under five-fold cross-validation (5CV) on the nuclear receptor dataset.
Figure 6. ROC curves plotted by the 2DPCA descriptor model on the ion channel dataset.
Figure 7. ROC curves plotted by the SVM classifier model on the ion channel dataset.
Table 1. Statistical information for the four gold-standard datasets: the number of target proteins, drugs, and interaction pairs. Sparsity is the ratio of positive DTIs to all possible interactions.

| Dataset | Target Proteins | Drugs | Interactions | Sparsity |
|---|---|---|---|---|
| Enzymes | 664 | 445 | 2926 | 0.0099 |
| Ion Channels | 204 | 210 | 1467 | 0.0344 |
| GPCRs | 95 | 223 | 635 | 0.0299 |
| Nuclear Receptors | 26 | 54 | 90 | 0.0641 |
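
The sparsity definition in the Table 1 caption can be illustrated with a minimal sketch. It assumes that "all possible interactions" means the number of target proteins multiplied by the number of drugs; the function name `sparsity` and the dictionary layout are hypothetical and only serve the illustration. The counts are the ones read from Table 1, and recomputed values may differ from the tabulated ones in the last digits.

```python
# Counts as read from Table 1: (target proteins, drugs, known interactions).
DATASETS = {
    "Enzymes": (664, 445, 2926),
    "Ion Channels": (204, 210, 1467),
    "GPCRs": (95, 223, 635),
    "Nuclear Receptors": (26, 54, 90),
}


def sparsity(n_targets: int, n_drugs: int, n_interactions: int) -> float:
    """Ratio of known (positive) pairs to all possible drug-target pairs.

    Assumption: every target can in principle pair with every drug,
    so the number of possible pairs is n_targets * n_drugs.
    """
    return n_interactions / (n_targets * n_drugs)


for name, (targets, drugs, interactions) in DATASETS.items():
    print(f"{name}: {sparsity(targets, drugs, interactions):.4f}")
```

The low ratios show that known interactions cover only a small fraction of the drug–target grid in every dataset.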
Table 2. MSPEDTI outcomes for 5CV on enzyme dataset.

| Test Set | Accu. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 94.87 | 91.23 | 98.75 | 90.04 | 95.12 |
| 2 | 94.27 | 93.14 | 95.26 | 88.57 | 94.77 |
| 3 | 93.85 | 89.80 | 97.78 | 87.99 | 94.32 |
| 4 | 94.02 | 93.07 | 94.71 | 88.04 | 93.98 |
| 5 | 93.94 | 92.33 | 95.15 | 87.91 | 93.68 |
| Average | 94.19 ± 0.41 | 91.91 ± 1.41 | 96.33 ± 1.81 | 88.51 ± 0.89 | 94.37 ± 0.59 |
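
As a reading aid for the "Average" row, the following sketch recomputes the mean ± deviation for the accuracy column of Table 2. It assumes the reported spread is the sample standard deviation (n − 1 in the denominator); the source does not state this, but the assumption reproduces the tabulated accuracy entry. The helper name `mean_and_std` is hypothetical.

```python
import statistics

# Per-fold accuracy values (%) copied from Table 2 (enzyme dataset).
accuracy_per_fold = [94.87, 94.27, 93.85, 94.02, 93.94]


def mean_and_std(values):
    """Return the mean and the sample standard deviation (n - 1)."""
    return statistics.mean(values), statistics.stdev(values)


mean_acc, std_acc = mean_and_std(accuracy_per_fold)
summary = f"{mean_acc:.2f} ± {std_acc:.2f}"
print(summary)
```

Other columns may differ from the table in the last digit, since the per-fold values shown are already rounded.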
Table 3. MSPEDTI outcomes for 5CV on ion channel dataset.

| Test Set | Accu. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 90.17 | 88.44 | 91.55 | 80.38 | 89.99 |
| 2 | 89.83 | 90.70 | 89.51 | 79.65 | 90.14 |
| 3 | 92.20 | 90.26 | 94.56 | 84.50 | 91.66 |
| 4 | 90.51 | 91.86 | 89.44 | 81.05 | 90.46 |
| 5 | 92.06 | 90.27 | 93.73 | 84.18 | 92.15 |
| Average | 90.95 ± 1.10 | 90.31 ± 1.23 | 91.76 ± 2.36 | 81.95 ± 2.24 | 90.88 ± 0.97 |
Table 4. MSPEDTI outcomes for 5CV on GPCR dataset.

| Test Set | Accu. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 86.61 | 92.68 | 82.01 | 73.89 | 85.37 |
| 2 | 89.76 | 95.74 | 87.10 | 79.53 | 91.90 |
| 3 | 88.98 | 95.58 | 82.44 | 78.82 | 88.46 |
| 4 | 88.19 | 92.86 | 84.78 | 76.74 | 89.39 |
| 5 | 86.22 | 93.94 | 82.12 | 73.07 | 85.00 |
| Average | 87.95 ± 1.51 | 94.16 ± 1.45 | 83.69 ± 2.22 | 76.41 ± 2.88 | 88.02 ± 2.88 |
Table 5. MSPEDTI outcomes for 5CV on nuclear receptor dataset.

| Test Set | Accu. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 91.67 | 86.96 | 100.00 | 84.05 | 94.98 |
| 2 | 80.56 | 85.71 | 70.59 | 61.51 | 84.74 |
| 3 | 88.89 | 85.00 | 94.44 | 78.26 | 85.63 |
| 4 | 83.33 | 83.33 | 83.33 | 66.67 | 83.02 |
| 5 | 86.11 | 86.67 | 81.25 | 71.81 | 84.76 |
| Average | 86.11 ± 4.39 | 85.53 ± 1.45 | 85.92 ± 11.56 | 72.46 ± 8.97 | 86.63 ± 4.77 |
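
To make the column headers concrete, the sketch below computes accuracy, sensitivity, precision, and MCC from a binary confusion matrix using their standard textbook definitions. The function name `binary_metrics` and the confusion counts are hypothetical: the counts (15 true positives, 3 false positives, 15 true negatives, 3 false negatives) were chosen only because they yield the same percentages as test set 4 in Table 5, and they are not taken from the source.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics, returned as fractions in [-1, 1] or [0, 1]."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "mcc": mcc,
    }


# Hypothetical counts for a 36-pair test fold (not from the source).
metrics = binary_metrics(tp=15, fp=3, tn=15, fn=3)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With so few test pairs per fold, a single misclassified pair shifts each metric by several percentage points, which is consistent with the large deviations in the nuclear receptor results.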
Table 6. Comparison results of the 2DPCA descriptor model and MSPEDTI on ion channel.

| Test Set | Accu. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 84.75 | 84.90 | 84.90 | 69.49 | 86.41 |
| 2 | 82.03 | 82.31 | 80.00 | 64.02 | 81.24 |
| 3 | 82.37 | 82.84 | 82.84 | 64.72 | 83.35 |
| 4 | 80.68 | 84.23 | 78.93 | 61.47 | 81.22 |
| 5 | 82.77 | 82.00 | 83.67 | 65.56 | 83.12 |
| Average | 82.52 ± 1.47 | 83.26 ± 1.25 | 82.07 ± 2.52 | 65.05 ± 2.91 | 83.07 ± 2.12 |
| MSPEDTI | 90.95 ± 1.10 | 90.31 ± 1.23 | 91.76 ± 2.36 | 81.95 ± 2.24 | 90.88 ± 0.97 |
Table 7. Comparison outcomes of SVM model and MSPEDTI on ion channel.

| Test Set | Accu. (%) | Sen. (%) | Prec. (%) | MCC (%) | AUC (%) |
|---|---|---|---|---|---|
| 1 | 85.76 | 90.14 | 81.42 | 71.81 | 85.08 |
| 2 | 85.93 | 89.04 | 82.70 | 71.94 | 87.90 |
| 3 | 85.76 | 87.34 | 84.04 | 71.46 | 84.80 |
| 4 | 86.61 | 89.49 | 83.73 | 73.34 | 87.10 |
| 5 | 88.34 | 89.26 | 87.41 | 76.70 | 88.33 |
| Average | 86.48 ± 1.10 | 89.05 ± 1.04 | 83.86 ± 2.24 | 73.05 ± 2.16 | 86.64 ± 1.62 |
| MSPEDTI | 90.95 ± 1.10 | 90.31 ± 1.23 | 91.76 ± 2.36 | 81.95 ± 2.24 | 90.88 ± 0.97 |
Table 8. Comparison of AUC (%) with previous methods on the gold-standard datasets.

| Method | Enzymes | Ion Channels | GPCRs | Nuclear Receptors |
|---|---|---|---|---|
| SIMCOMP | 86.30 | 77.60 | 86.70 | 85.60 |
| NLCS | 83.70 | 75.30 | 85.30 | 81.50 |
| Temerinac-Ott | 83.20 | 79.90 | 85.70 | 82.40 |
| Yamanishi | 82.10 | 69.20 | 81.10 | 81.40 |
| KBMF2K | 83.20 | 79.90 | 85.70 | 82.40 |
| WNN-GIP | 86.10 | 77.50 | 87.20 | 83.90 |
| DBSI | 80.75 | 80.29 | 80.22 | 75.78 |
| NetCBP | 82.51 | 80.34 | 82.35 | 83.94 |
| MSPEDTI | 94.37 | 90.88 | 88.02 | 86.63 |
Table 9. Top 10 DTI pairs predicted by MSPEDTI.

| Drug ID | Drug Name | Target Protein ID | Target Protein Name | Validation Source |
|---|---|---|---|---|
| D00951 | Medroxyprogesterone acetate | hsa2099 | ESR1_HUMAN | SuperTarget |
| D00542 | Bromochlorotrifluoroethane | hsa1571 | CP2E1_HUMAN | SuperTarget |
| D03365 | Transdermal Nicotine | hsa1137 | ACHA4_HUMAN | SuperTarget |
| D00049 | Nikotinsaeure | hsa8843 | G109B_HUMAN | SuperTarget |
| D00160 | Epsilcapramine | hsa7298 | TYSY_HUMAN | unconfirmed |
| D00771 | Chlorzoxazone | hsa1374 | CPT1A_HUMAN | unconfirmed |
| D00139 | Xanthotoxine | hsa1543 | CP1A1_HUMAN | SuperTarget |
| D00964 | Letrozole | hsa1215 | CMA1_HUMAN | unconfirmed |
| D00585 | Mifepristone | hsa2099 | ESR1_HUMAN | SuperTarget |
| D00437 | Nifedipine Monohydrochloride | hsa1559 | CP2C9_HUMAN | SuperTarget |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Wang, L.; Wong, L.; Chen, Z.-H.; Hu, J.; Sun, X.-F.; Li, Y.; You, Z.-H. MSPEDTI: Prediction of Drug–Target Interactions via Molecular Structure with Protein Evolutionary Information. Biology 2022, 11, 740. https://doi.org/10.3390/biology11050740

