SmileGNN: Drug-Drug Interaction Prediction Based on SMILES and Graph Neural Network

Background: The use of multiple drugs at the same time can lead to unexpected adverse drug reactions. Interactions between drugs can be confirmed by routine in vitro and clinical trials, but it is difficult to test drug-drug interactions widely and effectively before a drug is put on the market. Therefore, the prediction of drug-drug interactions has become an important research topic in the biomedical field. Results: In recent years, researchers have used deep learning to predict drug-drug interactions from drug structural features and graph theory, with considerable success. This paper proposes SmileGNN, a drug-drug interaction prediction model. The structural features of drugs are constructed from SMILES data. The topological features of drugs in a knowledge graph are obtained by a graph neural network. The structural and topological features of drugs are aggregated to predict the interaction of new drug pairs. Conclusions: The experimental results show that the proposed model combines a variety of data sources and achieves better prediction performance than existing drug-drug interaction prediction models. The most striking result is that five of the top ten newly predicted drug interactions are verified by the latest database, which supports the credibility of SmileGNN.


Introduction
Drug-drug Interaction (DDI) is one of the focuses of biomedical research. For many diseases with complex pathways of action, single drugs may not be ideal for treatment. One solution is combination drug therapy, which uses several drugs at the same time. For instance, Venetoclax and Idasanutlin are used together to treat leukemia: Venetoclax inhibits the anti-apoptotic Bcl-2 family protein and Idasanutlin activates the p53 pathway, and the combination is effective [1]. However, concurrent use of multiple drugs may lead to Adverse Drug Events (ADEs) [2,3]. DDIs can be confirmed by routine in vitro and clinical trials, but it is difficult to test for DDIs extensively and effectively before drugs are marketed, because testing every pair of drugs is infeasible given the large number of drugs and the time and cost of validation. At the same time, because ADEs are not always reported and counted promptly after they occur, there are relatively few documented and verified DDIs compared with the large number of drugs.
At present, DDI prediction methods fall mainly into two categories: the drug structural feature based approach and the graph-based approach.
The drug structural feature based approach assumes that chemically similar drugs have similar DDIs. Ryu et al. [4] proposed the DeepDDI model, the first model to use deep learning in drug-drug interaction prediction. Structural Similarity Profiles (SSP) of drug pairs are generated from the SMILES (Simplified Molecular Input Line Entry Specification) data of the drugs, reduced in dimension by PCA, and then fed into a Deep Neural Network (DNN) for classification. Building on DeepDDI, Lee et al. [5] added two new data sources, processed in a way similar to the SSP generated from the drugs' SMILES data: target gene data, used to generate a TSP (Target Similarity Profile), and Gene Ontology (GO) data, used to generate a GSP (Gene Ontology Term Similarity Profile). These three feature vectors are reduced in dimension by an improved encoder and then concatenated into a single feature vector for the drug pair, which is fed into a DNN for training. This improved model uses more data and has higher accuracy. Also based on DeepDDI, Deng et al. [6] proposed a polymorphic deep learning model, which uses the complete screened information for training. It can use information related to a variety of drugs to learn more efficiently and has higher accuracy. Methods based on drug features have high accuracy on known datasets, but they have some limitations. The hypothesis that "drugs with similar chemical structures have similar DDIs" has not been scientifically verified. Thus, the prediction results may deviate considerably in actual clinical verification.
In recent years, a series of studies applying graph theory at the molecular level have achieved successful results, and many researchers are trying to use graph theory for DDI prediction. Zitnik et al. [7] proposed the Decagon model, which constructs a two-layer heterogeneous graph to predict the types of polypharmacy side effects of drug pairs whose drug targets are all proteins. In that study, a Graph Neural Network (GNN) was used to train the model by graph representation learning, and the GNN was shown to perform better in predicting DDIs than traditional shallow graph structure models and traditional graph embedding methods. Bougiatiotis et al. [8] extracted the three-dimensional relationships related to a specific disease from various databases and expressed them with the Unified Medical Language System (UMLS) to construct multiple knowledge graphs (KG) for specific diseases. Their model, DDI-BLKG, extracts drug features based on drug pathways, which offers useful insight for the prediction of DDIs. Lin et al. [9] extracted a large amount of drug-related data from databases and processed the data into triples. The triples are encoded to construct a huge KG. The feature vector of a drug is generated through two rounds of aggregation by GNN, so the vector includes not only the information of the drug itself but also the information of drug-related entities. Graph-based methods model drug action pathways and other data as graphs and use deep learning and other methods for training and prediction. The graph-based approach has good explanatory ability but sometimes neglects the information contained in the entities themselves.
Graph Neural Network (GNN) extends the convolutional neural network to non-Euclidean space, which provides a more natural and effective method for the modeling of graph structured data [10]. GNN can be regarded as an embedding method, which extracts the embedding vectors of adjacent nodes for updating its own embedding vectors, without the need for manual feature engineering [11].
Knowledge Graph (KG), as a knowledge representation and management method, was proposed by Google in 2012. In recent years, KG has become popular in academia and industry, and its use has expanded from the search engine field to all fields involving big data [18]. A KG is a graph-based data structure and is usually represented as triples, G = (head, relation, tail). Head and tail are the head entity and tail entity respectively; unlike web pages, they are entities in a knowledge base. Relation is the relation between them in the knowledge base, so that what would be a hyperlink between web pages becomes a semantic relation between entities.

Drug Structural Features
The data source for this part is DrugBank [13]. DrugBank is a drug knowledge database that describes clinical information on drugs, such as side effects and DDIs. DrugBank also provides molecular-level data, such as the chemical structure and the target proteins of a drug. SMILES (Simplified Molecular Input Line Entry Specification) is a specification that explicitly describes molecular structures using ASCII strings. SMILES can describe a three-dimensional chemical structure with a string of characters; Figure 1 shows a two-dimensional graph of the drug Leucovorin and its corresponding SMILES. SMILES can be imported by molecular editing software and converted into two-dimensional graphics or three-dimensional models of molecules. The SMILES2Vec [14] method was proposed to apply Seq2seq [15] technology from natural language processing to SMILES strings: the chemical structure information is used as the input to a deep neural network that predicts the physical properties of compounds. SMILES2Vec removes long SMILES (more than 250 letters) during preprocessing and applies one-hot coding to the remaining SMILES, converting each SMILES into a vector of length 26. The chemical structures of the drugs are preprocessed following this method, as shown in Figure 2. All the SMILES stored in DrugBank are converted into a word bag with 251 elements. One-hot encoding is then used to transform them into 251-dimensional vectors. Finally, PCA is used to reduce the 251-dimensional SMILES vectors to a specific lower dimension, and the resulting vector represents the structure of a drug.
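The encoding step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes character-level tokens, a hypothetical cutoff parameter `max_len`, and a multi-hot (presence) vector over the vocabulary. The toy vocabulary below has 7 tokens rather than the 251 reported for DrugBank, and the PCA step that would follow is omitted.

```python
def build_vocabulary(smiles_list):
    """Collect the sorted set of characters seen in the corpus (assumed tokenization)."""
    return sorted({ch for s in smiles_list for ch in s})


def encode_smiles(smiles, vocabulary):
    """Return a multi-hot vector: 1 where a vocabulary token occurs in the string."""
    present = set(smiles)
    return [1 if token in present else 0 for token in vocabulary]


def preprocess(smiles_list, max_len=250):
    """Drop overly long strings, then encode the rest against a shared vocabulary."""
    kept = [s for s in smiles_list if len(s) <= max_len]
    vocabulary = build_vocabulary(kept)
    return vocabulary, [encode_smiles(s, vocabulary) for s in kept]


# Toy corpus (ethanol, acetic acid, benzene), for illustration only.
corpus = ["CCO", "CC(=O)O", "c1ccccc1"]
vocab, vectors = preprocess(corpus)
```

Every drug is encoded against the same vocabulary, so all vectors share one length and can be stacked into a matrix before dimension reduction.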

Drug Topological Features
Construction of KG. Data from two databases are used to construct KGs, one per database, from which the topological features of the drugs are obtained. Kyoto Encyclopedia of Genes and Genomes (KEGG) [16] is a database resource for understanding advanced functions and utilities of biological systems from molecular-level information. There are multiple sub-databases under KEGG. Wang et al. [17] constructed a large, high-quality heterogeneous graph linking Patient, Disease, and Drug (PDD) in Electronic Medical Records (EMR). The PDD database extracts key medical entities from MIMIC-III (Medical Information Mart for Intensive Care III) [18] and links them to current biomedical knowledge graphs (including ICD-9 Ontology and DrugBank). The PDD graph is accessible on the web through SPARQL endpoints and provides information for medical research and treatment recommendations.
RDF (Resource Description Framework) [19] is a resource description language commonly used to represent KGs. The Bio2RDF project [20] provides tools to convert data to N-Quads or other RDF formats. The RDFLib library is then used to parse these N-Quads data and split them into triples (entity, relationship, entity), a format that is convenient for generating embedded features from the KG in later steps, as shown in Figure 3.
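The conversion from N-Quads to triples can be sketched as follows. The paper uses RDFLib for this step; the sketch instead uses a deliberately simplified regular expression as a stand-in so that it runs without third-party packages. It assumes every term is an IRI (literals and blank nodes are not handled), and the example IRIs are hypothetical.

```python
import re

# Matches: <subject> <predicate> <object> [<graph>] .
# Simplification: IRI terms only; literals and blank nodes are ignored.
_NQUAD_IRI_LINE = re.compile(
    r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>(?:\s+<[^>]*>)?\s*\.\s*$"
)


def nquads_to_triples(text):
    """Return (head, relation, tail) tuples, dropping the optional graph label."""
    triples = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = _NQUAD_IRI_LINE.match(line)
        if match:
            triples.append(match.groups())
    return triples


# Hypothetical IRIs, for illustration only.
sample = """
<http://example.org/drug/D1> <http://example.org/vocab/target> <http://example.org/gene/G1> <http://example.org/graph/kegg> .
<http://example.org/drug/D1> <http://example.org/vocab/pathway> <http://example.org/pathway/P1> .
"""
triples = nquads_to_triples(sample)
```

For real Bio2RDF dumps a full RDF parser is the safer choice, since literal values and escaped characters are common there.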

Figure 3 KG Construction
Here we introduce a metric to evaluate the KG. Density describes how densely the nodes in a graph or network are connected by edges. For a graph G with L edges and N nodes, the density is calculated as shown in (1). The density of the graph has a certain influence on the results of graph-based research and machine learning. This will be discussed in subsequent experiments.
We construct two KGs from KEGG and PDD respectively. The corresponding data are shown in Table 1. It can be seen from the table that the KEGG dataset contains more types of drugs, but the graph itself is relatively sparse, and the proportion of drugs with structure records is relatively low. The PDD dataset has fewer drug types, but the graph is denser, and the proportion of drugs with structure records is higher.
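As a rough illustration of the density metric, the sketch below assumes the standard definition for a simple graph, namely the ratio of existing edges to possible edges. Whether the KG is treated as directed or undirected is an assumption here, so the hypothetical `directed` flag exposes both variants.

```python
def graph_density(num_edges, num_nodes, directed=False):
    """Fraction of possible edges that are present in a simple graph.

    Undirected: L / (N * (N - 1) / 2); directed: L / (N * (N - 1)).
    """
    if num_nodes < 2:
        return 0.0  # no edge is possible with fewer than two nodes
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2
    return num_edges / possible


triangle_density = graph_density(num_edges=3, num_nodes=3)
```

A sparse graph such as the KEGG KG gives a value close to 0 under either variant, so the comparison between the two KGs does not depend on the choice as long as it is applied consistently.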
It should be noted that the positive and negative samples in the experiment are not the result of manual labeling, but come from the existing data in the database. For negative samples, this paper assumes that there is no DDI between the two drugs. In fact, there may be a DDI between them that has not been clinically verified and therefore has not been recorded in the database.
Extraction of topological features. Generally, models that use a KG to predict DDIs can only capture information within a small range. To expand the receptive field, obtain the rich entity information in the KG, and explore the potential correlation between drugs and other entities, the KGNN [9] model was proposed. KGNN extracts the higher-order structure and semantic relations of drugs by GNN, and learns the representation of drugs and their neighborhoods from the KG. We use the KGNN model to calculate the topological features of drugs on the KG, as shown in Figure 4. For each entity, the model samples several entities from the neighborhood of the entity and aggregates their information to form the topological feature representation of the entity. There are three entity aggregation methods: sum adds the entity's own representation and the neighborhood representation, concatenate joins the two end to end, and neighbor uses only the neighborhood representation without the information of the node itself. These three aggregation methods are abbreviated as sum, concat and neigh.
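The difference between the three aggregation methods can be sketched as below. This shows only the combination step under simplifying assumptions: the neighborhood is summarized by a plain mean (KGNN's own neighbor weighting is not reproduced), and the learned linear layer and activation that would follow are omitted. The function names are hypothetical.

```python
def mean_neighbors(neighbor_vecs):
    """Summarize sampled neighbors by an element-wise mean (assumed pooling)."""
    count = len(neighbor_vecs)
    return [sum(values) / count for values in zip(*neighbor_vecs)]


def combine(self_vec, neigh_vec, mode):
    """Combine an entity's own vector with its neighborhood summary."""
    if mode == "sum":
        return [s + n for s, n in zip(self_vec, neigh_vec)]
    if mode == "concat":
        return list(self_vec) + list(neigh_vec)
    if mode == "neigh":
        return list(neigh_vec)  # the entity's own vector is ignored
    raise ValueError(f"unknown aggregation mode: {mode}")


entity = [1.0, 2.0]
neighborhood = mean_neighbors([[3.0, 5.0], [5.0, 7.0]])
combined = {mode: combine(entity, neighborhood, mode) for mode in ("sum", "concat", "neigh")}
```

Note that concat doubles the vector length, so the layer that follows it needs a correspondingly larger weight matrix.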

Drug-drug Interaction Prediction
We use GNN to obtain drug topological features on the KG, and fuse drug structural features into the model to study their influence on DDI prediction. Hence, we propose a novel model, SmileGNN, as shown in Figure 5.
The algorithm can be summarized as follows. The SMILES2Vec method mentioned in section 2.1 is used to calculate the feature vector of the drug structure from SMILES data. The KGNN model is retained to calculate the drug topological features: a graph neural network (GNN) aggregates the entity information in the receptive field within two hops of the entity. The two kinds of features are then aggregated to obtain a comprehensive drug feature that includes both drug topological features and drug structural features. Two algorithms are specifically designed to aggregate drug structural features and drug topological features. See section 3.4 for the detailed algorithms and a comparative analysis.
On this basis, the interaction value between the two drugs is calculated. The value is then compared with a threshold of 0.5 to classify the pair as having a DDI or not having a DDI.
Here, $\hat{y}_{i,j}$ represents the predicted value, $y_{i,j}$ represents the true value of the drug pair in the dataset, and $Y$ represents the set of all drug pairs.
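A minimal sketch of the thresholding step, together with a training objective over predicted and true values. The binary cross-entropy form, the averaging over pairs, and the treatment of a score exactly equal to 0.5 are all assumptions made for illustration; they are common choices for binary DDI prediction but are not confirmed by the text.

```python
import math


def classify(score, threshold=0.5):
    """Label a drug pair as DDI (1) when its score exceeds the threshold."""
    return 1 if score > threshold else 0


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over all drug pairs (assumed objective)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


labels = [classify(s) for s in (0.7, 0.2)]
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```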

Experimental Settings
In this paper, the prediction of DDI is treated as a binary classification task. It is not necessary to predict the specific type of DDI or the side effects it may cause, but only to judge whether a DDI may exist between the drugs in a pair.
Metrics. ACC (Accuracy) and AUC (Area Under Curve) are mainly used as evaluation indexes for a series of models. In some comparative experiments, F1 is also used as an evaluation index.
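For reference, the three metrics can be computed as in the sketch below. It uses plain Python and the rank-based (pairwise comparison) definition of AUC, which is quadratic in the number of samples and intended only as an illustration; the example labels and scores are made up.

```python
def accuracy(y_true, y_pred):
    """Share of drug pairs whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (DDI) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [1 if s > 0.5 else 0 for s in scores]
results = (accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))
```

ACC and F1 depend on the 0.5 threshold, while AUC is computed from the raw scores and is threshold-free.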
Settings. The experiments are conducted on two datasets, KEGG and PDD. See section 2.2 for the construction and data features of the datasets. For each dataset, the parameter combination that achieves the highest AUC value is selected through grid search. The final parameters are shown in Table 2.
Baselines. In addition to KGNN, two classic models, DeepDDI and Decagon, are compared with the new model proposed in this paper. See Section 1 for a detailed introduction of these models.

Results and Analysis
The experimental results of these models are compared and analyzed, as shown in Table 3. SmileGNN achieves the best performance among all the models. Compared to the classic DeepDDI and Decagon models, there is a 5.3% and 8.0% improvement in AUC values, respectively. Compared with the KGNN model using drug topological features alone, it also has a certain performance improvement.
DeepDDI and Decagon are both pioneering models in the field of DDI prediction. However, their designs leave room for improvement, and their prediction performance is relatively poor. Among the graph-based methods, both Decagon and KGNN use only the topological features of the drug, but KGNN considers not only the topological features of the drug's own node but also those of the nodes in its neighborhood within a certain range, so it learns more information from the graph and performs better than Decagon. The new model SmileGNN proposed in this paper combines the topological features and structural features of the drug, and performs better than the Decagon and KGNN models, which use topological features alone, and the DeepDDI model, which uses structural features alone.
The SmileGNN model retains the method of the KGNN model for learning drug topological features, and has excellent performance. However, in learning drug structural features, the proposed model handles SMILES in a relatively independent and coarse way. In the future, the feature expression algorithm for drug structural features can be further optimized to improve the prediction ability of the model.

Ablation Study
SmileGNN adds the use of drug structural features to KGNN, and integrates multi-source information to predict new DDIs. The comparison with the original performance of the KGNN model [9] is an ablation experiment, so as to compare and analyze the influence of the new drug structural features on the model performance.
Experiments are carried out on the KEGG and PDD datasets for each of the three drug topological feature aggregation types, sum, concat and neigh. The experimental results are shown in Table 4. For both the KEGG and PDD datasets, SmileGNN, which uses drug structural features, performs better than KGNN with all three aggregation methods for drug topological features. Consistent with the KGNN model, SmileGNN performs best when using concat to obtain drug topological features, with the AUC value reaching 0.9521 and 0.9642 on the KEGG and PDD datasets respectively. This shows that the newly added drug structural features steadily improve the performance of the model. By comparing Table 4-1 with Table 4-2, it can be found that both KGNN and SmileGNN perform better on the PDD dataset than on the KEGG dataset. As for the improvement after adding SMILES, the PDD dataset shows a similar degree of improvement on top of its already good results, with about 1% improvement in ACC, AUC and F1 value.
According to the comparison of the KEGG and PDD datasets in section 2.2, the following conclusions can be drawn:
1. On the denser graph, the drug topology information learned by the model is richer and can better represent the drug topological features.
2. In the PDD data, a higher proportion of drugs have structure records, so drug structural features have a greater, and positive, influence on the model.
The dataset itself has a limitation: drug pairs labeled as having no DDI may in fact have DDIs. Therefore, the predicted results of the model cannot be arbitrarily close to 1, and the excellent performance obtained in training and cross-validation does not tell the whole story. In section 4, special attention is paid to drug pairs that are classified "incorrectly", i.e., those that the dataset records as non-DDIs but the model predicts as DDIs.

Case Study
 Influence of drug feature aggregation method.
Following the way KGNN aggregates the topological features of multiple nodes, the methods sum and concat are designed to aggregate the structural features and topological features of drugs by element-wise addition and concatenation respectively.
Two matrices are given as input: the drug topological feature matrix $A$, whose shape is $\text{batch} \times d_A$, and the drug structural feature matrix $B$, whose shape is $\text{batch} \times d_B$. For the sum method, a weight matrix $W$ of shape $d_A \times d$ and a bias vector $b$ are designed. Note that the matrices $A$ and $B$ must have the same shape, i.e. $d_A = d_B$. The output is shown in formula (3). For the concat method, a weight matrix $W$ of shape $(d_A + d_B) \times d$ and a bias vector $b$ are designed. The output is shown in formula (4).
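The two fusion methods can be sketched with plain Python lists as below. This is an illustration under assumptions rather than the authors' code: the output is taken to be a single linear layer applied to the summed or concatenated features, any activation function is omitted, and the helper names and toy values are hypothetical.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def add_bias(x, bias):
    """Add a bias vector to every row."""
    return [[value + b for value, b in zip(row, bias)] for row in x]


def fuse_sum(topo, struct, weight, bias):
    """(A + B) W + b; requires A and B to have the same shape."""
    if len(topo[0]) != len(struct[0]):
        raise ValueError("sum fusion needs equal feature dimensions")
    summed = [[a + s for a, s in zip(ra, rs)] for ra, rs in zip(topo, struct)]
    return add_bias(matmul(summed, weight), bias)


def fuse_concat(topo, struct, weight, bias):
    """[A ; B] W + b; the feature dimensions may differ."""
    joined = [list(ra) + list(rs) for ra, rs in zip(topo, struct)]
    return add_bias(matmul(joined, weight), bias)


A = [[1.0, 2.0]]  # topological features, batch of 1
B = [[3.0, 4.0]]  # structural features, batch of 1
out_sum = fuse_sum(A, B, [[1.0], [1.0]], [0.5])
out_concat = fuse_concat(A, B, [[1.0], [0.0], [0.0], [1.0]], [0.5])
```

The concat variant needs a larger weight matrix but places no constraint on the two feature dimensions, which is the flexibility argument made for it in the text.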
For the PDD dataset, with other parameters unchanged, both the drug topological feature dimension and the drug structural feature dimension are set to 64. The two aggregation methods are used to obtain drug features, and the other parameters are kept consistent. The experimental results are shown in Table 5. As can be seen from Table 5, when sum and concat are used to aggregate drug topological features and drug structural features, the sum method performs slightly better than the concat method, but the gap is not significant. Since the concat method is more flexible and places no requirement on the feature dimensions, all subsequent experiments adopt the concat method.
 Influence of drug structural feature dimension.
To measure the influence of the drug structural feature dimension on model training, and to study the information loss of PCA dimension reduction, we conduct the following experiment. The concat method is used throughout to connect drug topological features and structural features, the PDD dataset is used, and the PCA output dimension for drug structural features is set to 32, 64 and 96 respectively. Other parameters remain the same.
The three methods sum, concat and neigh are each used to obtain drug topological features, in order to observe whether the influence of the drug structural feature dimension is stable and consistent. As shown in Figure 6, when the drug structural feature dimension increases from 32 to 64, the performance of the models with all three topological aggregation methods improves slightly, indicating that the influence of the drug structural feature dimension on model performance is stable and consistent. When the dimension is increased from 64 to 96, the performance of the model is not significantly improved. In conclusion, when PCA is used to reduce the dimension of drug structural features, reducing from 251 to 64 dimensions works well and loses relatively little information. When the dimension is reduced further, much of the representation of drug structural features may be lost, and the performance of the final model is affected. Since the model already performs relatively well with 64-dimensional drug structural features, and higher dimensions consume more computing resources and storage space without an obvious performance gain, all experiments use 64-dimensional drug structural features.

Discussion
Instead of classifying drug pairs by thresholding the score at 0.5, the drug pairs with a score over 0.9 are output directly and ranked from highest to lowest. To get better results, the PDD dataset is used to obtain the drug pairs classified as DDIs, and pairs that are already recorded as DDIs in PDD are eliminated. The ten highest-scoring new DDI predictions are then queried in the latest DrugBank database. Those recorded as DDIs in DrugBank are marked as 1, and the others are marked as 0, as shown in Table 6.
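This selection procedure can be sketched as follows. The function name, the dictionary layout and the drug identifiers are hypothetical, and treating a pair as unordered (so that (a, b) and (b, a) are the same DDI) is an assumption.

```python
def top_novel_pairs(scores, known_pairs, min_score=0.9, k=10):
    """Rank high-scoring drug pairs that are not already recorded as DDIs.

    scores: dict mapping (drug_a, drug_b) tuples to predicted scores.
    known_pairs: iterable of pairs already recorded in the dataset.
    """
    known = {frozenset(pair) for pair in known_pairs}
    candidates = [
        (pair, score)
        for pair, score in scores.items()
        if score > min_score and frozenset(pair) not in known
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:k]


# Placeholder identifiers and scores, for illustration only.
predicted = {
    ("D1", "D2"): 0.95,
    ("D2", "D3"): 0.99,
    ("D1", "D3"): 0.85,
    ("D3", "D4"): 0.93,
}
recorded = [("D2", "D1")]
novel = top_novel_pairs(predicted, recorded, k=2)
```

The returned pairs would then be looked up manually or programmatically in a newer database release to mark each as verified or not.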
The PDD dataset used is version 1.3, which was uploaded in October 2018. The DDIs in the PDD dataset were extracted from version 5.1.1 of DrugBank, which was uploaded in July 2018. The latest DrugBank database is version 5.1.8, uploaded in January 2021. There is therefore a 2.5-year gap, during which many new DDIs were discovered and verified. It can be seen that the five new DDIs marked in Table 6 have been clinically verified and included in the DrugBank database within the past two years, while the remaining five DDIs have not been experimentally verified yet. This suggests that the model proposed in this paper is reliable for the prediction of novel DDIs, and that the experimental results can provide useful guidance for clinical trials of novel DDIs.
In the following, two drug pairs are studied separately, and the influence of drug structural features and drug topological features on drug pair interaction prediction is discussed. It can be seen that both drug pairs [DB00437, DB00959] and [DB00437, DB00633] have high scores above 0.99, and both contain drug DB00437.
According to the SSP calculated in DeepDDI [4], the structural similarity between drug DB00959 and drug DB00633 is only about 35.19%, which is not high. However, only 72% of the drugs in the PDD dataset have SMILES data. Since both drugs in a pair need SMILES data for a similarity to be computed, only about $0.72^2 \approx 52\%$ of drug pairs can be compared directly, assuming pairs are drawn uniformly, so structural similarity cannot be calculated directly for about 48% of the drug pairs. In the context of such sparse data, a similarity of 35.19% can still have a considerable impact on the results.
In the drug targeting data, it is found that both drug DB00959 and drug DB00633 act on Cytochromes P450 group protein enzymes. Due to the similar pathway of action, the model is more inclined to believe that drug DB00959 and drug DB00437 also have DDIs. DDI records in the DrugBank database show that the adverse drug event of drug combination [DB00437, DB00633] is due to competition for the excretory pathway of the kidney [21]. Based on the relevant information in literature and a series of databases, it is believed that the interaction mechanism of this drug pair is not obviously related to the protein enzymes of Cytochromes P450 group [22,23].
This example shows that SmileGNN can make good use of the known drug structural information and drug topological information to predict DDIs. However, on the one hand, the model is limited by insufficient information on drug structure; on the other hand, the learning of topological information in the KG is relatively blind and random. In the future, this model still has room for improvement in learning drug features.

Conclusion
In this paper, a new model SmileGNN (model based on SMILES and graph neural network) is proposed to predict drug-drug interactions by comprehensively using drug structural features and drug topological features. We implement the proposed method and conduct experimental comparisons on two datasets. Through experiments, it is verified that SmileGNN has better performance than the classic models and KGNN. According to the latest database, SmileGNN's prediction results are credible.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All code is available at: https://github.com/AshleyHan/SmileGNN