Learning Multi-Types of Neighbor Node Attributes and Semantics by Heterogeneous Graph Transformer and Multi-View Attention for Drug-Related Side-Effect Prediction

Since side-effects of drugs are one of the primary reasons for their failure in clinical trials, predicting side-effects can help reduce drug development costs. We propose a method based on a heterogeneous graph transformer and capsule networks for side-effect-drug-association prediction (TCSD). The method encodes and integrates attributes from multiple types of neighbor nodes, connection semantics, and multi-view pairwise information. In each drug-side-effect heterogeneous graph, a target node has two types of neighbor nodes: drug nodes and side-effect nodes. We propose a new heterogeneous graph transformer-based context representation learning module, which encodes the specific topology and the contextual relations among multiple kinds of nodes. There are similarity and association connections between the target node and its various types of neighbor nodes, and these connections imply semantic diversity. Therefore, we designed a new strategy to measure the importance of a neighboring node to the target node and to incorporate the different semantics of the connections between the target node and its multi-type neighbors. Furthermore, we designed attentions at the neighbor node type level and at the graph level, respectively, to obtain enhanced informative neighbor node features and multi-graph features. Finally, a pairwise multi-view feature learning module based on capsule networks was built to learn the pairwise attributes from the heterogeneous graphs. Our prediction model was evaluated on a public dataset, and the cross-validation results showed that it achieved superior performance to several state-of-the-art methods. Ablation experiments demonstrated the effectiveness of the heterogeneous graph transformer-based context encoding, the position-enhanced pairwise attribute learning, and the neighborhood node category-level attention.
Case studies on five drugs further showed TCSD’s ability in retrieving potential drug-related side-effect candidates, and TCSD inferred the candidate side-effects for 708 drugs.


Introduction
The side-effects of drugs are defined as effects occurring in the body when the drug is administered at therapeutic doses that are unrelated to its therapeutic purpose, including adverse reactions that may cause the drug to fail in clinical trials [1][2][3]. Therefore, precise and efficient identification of drug-related side-effect candidates can aid in lowering drug development costs and enhancing drug safety [4,5]. Computational methods have demonstrated their ability to aid in drug discovery [6] and computer-aided drug design (CADD) [7]. They can also screen for reliable drug-related side-effect candidates [8][9][10].
Currently used drug-side-effect association prediction methods fall into three categories. The first category estimates drug-side-effect association likelihoods based on drug-associated proteins. New indications and adverse reactions are usually caused by unexpected chemical-protein interactions at off-target sites; therefore, the targeted protein information of a drug is used to predict its side-effects. Compound-protein interaction (CPI) sets [11,12] and drug-protein interactions (DPIs) can also be used to infer drug-related side-effect candidates [13]. However, this class of methods is limited in that only a small fraction of the structural information for the drug-associated proteins is available [14].
A second class of predictive models uses machine learning to screen candidates for drug-related side-effects. To combine data on drugs, proteins, and side-effects, five machine learning techniques were used: logistic regression, naive Bayes, k-nearest neighbors, random forest, and support vector machine [15]. Approaches to infer potential drug-side-effect associations are based on multi-label learning [16], on multiple kernel learning and least squares [17], on random forests [18], on a random walk and skip-gram algorithm [19], on feature-derived graph-regularized matrix factorization for predicting drug side-effects (FGRMF) [20], on triple matrix factorization based on kernel target alignment [21], and on non-negative matrix factorization [22]. Mohsen et al. [23] constructed a framework based on a deep neural network (DNN) for inferring the candidates. However, such models are shallow prediction models which have difficulty in fully extracting the complicated and nonlinear associations between drugs and side-effects.
The third category establishes prediction models based on deep learning to further enhance prediction performance by extracting deep, representative features of the drug and side-effect nodes. The training process of a deep learning model usually needs several hours or tens of hours; on the other hand, when the model is applied to infer the association possibility of a drug-side-effect pair, it typically needs less than a second. The newly advanced models make full use of the diverse data related to drug and side-effect nodes for drug-side-effect association prediction, including the similarity and association information of drugs and side-effects as well as the association information of drugs and diseases. Several approaches integrate multi-source data on drugs and side-effects, including through graph attention networks [24], a similarity-based deep learning approach for determining the frequencies of drug side-effects (SDPred) using a multi-layer perceptron [25], and graph convolutional autoencoders with convolutional neural networks [26]. Recently, hybrid graph neural network models incorporating graph-embedding and node-embedding modules have been used to model drug-side-effect associations and to provide candidate predictions [27]. Although deep models have shown improvements in drug-side-effect association prediction, the above models cannot adequately fuse the features of the edges between the source and target nodes and do not integrate the rich positional information in the feature embedding of the node pairs. Our model aggregates the information from multiple types of neighbor nodes and encodes the semantic information of the various connections. Moreover, an attribute learning module is built to learn the pairwise attributes from a multi-capsule perspective.
In this study, we propose a novel prediction model, TCSD, for integrating the various neighbor attributes, the diverse connection semantics, and the pairwise attributes. TCSD's main contributions are as follows: (1) Two heterogeneous graphs composed of drug and side-effect nodes are constructed by utilizing two types of drug similarities to complement the encoding of the specific topology structure and node attributes of each heterogeneous graph. A target node in each graph has drug neighbor nodes and side-effect neighbor nodes, and there are contextual relationships among the attributes of the target node and those of its diverse neighbor nodes. Most previous approaches have focused only on aggregating the information of a single type of neighbor node. A module based on a graph transformer is established to learn category-sensitive attributes for each category of neighbor nodes. (2) Previous approaches did not fully utilize the diverse information of multiple types of connections among the drug and side-effect nodes. To improve the node feature-learning capacity in each heterogeneous graph, we design a strategy to integrate the similarity semantic connections between drugs (side-effects) and the association semantic connections between drugs and side-effects. (3) We design two attention mechanisms for the effective fusion of the learned information. To adaptively fuse the encoded contextual features from the drug neighbor nodes and the side-effect neighbor nodes for each target node, we design an attention at the neighbor category level. Since the two heterogeneous graphs make different contributions to drug-related side-effect prediction, we design an attention from the graph perspective to discriminate their contributions. (4) Finally, we propose a capsule network-based strategy to learn the attributes of a pair of drug and side-effect nodes. The created multiple capsules and the dynamic routing mechanism enhance position information learning in the pairwise attribute embedding. Previous approaches did not integrate the information of the positions in the pairwise embedding. A comprehensive comparison with state-of-the-art methods and case studies on five drugs showed TCSD's superior performance and its ability to discover potential association candidates.

Materials and Methods
The new prediction model TCSD is presented in Figure 1. It integrates the multi-modality similarities of drugs and side-effects, neighbor context encoding, and pairwise feature representation to predict drug-related potential side-effects. First, two drug-side-effect heterogeneous graphs were created based on the associations between drugs and side-effects as well as the multi-modality similarities (Figure 1a). Afterwards, to learn the neighbor context encoding of the target node, we built a transformer-based context encoding (CET) module using a neighbor node category-level and a graph-level attention mechanism (Figure 1b), with detailed structures as shown in Figure 2. In parallel, a capsule network-based pairwise multi-view feature (MVF) learning module (Figure 1c) was used to learn the feature map of a pair of drug and side-effect nodes.

Dataset
Public databases [28,29] and papers [26,30] addressing drug-side-effect associations, side-effect similarities, drug chemical substructure similarities, and drug functional similarities were used to gather data on drugs and side-effects. Initially, 80,164 pairs of drug and side-effect associations were retrieved from the SIDER database [28]. We obtained the chemical substructural similarities from the comparative toxicogenomics database [29], which includes the chemical substructures of 708 drugs. The disease-based drug similarities were obtained from a previous study [31]. These associations and similarities included 708 drugs, 4192 side-effects, and 5603 diseases.

Multi-Source Data Matrix Representation and Construction of Heterogeneous Graphs

Matrix Representation of Drug-Side-Effect Associations
We created an association matrix A = [A_{i,j}] ∈ R^{N_r × N_s} according to the discovered associations of the drug-side-effect node pairs. This matrix records the relationships between N_r drugs and N_s side-effects: the drugs are represented by the rows of A and the side-effects by the columns. If a drug r_i and a side-effect s_j are known to be associated, then A_{i,j} = 1; otherwise, A_{i,j} = 0.
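As a minimal sketch, the association matrix described above can be assembled from a list of known (drug, side-effect) index pairs; the pair list and dimensions below are illustrative, not taken from the dataset:

```python
import numpy as np

def build_association_matrix(pairs, n_drugs, n_side_effects):
    """A[i, j] = 1 if drug i is known to be associated with side-effect j."""
    A = np.zeros((n_drugs, n_side_effects), dtype=np.float32)
    for i, j in pairs:
        A[i, j] = 1.0
    return A

# Hypothetical toy data: 3 drugs, 2 side-effects, 2 known associations.
A = build_association_matrix([(0, 1), (2, 0)], n_drugs=3, n_side_effects=2)
```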

Matrix Representation of Multi-Modality Similarities of Drugs
When two drugs r_i and r_j are associated with a greater number of similar diseases, the functional similarity of the two drugs is usually greater. We therefore computed the functional similarity D^dis_{i,j} between a pair of drug nodes r_i and r_j based on the diseases they are connected with, in accordance with the work of Wang et al. [31]. Similarly, a greater similarity in the chemical substructures of r_i and r_j indicates a greater similarity between the drugs themselves. Based on this biological premise, D^che_{i,j} was calculated following Luo et al. using the cosine similarity of the drug chemical substructures [30]. Using the drug-related multi-source data, we obtained a multi-modality drug similarity matrix D^ρ, where ρ = che or dis and D^ρ_{i,j} denotes the ρ-type similarity of r_i and r_j. The value of D^ρ_{i,j} increases with the degree of resemblance between r_i and r_j.
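A cosine-similarity matrix of the kind used for D^che can be sketched as follows, assuming the drugs are represented by binary substructure fingerprint vectors (the fingerprints below are illustrative, not real drug data):

```python
import numpy as np

def cosine_similarity_matrix(F):
    """F: (N_r, d) substructure fingerprints; returns an (N_r, N_r) matrix
    whose (i, j) entry is the cosine similarity of drugs i and j."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard against all-zero fingerprints
    Fn = F / norms
    return Fn @ Fn.T

# Three hypothetical drugs over three substructure bits.
F = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
D_che = cosine_similarity_matrix(F)
```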

Matrix Representation of Side-Effect Similarity
A greater number of similar drugs being associated with side-effects s_i and s_j indicates a greater similarity between s_i and s_j. We calculated the similarity matrix S = [S_{i,j}] ∈ R^{N_s × N_s} of all side-effects based on the approach adopted by Wang et al. [26]. S_{i,j}, a number between 0 and 1, indicates how similar side-effects s_i and s_j are to one another; the larger the value, the higher the similarity.

Construction of Drug-Side-Effect Heterogeneous Graphs and Attribute Extraction
D^che and D^dis represent the similarities according to the chemical substructures of the drugs and the diseases they are associated with, respectively. We created two drug-side-effect heterogeneous graphs G^ρ relying on D^che and D^dis, respectively, where ρ = che or dis. The set of nodes V = {V_r ∪ V_s} in each heterogeneous graph comprises the set of drug nodes V_r and the set of side-effect nodes V_s; an edge e^ρ_{i,j} ∈ E^ρ with a weight w^ρ_{i,j} ∈ W^ρ links a pair of nodes v_i, v_j. Several types of connecting edges can exist between drugs and side-effects: a drug-side-effect association edge e_rs, a drug-drug similarity edge e_rr, and a side-effect-side-effect similarity edge e_ss. W^ρ contains the association matrix A and the similarity matrices S and D^ρ. The adjacency matrix of the ρth heterogeneous graph is expressed as I^ρ = [[D^ρ, A], [A^T, S]] ∈ R^{N_total × N_total}, where N_total = N_r + N_s is the total number of nodes and A^T denotes the transpose of A. The i-th row of I^ρ records the associations and similarities of node v_i with all of the drugs and side-effects, which are considered the node attributes of v_i. The attribute vector x^ρ_i of the drug r_i is defined as x^ρ_i = D^ρ_{i,·} ⊕ A_{i,·}, where ρ = che or dis and ⊕ denotes concatenation. A_{i,·}, the i-th row of A, records each side-effect's association with r_i, and D^che_{i,·} (D^dis_{i,·}) is the i-th row of D^che (D^dis), containing the chemical substructural (functional) similarities of r_i with all drugs.
Similarly, the attribute vector of the side-effect s_j is represented as y_j = A^T_{·,j} ⊕ S_{j,·}, where A_{·,j} (S_{j,·}) denotes the associations (similarities) of s_j with all drugs (side-effects). The feature-embedding matrix Z^ρ of the node pair r_i and s_j is defined by stacking x^ρ_i and y_j, so that Z^ρ ∈ R^{2 × N_total}.
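Assuming the block structure described above (similarity matrices on the diagonal blocks, the association matrix and its transpose off the diagonal), the adjacency matrix I^ρ and the row-wise attribute vectors can be sketched as:

```python
import numpy as np

def build_heterogeneous_adjacency(D_rho, A, S):
    """I_rho = [[D_rho, A], [A^T, S]]; row i is the attribute vector of node v_i."""
    top = np.hstack([D_rho, A])
    bottom = np.hstack([A.T, S])
    return np.vstack([top, bottom])

# Toy example: 2 drugs, 3 side-effects (identity similarities for brevity).
D = np.eye(2)
S = np.eye(3)
A = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
I = build_heterogeneous_adjacency(D, A, S)   # shape (5, 5)
x_0 = I[0]        # attribute vector of drug r_0:  D[0] concatenated with A[0]
y_1 = I[2 + 1]    # attribute vector of side-effect s_1: A[:, 1]^T with S[1]
```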

Context Representation Learning Based on Transformer with Attention
The target node's attributes are contextually linked to the attributes of the neighbors of each category in its neighborhood. To learn the context representations of the nodes, we designed the CET module to aggregate information regarding a target node's neighbor nodes. As each heterogeneous graph has a unique topology, we used a graph transformer (GT) module (Figure 2) for each of G^che and G^dis. The semantic information of the similarity or association edges between the neighbor node and the target node was used to learn the corresponding neighborhood context representation. The module comprises l_e encoding layers; layer l serves as an illustration of how the context is learned. The CET module's learning processes for drug nodes and side-effect nodes are similar, so we describe the drug r_i as an example.

Neighborhood Node Set Extraction
Based on the similarity between the drug r_i and all drugs, we obtained the top N_t most similar neighbors of r_i. If N_t = 4, let r_i, r_a, r_b, and r_c be the four top neighbor nodes, with attribute vectors x^ρ_i, x^ρ_a, x^ρ_b, and x^ρ_c, respectively. The set of attribute vectors of the drug neighbor nodes of r_i is denoted as S_{r_i,r} = {x^ρ_i, x^ρ_a, x^ρ_b, x^ρ_c}. Similarly, we can obtain the N_k side-effect neighbor nodes associated with r_i. When N_k = 3, the side-effect neighbors of r_i are s_a, s_b, and s_c, with attribute vectors y_a, y_b, and y_c; thus, the set of attribute vectors of the side-effect neighbor nodes of r_i is S_{r_i,s} = {y_a, y_b, y_c}. Inspired by the Transformer, we mapped the attribute vector x^ρ_i of r_i to a query vector space, and S_{r_i,r} to a key vector space and a value vector space. To reduce bias in the contextual semantic learning process, we established a multi-head attention mechanism. In the t-th attention head, because each drug neighbor contributes differently to r_i, we employed a neighbor node-level attention mechanism to calculate the attention weight of r_i for each neighbor. The query vectors of encoding layers 1 and l are q^{ρ,1}_{t,i} = W^1_{t,Q} x^ρ_i and q^{ρ,l}_{t,i} = W^l_{t,Q} c^{ρ,l−1}_i, where W^1_{t,Q} ∈ R^{n × N_total} and W^l_{t,Q} ∈ R^{n × N_total} are the weight matrices of layers 1 and l, respectively; c^{ρ,l−1}_i is the encoded vector of r_i obtained in layer l − 1, and l_e is the number of encoding layers. The key matrix K^{ρ,l}_t ∈ R^{4 × n} and value matrix V^{ρ,l}_t ∈ R^{4 × n} for r_i are computed from S_{r_i,r} with the weight matrices W^l_{t,K} and W^l_{t,V}, respectively.
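The neighbor-set extraction step can be sketched as below; the similarity matrix is illustrative, and excluding the target from its own ranking is an assumption (the text includes r_i itself among the four top neighbors, so N_t here counts only the other drugs):

```python
import numpy as np

def top_similar_neighbors(D_rho, i, n_t):
    """Indices of the n_t drugs most similar to drug i (the drug itself excluded)."""
    sims = D_rho[i].copy()
    sims[i] = -np.inf                 # exclude the target from the ranking
    return np.argsort(sims)[::-1][:n_t]

# Hypothetical 4-drug similarity matrix.
D = np.array([[1.0, 0.9, 0.2, 0.6],
              [0.9, 1.0, 0.3, 0.1],
              [0.2, 0.3, 1.0, 0.4],
              [0.6, 0.1, 0.4, 1.0]])
neighbors = top_similar_neighbors(D, 0, n_t=2)   # two most similar to drug 0
```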

Contextual Encoding of Nodes of the Same Type
All of the drug-type neighbor nodes of drug r_i form the set {r_i, r_m}, m = a, b, c, and a contextual connection exists between the attributes of r_i and those of these neighbor nodes. We therefore gather information about the neighbors of r_i to update its attribute vector. The attention score of r_v to r_i is calculated as α^{ρ,l}_t(r_i, r_v), where v = i, a, b, or c, and W^l_{t,D} ∈ R^{n × n} is a weight matrix specific to the drug-type neighbor nodes of r_i, used to fuse the corresponding semantic information of each connection (similarity connection or association connection). Then, for the neighbor nodes r_i, r_a, r_b, and r_c of r_i and the obtained scores α^{ρ,l}_t(r_i, r_i), α^{ρ,l}_t(r_i, r_a), α^{ρ,l}_t(r_i, r_b), and α^{ρ,l}_t(r_i, r_c), the normalized attention weight is obtained by a softmax, γ^{ρ,l}_t(r_i, r_v) = exp(α^{ρ,l}_t(r_i, r_v)) / Σ_m exp(α^{ρ,l}_t(r_i, r_m)). The drug-neighbor encoding y^{ρ,l}_{t,e_rr}(r_i) ∈ R^n of r_i is the attention-weighted sum of the value vectors. Finally, the context encoding y^l_{e_rr}(r_i) ∈ R^{nT} at the drug neighbor node level of r_i is formed by concatenating the encoding vectors of the T attention heads. Similarly, for the set {s_a, s_b, s_c} of side-effect neighbor nodes of r_i, we can obtain the context encoding y^{ρ,l}_{e_rs}(r_i) specific to that class of neighbor nodes.
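A single attention head of the kind described above might be sketched as follows; the exact score function is not fully recoverable from the text, so the scaled dot-product form with an edge-semantic weight matrix W_d (standing in for W^l_{t,D}) is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neighbor_attention(q, K, V, W_d):
    """One attention head over a target's same-type neighbors.
    q: (n,) query of the target; K, V: (m, n) keys/values of m neighbors;
    W_d: (n, n) weight matrix fusing the edge semantics (an assumption)."""
    n = q.shape[0]
    scores = (K @ W_d @ q) / np.sqrt(n)   # one score alpha per neighbor
    gamma = softmax(scores)               # normalized attention weights
    return gamma @ V                      # weighted sum of value vectors

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
enc = neighbor_attention(q, K, V, np.eye(4))   # encoding in R^n
```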

Neighborhood Node Category-Level and Graph-Level Attention Mechanisms
Since the drug node r_i has two types of neighbor nodes, drugs and side-effects, we learn the context encodings y^{ρ,l}_{e_rr}(r_i) and y^{ρ,l}_{e_rs}(r_i) of r_i, respectively. As y^{ρ,l}_{e_rr}(r_i) and y^{ρ,l}_{e_rs}(r_i) differ in their contributions to the final contextual representation of r_i, we propose a neighborhood node category-level attention mechanism. An attention score is obtained for each neighbor category u ∈ {r, s} using the weight matrix W_{u,nei} of that category. The contextual encoding of r_i, as enhanced by the attention mechanism, is obtained as Z^{ρ,l}_{con}(r_i) ∈ R^{nT}. The encoding result Z^{ρ,l_e}_{con}(r_i) ∈ R^{n_fin} obtained by the l_e-th GT layer contains contextual information regarding the two types of neighbor nodes of r_i in the heterogeneous graph G^ρ, with the discriminative semantics of the connected edges; it is renamed Z^ρ(r_i).
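The category-level fusion can be illustrated with a simple softmax-weighted combination of the two category encodings; the scoring vectors w_r and w_s below are hypothetical scalar-projection stand-ins for the learned weight matrices W_{u,nei}:

```python
import numpy as np

def category_level_fusion(y_rr, y_rs, w_r, w_s):
    """Fuse the drug-neighbor and side-effect-neighbor encodings of a target
    with softmax-normalized category attention weights."""
    scores = np.array([w_r @ y_rr, w_s @ y_rs])
    e = np.exp(scores - scores.max())
    beta = e / e.sum()                        # category attention weights
    return beta[0] * y_rr + beta[1] * y_rs

y_rr = np.array([1.0, 0.0])   # toy drug-neighbor encoding
y_rs = np.array([0.0, 1.0])   # toy side-effect-neighbor encoding
w = np.array([0.5, 0.5])
z = category_level_fusion(y_rr, y_rs, w, w)   # equal scores -> equal weights
```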
x^ρ_i contains more detailed information, while Z^ρ(r_i) is the learned, representative neighborhood contextual encoding. We therefore added the information from x^ρ_i to Z^ρ(r_i). Given the original attribute vector x^ρ_i of r_i, we first applied a linear projection S-Linear^ρ to map it to the attribute space of Z^ρ(r_i). Then, we superimposed it on Z^ρ(r_i) to obtain a complemented neighbor context encoding Z_add(r_i), where σ is the ReLU activation function [32].
The heterogeneous graphs G^che and G^dis were learned by the CET module to obtain the contextual encodings of r_i and s_j, represented as Z^ρ_add(r_i) and Z^ρ_add(s_j) (ρ = che or dis), respectively. Z^che_add(r_i) and Z^dis_add(r_i), as well as Z^che_add(s_j) and Z^dis_add(s_j), were stacked vertically to form Z^che_add(r_i−s_j), Z^dis_add(r_i−s_j) ∈ R^{2 × n_fin}. Z^che_add(r_i−s_j) and Z^dis_add(r_i−s_j) were fused by a 1 × 1 convolution to form a contextual representation Z_fin(r_i−s_j) ∈ R^{2 × n_fin} of the node pair. Z_fin(r_i) and Z_fin(s_j) were then concatenated to form the feature vector Z_{i,j} ∈ R^{2n_fin} of the r_i−s_j node pair. y_CET denotes the probability distribution of whether r_i and s_j are related, obtained from Z_{i,j} with the weight matrix W_f and bias vector b_f. y_CET = (y^0_CET, y^1_CET), where y^0_CET is the probability that the drug r_i and side-effect s_j are not associated and y^1_CET is the probability that they are associated.

Local Information Enrichment Strategy for Drug-Side-Effect Node Pair Feature Representation Learning Based on Capsule Networks
Given Z^ρ ∈ R^{2 × N_total}, which contains information regarding the similarity and association of r_i and s_j with all drugs and side-effects and comprises 2·N_total elements, we built the capsule network-based MVF module to deeply integrate the characteristics of multiple elements at the same position from multiple views. These characteristics form a capsule, and all newly created capsules pass through a routing mechanism to further evaluate the association scores of the node pairs. The MVF module contains two convolutional layers and two capsule layers; the detailed architecture is given in Figure 3.

Establishment of Primary Capsule Embedding Based on Convolution Operation
The feature-embedding matrices of a node pair r_i and s_j in the heterogeneous graphs G^che and G^dis are Z^che and Z^dis, respectively. Z^che and Z^dis were stacked to form the node-pair feature-embedding matrix Z ∈ R^{2 × 2 × N_total} of r_i and s_j. Z was fed to the convolution module to form the embedding of the primary capsule layer. The convolution module contained one single-group convolutional layer and one multi-group convolutional layer. In the first convolutional layer, we applied one round of zero-padding to Z to create a new matrix Ẑ for learning the edge information. Let l_f and w_f be the length and width of the filter, respectively. With n_f filters, the filter W_conv1 ∈ R^{l_f × w_f × n_f} was applied to Ẑ to produce the feature maps Z_conv1,k(i, j) = f(W_conv1,k · Ẑ_{k,i,j} + b_conv1), where f is the ReLU activation function [32] and b_conv1 is the bias vector. Z_conv1,k(i, j) is the element in the i-th row and j-th column of the k-th feature map Z_conv1,k, Ẑ(i, j) is the element of Ẑ in row i, column j, and Ẑ_{k,i,j} is the region inside the k-th filter when it slides to position Ẑ(i, j). We built a w-group convolution in the second layer. Each group of convolutions can be considered a view of the feature map, so the attributes of the node pairs can be learned from multiple views. The filter size in each group was W_conv2 ∈ R^{2 × 2}, and Z_conv1 was fed to the second convolutional layer to form Z^w_conv2 ∈ R^{w × 2 × N_total}.

Creation of the Primary Capsule Layer
We encapsulated the values Z^1_conv2(p), Z^2_conv2(p), …, Z^w_conv2(p) at the p-th position (p = 1, 2, …, 2·N_total) of the w feature maps Z^1_conv2, Z^2_conv2, …, Z^w_conv2 into a capsule u_p ∈ R^w. This capsule contains multi-view information about the local area covered when the filter slides to the p-th position of the feature map Z_conv1. The primary capsule layer thus contains 2·N_total capsules of w-dimensional vectors.
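Gathering the p-th element of every view into one capsule amounts to a transpose of the stacked feature maps, as this small sketch shows (shapes are illustrative):

```python
import numpy as np

def to_primary_capsules(feature_maps):
    """feature_maps: (w, P) array of w views over P positions.
    Returns (P, w): capsule u_p collects the p-th element of every view."""
    return feature_maps.T

fmaps = np.arange(12, dtype=float).reshape(3, 4)   # w = 3 views, P = 4 positions
U = to_primary_capsules(fmaps)                     # each U[p] is in R^w
```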

Design of Capsule Layer Routing Mechanism
We used primary and digital capsule layers to build the MVF module. The digital capsule layer consists of n_qn prediction capsules v_q (q = 1, 2, …, n_qn) of dimension n_qd; each receives input from all primary capsules u_p (p = 1, 2, …, 2·N_total) of the previous layer. Location information is delivered from the primary capsule layer to the digital capsule layer by weights determined by a routing mechanism. First, u_p was multiplied by the weight matrix W_pq to obtain the prediction vector û_{q|p} = W_pq u_p ∈ R^{n_qd}. û_{q|p} was fed into the prediction capsule v_q according to the coupling coefficients c_pq determined by the dynamic routing process, which are proportional to the weights of the features. We performed the dynamic routing process n_dr times to compute c_pq. We first initialized the weight b_pq = 0 between capsule p and capsule q. Next, the coupling coefficient c_pq was obtained by normalizing the weights b_pq with a softmax, and the output vector o_q was generated by the weighted sum o_q = Σ_p c_pq û_{q|p}. The modulus lengths of the two output capsules were used as the uncorrelated and correlated scores between r_i and s_j, respectively. o_q was passed through a nonlinear compression (squashing) function to produce the output capsule v_q = (‖o_q‖² / (1 + ‖o_q‖²)) · (o_q / ‖o_q‖), so that the modulus length of v_q lies between 0 and 1. The weights are updated as b_pq ← b_pq + û_{q|p} · v_q, where · denotes the dot product of two vectors. One pass of the routing mechanism is completed after updating b_pq; after n_dr passes, the coupling coefficients c_pq are finally determined and the final prediction capsules v^fin_q are formed. The modulus length of each vector is passed through a softmax layer to obtain the associated probability distribution y_MVF, comprising the probabilities that the drug-side-effect node pair is not associated and that it is associated.
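The squashing function and dynamic routing loop described above can be sketched in NumPy as follows; this is a generic routing-by-agreement implementation under the stated update rules, not the authors' code, and the shapes are illustrative:

```python
import numpy as np

def squash(o):
    """Nonlinear compression keeping direction; modulus length lies in (0, 1)."""
    norm2 = np.sum(o * o, axis=-1, keepdims=True)
    return (norm2 / (1.0 + norm2)) * o / np.sqrt(norm2 + 1e-9)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (P, Q, d) prediction vectors from P primary to Q output capsules."""
    P, Q, _ = u_hat.shape
    b = np.zeros((P, Q))                                      # routing logits
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over Q
        o = np.einsum('pq,pqd->qd', c, u_hat)                 # weighted sum
        v = squash(o)
        b = b + np.einsum('pqd,qd->pq', u_hat, v)             # agreement update
    return v

rng = np.random.default_rng(1)
u_hat = rng.standard_normal((5, 2, 4))   # 5 primary capsules, 2 outputs, dim 4
v = dynamic_routing(u_hat, n_iter=3)     # output capsules with modulus < 1
```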

Final Integration and Optimization
The cross-entropy between the true label z and the predicted association probability y_CET was used as the loss function LOSS_CET when the prediction is based on the node neighbor context encoding, where N_train is the number of training samples and the predicted results are classified as associated or not associated (c = 2). The true label z_i = 1 (z_i = 0) indicates a true association (no association) between a drug and a side-effect. In the MVF module, the cross-entropy-based loss LOSS_MVF is defined analogously. We used the Adam algorithm [33] to optimize the loss functions LOSS_CET and LOSS_MVF. Finally, a weighted sum of y_CET and y_MVF was calculated to obtain the final predicted association score y, where γ ∈ (0, 1) is a hyperparameter adjusting the two knowledge contributions.
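The final fusion can be sketched as below; note that which of the two scores receives the weight γ is an assumption, since the text only states that a weighted sum with hyperparameter γ is taken:

```python
def fuse_scores(y_cet, y_mvf, gamma=0.3):
    """Final association score as a weighted sum of the two modules' outputs.
    The assignment of gamma to y_cet (rather than y_mvf) is an assumption."""
    return gamma * y_cet + (1.0 - gamma) * y_mvf

y = fuse_scores(0.9, 0.7)   # 0.3 * 0.9 + 0.7 * 0.7 = 0.76
```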

Parameter Settings and Evaluation Metrics
TCSD was implemented in the PyTorch framework using a graphics processing unit (Nvidia GeForce GTX 2080Ti). For the CET module, the number of neighbor nodes per class was N_t = N_k = 10, the number of encoding layers was l_e = 2, and the number of heads for the multi-head attention was set to 8. The two encoding layers' output feature dimensionalities were 2400 and 2000. In the MVF module, the first convolutional layer included 64 filters, while the second layer had w = 8 groups of convolutions with 512 filters, and the size of all filter kernels was set to 2 × 2. The numbers of capsules in the primary and digital capsule layers were 4900 and 2, respectively. The dimensionality of each digital capsule was set to 32, and the number of routing iterations was n_dr = 3. The parameter γ for the final fusion was set to 0.3.
Each prediction model's effectiveness was evaluated using five-fold cross-validation. The positive samples were the known drug-side-effect associations, and the negative samples were the unobserved associations. As a result, we obtained 80,164 known associations between drugs and side-effects and 2,887,772 unknown associations. All positive samples were divided at random into five equal parts: four parts were used to train the prediction model, and the remaining part was used for testing. Negative samples equal in number to the positive training samples were randomly chosen for training, with the remaining negative samples used for testing.
The evaluation metrics include the area under the receiver operating characteristic (ROC) curve (AUC) [33,34], the area under the precision-recall (PR) curve (AUPR) [35], and the recall of the top-k candidates. The ratio of known associations to unobserved associations was approximately 1:36; evidently, a significant class imbalance existed between them. Thus, the AUPR, being more informative than the AUC in this setting, was also used to evaluate the predictive performance. We determined the recall rates of the top k ∈ {30, 60, …, 240} candidates as another measure of model performance, because biologists typically select drug-side-effect pairs from among these candidates for further relevant experiments.
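The top-k recall metric can be sketched as follows; the scores and labels below are illustrative:

```python
def top_k_recall(scores, labels, k):
    """Fraction of all positive pairs ranked inside the top-k predictions."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total = sum(labels)
    return hits / total if total else 0.0

scores = [0.9, 0.8, 0.3, 0.7, 0.1]   # predicted association scores
labels = [1, 0, 1, 1, 0]             # 1 = known association
r = top_k_recall(scores, labels, k=2)   # 1 of 3 positives in the top 2 -> 1/3
```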

Ablation Experiment
We conducted a series of ablation experiments to evaluate the contributions of the CET module, the MVF module, and the neighborhood node category-level attention mechanism (NCA) (Table 1). First, we removed the attention mechanism utilized to fuse the neighbor context encodings of the multiple types of neighbor nodes for the target node, and instead performed vector summation to obtain the context representation of the target node. Next, we trained each of the two modules (CET and MVF) alone to obtain the contextual representation and the pairwise attributes, respectively. The attribute vectors of a pair of drug and side-effect nodes were concatenated and passed through a fully connected network to obtain the association score. The complete model with the CET module, MVF module, and NCA obtained the highest AUC = 0.977 and AUPR = 0.351. In the absence of the CET module, the prediction performance decreased by 1.4% in the AUC and 14.2% in the AUPR compared with TCSD. In the absence of the rich local features obtained by the MVF module, the AUC decreased by 0.6% and the AUPR by 9.7% relative to TCSD. Performance also declined without the NCA. The contribution of the contextual encoding to the prediction performance was the largest; the main reason is that the Transformer-based encoding strategy propagates node properties between the drug and side-effect nodes, thereby learning the contextual information between nodes. The node-pair feature representation learned by the MVF module makes the second most important contribution: it enriches the local information of the node pairs in the process of building capsules, so the routing mechanism can better learn the importance of the capsules.

Comparison with Other Methods
Seven state-of-the-art approaches were compared with our model (TCSD) for predicting drug-side-effect associations: GCRS [26], idse-HE [27], SDPred [25], Galeaon's method [21], the random walk-signed heterogeneous information network (RW-SHIN) [19], Ding's method [17], and feature-derived graph-regularized matrix factorization (FGRMF) [20]. For a fair comparison, the hyperparameters of each model were set as suggested in the corresponding study. The training and testing times of TCSD and the compared methods are listed in Supplementary Table S2.
For each drug, we calculated the corresponding AUC and AUPR in each fold and took the average over the five folds of cross-validation as the final prediction result for that drug. The average AUC and AUPR values over the 708 drugs were then taken as the overall prediction performance of each method. As shown in Figure 4, TCSD obtained the highest AUC of 0.977, which was 0.9% and 2.0% higher than idse-HE and GCRS, respectively, 3.1% and 3.2% better than SDPred and Ding's method, respectively, 5.8% higher than FGRMF, 6.5% better than Galeaon's method, and 8.5% higher than RW-SHIN, the worst-performing method. For the mean AUPR over all drugs, TCSD achieved the best value of 0.351, which was 7.9%, 12.5%, 16.0%, 17.2%, 22.0%, and 25.2% higher than the values of the above methods, respectively.

Idse-HE did not perform as well as our method; a possible reason is that it ignores the semantic information of the various connections in the heterogeneous graph. Our approach and GCRS both achieved good performance, primarily because both construct multiple heterogeneous graphs and an independent learning module for each graph. This suggests that separately learning the topological information specific to each heterogeneous graph is necessary for improving prediction accuracy. SDPred, which is based on a multi-layer perceptron, and Ding's method, which is based on centered kernel alignment-based multiple kernel learning, both scored lower than GCRS. One possible reason is that neither method considers the topological structure of the drug-side-effect heterogeneous graphs. In addition, FGRMF and Galeaon's method had similar AUC and AUPR values, somewhat worse than the fourth-best method, Ding's. One possible reason is that both are shallow prediction models constructed with matrix factorization, which cannot capture the deeper, complex connections between drugs and side-effects. The performance of RW-SHIN was inferior to that of the other methods, possibly because it only builds a network of drug nodes without considering the topological information among side-effect nodes.
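The per-drug evaluation described above can be sketched as follows. This is a minimal pure-Python illustration of averaging per-drug AUC over folds; the function names and data layout are our own assumptions, not the paper's implementation, and the AUC is computed via the rank-sum (Mann-Whitney) formulation.

```python
def auc_score(labels, scores):
    """AUC via the Mann-Whitney formulation: fraction of positive-negative
    pairs in which the positive sample receives the higher score (ties 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean_auc_over_folds(fold_results):
    """fold_results: list of (labels, scores) pairs, one per cross-validation
    fold, for a single drug; returns that drug's average AUC over the folds."""
    aucs = [auc_score(labels, scores) for labels, scores in fold_results]
    return sum(aucs) / len(aucs)
```

The method-level score would then be the mean of `mean_auc_over_folds` across all 708 drugs, and analogously for AUPR.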
For each prediction method, 708 per-drug AUC (AUPR) values were obtained over the 708 drugs. To compare TCSD with each of the other methods, we applied paired Wilcoxon tests to these 708 paired results. With a p-value threshold of 0.05, the tests demonstrated that TCSD significantly outperformed the other six approaches (Table 2).

For the top k drug side-effect candidates, a higher recall indicates that more real drug-side-effect associations are included among these candidates. Our TCSD model consistently outperformed the other methods at different k thresholds, ranking 50.3% of the positive cases in the top 30 candidates, 65.4% in the top 60, 73.0% in the top 90, and 78.1% in the top 120. GCRS had higher recall rates than idse-HE for the top 30 and 60 candidates: the former ranked 47.0% and 59.6% of the positive samples, while the latter ranked 42.1% and 58.1%, respectively. Idse-HE achieved slightly higher recall rates than GCRS for the top 90, 120, and 240 candidates, ranking 67.1% and 73.9% for the top 90 and 120 candidates, versus 66.8% and 71.9% for GCRS (Figure 5). The AUC value of GCRS was very close to that of SDPred, but all of the recall rates of GCRS were higher than those of SDPred. As k increased from 30 to 120, SDPred ranked 41.8%, 54.9%, 62.3%, and 67.4%, respectively. Ding's method was not as good as SDPred, with corresponding recall rates of 35.5%, 48.2%, 56.3%, and 62.2%. The recall rates of FGRMF (32.8%, 45.2%, 52.5%, 58.1%) were slightly higher than those of Galeaon's method (32.3%, 43.6%, 51.7%, 56.8%). The lowest recall rates were obtained by RW-SHIN, at 23.7%, 34.3%, 41.3%, and 47.2%, respectively.
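The recall-at-k metric used above can be sketched in a few lines. This is a generic illustration of the metric, not the paper's code; the function name and argument layout are our own assumptions.

```python
def recall_at_k(labels, scores, k):
    """Fraction of the true (positive) drug-side-effect associations that
    appear among the k highest-scoring candidates."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total = sum(labels)
    return hits / total if total else float("nan")
```

For example, with two true associations among five candidates, placing one of them in the top 2 gives a recall of 0.5 at k = 2.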

Case Studies on Five Drugs
According to the World Mental Health Report in 2022, nearly one billion people across the world suffered from mental diseases. Therefore, to further demonstrate TCSD's ability to predict drug-side-effect associations, we analyzed five psychotropic drugs: Amitriptyline, Olanzapine, Clozapine, Aripiprazole, and Asenapine. First, using the model, we obtained association scores for the candidate side-effects of each drug and ranked them accordingly. Then, the top 15 potential side-effects for each drug were compiled and analyzed. The results are listed in Tables 3-7.

MetaADEDB is a comprehensive repository of clinically reported adverse drug events (ADEs), containing 744,709 associations between 8498 drugs and 13,193 ADEs [38]. Rxlist is a searchable database of more than 5000 drugs, covering information that has appeared in physician articles and on authoritative websites, such as U.S. Food and Drug Administration (FDA)-related side-effects, drug safety issues, and other prescribing information [39]. Drug Central collects information on the structure, pharmacological effects, and indications of active drug ingredients approved by the FDA and other regulatory agencies, as well as on ADEs [40]. SIDER is a database of marketed drugs and their adverse reaction records, covering 5868 side-effects and 139,756 association pairs for 1430 drugs [28].

As shown in Table 3, 12 candidates are supported by Drug Central, 14 are included in MetaADEDB, and the Rxlist and SIDER databases each contain 14 candidates. Table 4 lists the candidates for the drug Olanzapine; 12, 12, 15, and 15 candidates are recorded in Drug Central, MetaADEDB, Rxlist, and SIDER, respectively. In addition, constipation and vomiting in patients after taking the drug were confirmed in the literature [36]; we labeled these two candidates with "Literature" and added them to Table 4. As shown in Tables 5 and 6, for the drugs Clozapine and Aripiprazole, each has 13 candidates in Drug Central; they have 12 and 15 candidates in MetaADEDB, respectively, while Rxlist contains 12 candidates and SIDER includes 13. In addition, dizziness and blurred vision appeared with high probability after the drug had been used for over 3 months [37]; the side-effect "Blurred vision" was therefore labeled with "Literature" in Table 5. Similarly, the remaining drug, Asenapine, has 2, 7, 12, and 10 candidates in the four databases, respectively. Thus, TCSD has the ability to identify potential drug-related side-effect candidates and can screen reliable candidates for biologists to verify through subsequent wet-lab experiments.
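The case-study procedure, ranking a drug's candidate side-effects by predicted score and keeping the top 15, can be sketched as below. This is an illustrative sketch only; the function name, the dictionary layout, and the optional exclusion of already-known associations are our assumptions, not details from the paper.

```python
def top_k_candidates(score_map, known=(), k=15):
    """Rank one drug's candidate side-effects by predicted association score.

    score_map: {side_effect_name: predicted_score}; 'known' lists side-effects
    to exclude (e.g., associations already present in the training set)."""
    ranked = sorted(
        ((se, s) for se, s in score_map.items() if se not in known),
        key=lambda t: t[1],
        reverse=True,
    )
    return ranked[:k]
```

Each retained candidate would then be checked against MetaADEDB, Rxlist, Drug Central, SIDER, and the literature, as done in Tables 3-7.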

Figure 1. Figure 2.
Figure 1. Framework of the proposed TCSD prediction model. (a) Establish two drug-side-effect graphs according to two types of drug similarities and demonstrate their attribute matrices; (b) learn the context representations of the drug and side-effect nodes based on a graph transformer and two attention mechanisms; (c) construct the capsule network to learn the multi-view pairwise attributes.
c_i^{ρ,l−1} and c_m^{ρ,l−1} are the results of the (l−1)-th encoding layer for r_i and its neighbors, respectively, and c_i^{ρ,0} and c_m^{ρ,0} are their attribute vectors x_i^ρ and x_m^ρ, respectively. The remaining parameters are the weight and bias vectors, respectively. The normalized attention score β_{im}^{ρ,l} is then calculated.
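The fragment above appears to describe a softmax normalization of per-neighbor attention scores. One plausible reconstruction, consistent with standard graph-attention formulations, is given below; here e_{im}^{ρ,l} is assumed to denote the unnormalized attention score between target node r_i and its neighbor m in graph ρ at layer l, and N_i^ρ its neighbor set (these symbols are our assumptions, as the original equation was lost in extraction):

```latex
\beta_{im}^{\rho,l} = \frac{\exp\!\left(e_{im}^{\rho,l}\right)}{\sum_{m' \in N_i^{\rho}} \exp\!\left(e_{im'}^{\rho,l}\right)}
```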

Figure 3.
Figure 3. Illustration of learning the pairwise multi-view features of a drug-side-effect node pair with capsule networks.

Figure 4.
Figure 4. ROC curves and PR curves of our method and the compared methods for drug-side-effect association prediction.

Figure 5.
Figure 5. Recall rates of all the prediction methods at various top k values.

Table 1.
Performance results of the ablation experiments.

Table 2.
Results of the Wilcoxon tests comparing TCSD with the other six methods.

Table 3.
Top 15 candidate side-effects related to Amitriptyline.

Table 4.
Top 15 candidate side-effects related to Olanzapine.

Table 5.
Top 15 candidate side-effects related to Clozapine.

Table 6.
Top 15 candidate side-effects related to Aripiprazole.

Table 7.
Top 15 candidate side-effects related to Asenapine.