Interpreting the Mechanism of Synergism for Drug Combinations Using Attention-Based Hierarchical Graph Pooling

Simple Summary This paper introduces a novel graph neural network (a hierarchical graph pooling model), SANEpool, to effectively detect core sub-networks of significant genes for predicting the synergy score of drug/drug combinations in cancer. SANEpool successfully addresses the limitations of the un-transparency in the prediction process of previous computational AI models for drug synergy prediction, while providing the superior predictive performance than popular baselines on numerous drug-synergy prediction datasets. The success of SANEpool indicates that significant gene-gene interactions and gene-drug interactions play a crucial role in designing powerful deep learning models to provide accurate prediction and to reveal the mechanism of the synergy (MoS). Abstract Synergistic drug combinations provide huge potentials to enhance therapeutic efficacy and to reduce adverse reactions. However, effective and synergistic drug combination prediction remains an open question because of the unknown causal disease signaling pathways. Though various deep learning (AI) models have been proposed to quantitatively predict the synergism of drug combinations, the major limitation of existing deep learning methods is that they are inherently not interpretable, which makes the conclusions of AI models untransparent to human experts, henceforth limiting the robustness of the model conclusion and the implementation ability of these models in real-world human–AI healthcare. In this paper, we develop an interpretable graph neural network (GNN) that reveals the underlying essential therapeutic targets and the mechanism of the synergy (MoS) by mining the sub-molecular network of great importance. The key point of the interpretable GNN prediction model is a novel graph pooling layer, a self-attention-based node and edge pool (henceforth SANEpool), that can compute the attention score (importance) of genes and connections based on the genomic features and topology. As such, the proposed GNN model provides a systematic way to predict and interpret the drug combination synergism based on the detected crucial sub-molecular network. Experiments on various well-adopted drug-synergy-prediction datasets demonstrate that (1) the SANEpool model has superior predictive ability to generate accurate synergy score prediction, and (2) the sub-molecular networks detected by the SANEpool are self-explainable and salient for identifying synergistic drug combinations.


Introduction
Combinatorial drug therapy has been of crucial importance in modern clinical disease treatment and drug discovery [1].Synergistic drug combination can produce more beneficial combinatorial effects than each constituent, and the synergic behavior always allows the lower doses of the drugs in the combination relative to their individual potencies, thus reducing the induction of drug resistance [2] and overcoming the side effects [3][4][5] associated with the high doses of single drug usage.Hence, drug combination therapy provides a greatly promising avenue towards the treatment of the most dreadful multifactorial diseases, such as cancer [6][7][8][9], diabetes, and bacterial infections.In contrast to the synergism, the therapeutic efficacy of some drug combinations can be simply additive or even sub-additive.As such, there has been growing interest in investigating the synergy mechanism of drug combinations to distinguish the synergistic combinations from non-synergistic ones.
Frequently, the synergy of drug combinations is tested in pre-clinical model environments, such as high-throughput screening (HTS) instruments [10,11], where thousands of combinatorial experiments are simultaneously implemented under actionable hypotheses and conditions to profile the synergism.However, the testing space can be extremely massive due to the large amount of drug combinations, cell lines, dose choices, and patient samples, hence it can be impractical to traverse the whole testing space [12].Furthermore, the transition from some pre-clinical environments to the clinical practice sometimes can also cause failure [13].As such, various computational (AI) models [14][15][16][17] are developed to assist the synergy analysis of drug combinations.
Most computational AI models take massive omics data and chemical structure data as input and then adopt deep learning algorithms to predict the synergy score to determine the presence of the synergism.Several machine learning models, such as TreeCombo [16] and random forest [18], build ensemble trees to predict the synergy scores and have achieved impressive results.After that, numerous deep learning models were proposed in the domain to unleash the predictive power of neural networks.A large body of work, including DeepSynergy [17], MatchMaker [19], and CCSynergy [20], shows that wisely combining the drug profiles and gene expression profiles in specific cell lines as input features enables the vanilla multiple layer perceptron (MLP) to accurately predict the synergy scores of drug combinations, while TranSynergy [21] applies attention-based transformer architecture to boost the prediction performance.On the other hand, SDCNet [22] and DeepDDS [23] demonstrate that modeling the connections between drugs and genes can benefit synergy prediction and propose to encode the networks/graphs that consist of genes and drugs through graph neural networks (GNNs).
In addition to the predictive ability, the interpretability of deep learning models is desirable in real-world scenarios like the pharmacy industry and healthcare, as it allows us to incorporate human expertise in decision making to provide more robust conclusions.Currently, limited existing works provide interpretable predictions of drug synergy.Tran-Synergy [21] applies the post hoc interpretation framework [24] that computes the Shapley value of each gene through GradientExplainer and uses DeepExplainer to characterize its contribution to the final synergy prediction.Though post hoc interpretation mechanisms that produce interpretations after the model creation work well in some cases, the ante hoc interpretable model [25] is still missing in the domain of drug synergism analysis to inject interpretability from the beginning of the model design.Consequently, the objective of this paper is to develop deep learning models to generate accurate and interpretable synergic predictions, and we resort to the graph pooling methodology in graph neural networks (GNNs).
In recent years, GNNs have been the dominant architecture for analyzing graphstructured data, such as social networks [26,27], protein networks [28,29], circuit networks [30,31], etc.Most GNNs follow the neighborhood aggregation scheme that updates each node feature by propagating its neighboring node features to its current feature and have achieved impressive results on various graph learning tasks, ranging from node classification [32] and link prediction [33,34] to graph classification [35].In order to generate a subset of nodes or cluster of nodes for the prediction tasks, several graph-pooling models are proposed.DGCNN [36] proposes to sort nodes for pooling according to their structural roles within the graph.However, since it stacks multiple graph convolution layers to propagate information and then globally implement the graph down-sampling via a pooling module, the generated graph representation is inherently flat.In order to extract hierarchical graph representations, DiffPool [37] uses different GNNs to separately implement neighborhood aggregation and graph pooling, and it provides a framework to hierarchically pool nodes across a broad set of graphs.
Following this inspiration, we introduce a novel hierarchical graph pooling model, SANEpool (self-attention-based node and edge pool), for the interpretable drug synergism prediction task, which aims to reveal the drug combination synergism by systematically extracting the target gene sub-network that intrigues the synergic behavior.Various medical chemistry research has shown that cancer is driven by genetic and epigenetic alterations, many of which can be mapped into signaling pathways that control the survival and migration/invasion of cancer cells.As one previous signaling pathway analysis suggests [38], 89% of tumor samples had at least one driver alteration in one of ten cancer-related signaling pathways that is responsible for tumor development, while 57% and 30% had one and multiple potentially druggable targets, respectively.Another example [39] is that the drug combination of venetoclax and idasanutlin can generate antileukemic efficacy in the treatment of acute myeloid leukemia by inhibiting antiapoptotic Bcl-2 family proteins and activating the p53 pathway at same time.Thus, inhibited signaling targets analysis shows great potential of facilitating drug combination synergism discovery.Following this intuition, each SANEpool layer implements the standard graph convolution layer (GCN) to generate attention features that encode the gene (and drug) information as well as the topology information of the molecular network.Based on these attention features, the probability that a gene or an interaction (gene-gene interaction, gene-drug interaction) will cause the synergy performance is calculated, and then genes and interactions that are unlikely to influence the synergism of drug combination will be filtered out.The proposed model is composed of multiple SANEpool layers and will output the target gene sub-network for interpretable and robust synergy prediction.
We evaluate our SANEpool model on three popular drug-synergy-prediction datasets, which are constructed upon NCI ALMANAC [18], GDSC (Genomics-of-Drug-Sensitivityin-Cancer) [40], and O'Neil [5] experimental settings.Experimental results demonstrate that the SANEpool model achieves the current state-of-the-art performance for all datasets.Furthermore, through visualizations of the detected target gene sub-network of different cancer cell lines, we observe that the proposed model (SANEpool) can detect the salient target gene patterns that cause the synergic drug combinations, which reveals the synergism mechanism in drug combination discovery.

Graph Neural Networks
Graph neural networks (GNNs) have revolutionized the field of learning with graphstructured data and empirically achieved the current state-of-the-art performance in various graph learning tasks, ranging from node classification and link prediction to graph classification.Broadly, GNNs [35][36][37][41][42][43] follow a recursive neighborhood aggregation scheme where the node features from the neighborhood of each node are aggregated to update the node's feature.Such frameworks allows GNNs to capture the graph topology as well as node features, hence unleashing the representation learning ability among graphs.

Pan-cancer Biomarkers
The genotype-oriented therapies for pan-caner biomarkers have been approved by the US Food and Drug Administration.These biomarkers amplify our knowledge of genomic profiling across various malignancies by revealing the prevalence of certain oncogenic alternations, hence playing important roles in drug combination discovery.According to a previous study [44], 30% of recurrent alternations across tumor types from 10,000 patients with metastatic cancers are targetable, and various genotype-oriented therapies are detected based on genomic profiling.For instance, the neurotrophic receptor kinase (NTRK) family genes 1-3 were identified in various pediatric cancers.Then, a clinical trial of the Trk inhibitor larotrectinib demonstrated the antitumor activity and hence led to the usage of arotrectinib as treatment for cancers harboring NTRK fusions.

Machine Learning in Drug Synergy Prediction
The drug synergy analysis is beneficial as it provides a useful resource for novel predicted drug combinations.However, manually discovering the synergism in practice is still challenging due to the high cost and the limited number of synergistic drug combinations approved by the Food and Drug Administration.Hence, the computational model shows huge potential to find the mechanism of synergy (MoS) in a biologically meaningful manner.Currently, various computational models, ranging from unsupervised learning models [45][46][47] to supervised learning models [48,49], have been proposed for the purpose of predicting the synergy of drug combinations and have achieved expressive performance.Broadly, these computational models, such as DeepSynergy [17] and Matchmaker [19], take as input massive chemical descriptors of tested drug pairs and cell-line gene expression profiles and then use multi-layer-perceptron (MLP)-based deep learning models to predict the synergy score of drug combinations.Although these models effectively predict the synergy score of drug combinations, they are inherently not interpretable, while the interpretability is crucial for the real-world application.As the drug synergy has been reported to be largely determined by the biomolecular network topology [50], many deep learning models, such as DeepSignalingSynergy [14] and IDSP [51], incorporate the gene-gene interactions and gene-drug interactions into model design to allow the model make interpretable predictions that explain the underlying MoS.

Methodology
In this section, we introduce the proposed graph pooling model, self-attention-based node and edge Pool (SANEpool), on the basis of which we develop an interpretable graph neural network to detect the biologically meaningful gene sub-network for synergic and interpretable drug combination prediction.The key point of SANEpool is to contain the attention score (importance) of nodes and edges though the node features and the graph topology; then, the attention scores make it possible to filter out less important (less relevant) nodes and edges for prediction.In Section 3.1, we introduce the problem formulation of the interpretable drug prediction task.In Section 3.2, we develop the mechanism of SANEpool, and the overall interpretable model architecture is described in Section 3.3.The problem formulation and the architecture of SANEpool are illustrated in Figures 1 and 2, respectively.

Problem Configuration
In this work, we study the molecular networks of cancer drug combination therapies in an inductive manner, where the tested drug pairs are unseen during the training phase.The molecular networks contain drugs and genes in the signaling pathways.The objective is to predict the synergic score of each drug pair based on the molecular network.In order to make the prediction interpretable, SANEpool is proposed to detect the sub-gene network (i.e., the red gene nodes in Figure 1 which consist of a subset of genes in signaling pathways that are most relevant to the synergistic effect of the drug pair.Then, the detected sub-gene network provides insight into the molecular mechanism of resistant or sensitive responses to cancer drug combinations. Let G = (V, E) be the molecular network (graph), where V is the node set that contains gene nodes and drug nodes, E is the edge set that characterizes the interactions between nodes.For the notation convenience, we use A to denote the adjacency matrix of the graph.Since the molecular graph is inherently undirected and has no self-loop, adjacency matrix A is a symmetric matrix.We use X ∈ R n×h to denote the input node feature, where n is the number of nodes, and h is the dimension of input features.Hence, the graph can also be represented as the pair of the node feature and adjacency matrix such that G = (X, A).Furthermore, we use Z t ∈ R n×h t ) to denote the node representation in layer t, where h t is the dimension of node representation.

The Proposed SANEpool Model
The (self-)attention mechanism plays important role in various machine learning models, including natural language processing architectures [52][53][54], graph classification architectures [55], sequential prediction algorithms [56], adversarial learning models [57], etc.The attention mechanism allows input features to be the criteria for the attention itself [58] and thus can distinguish the relative importance between features during the information aggregation process.In order to incorporate the node features and graph topologies in the attention scores for the nodes and edge pooling, we follow the idea of SAGpool [59] and utilize a graph convolution layer to aggregate such information and to compute the attention feature matrix H t .
where Ã = A + I is the adjacency matrix with added self-loops, D is the corresponding diagonal degree matrix of Ã such that Dii = ∑ n j=1 Ãij , and f is the activation function.The matrix Θ t ∈ R h t ×h t+1 is the trainable parameter to coordinate the attention score of nodes and edges, where h t is the feature dimension in layer t.Various graph convolution layers [43,60,61] have been proposed; these GNN formulas can be used as substitution of Equation ( 1).All GNN layers follows the same information aggregation framework; hence, the extracted attention features H t contain information of node features as well as graph topologies.Next, we discuss how to compute the attention scores of nodes and edges based on extracted attention features H t .The node attention score is determined as the cosine of the angle between the attention feature vector and a trainable projection/parameter vector p t of size h t (Equation ( 2)).The attention score Att measures the probability that gene nodes will cause the synergism of drug combinations.Hence, we can sort node attention scores and adopt the top-k selection technique to hierarchically select the gene sub-network for the purpose of synergism prediction, and such a process is formulated as (2) The core idea of the proposed SANEpool is to filter out 'less important' genes (and connections), and then only kept genes (and connections) are used to predict the synergism of drug combinations.Equation (2) indicates that the attention score Att is normalized and has a value in the set [−1, 1].Thus, a gene with larger attention score will be more likely to be kept in the top-k selection process (i.e., Equation ( 3)), indicating that the gene has a higher probability to be selected to predict/interpret the synergism of drug combination.
In the top node selection process (i.e., Equation ( 3)), when k ∈ N, we adopt the top-k node selection method as DGCNN [36].On the other hand, we can also implement node selection method proposed by [62] to retain a proportion of nodes when k ∈ (0, 1].Based on selected node index idx, we can construct the graph downsampling as Zt+1 = Z t (idx, :) and A t+1 = A t (idx, idx).
For general graph learning problems, it can be difficult to find the reasonable predefined indexes for nodes in graphs, as the problem is equivalent to the graph isomorphism problem, which is known to be NP-hard.However, for gene networks, since each gene at most appears once in each network, we can universally assign each possible gene (that appears in at least one network/graph in the data set) a unique (pre-defined) index.For instance, we can collect all possible genes and lexicographically sort their names, and then the order of genes in the sorting operation can be used as the index.Consequently, the pre-defined index is equivalent to the gene.It does not matter if we change the predefined index system once these indexes can injectively distinguish genes and are consistent among gene networks.The main advantage is to reduce the space complexity.The gene network usually contains lots of genes (i.e., a very large n), then, to store all possible edge pairs in the layer t, we require a tensor T with the shape of R n×n×2×h t , where h t is the size of node representation vectors in layer t and usually is selected from the set {32, 64, 128, 256, 512}.Then an MLP is used to learn the edge attention matrix from T. In contrast, using pre-defined indexes can reduce the size of T to R n×n , and no MLP is needed in this step.Hence, we can significantly shrink the model/algorithm complexity.
Let G t+1 = (Z t+1 , A t+1 ).Since the proportion of retained information (attention score) of nodes in G t+1 are different, the connectivity strength between nodes can be different.Hence, we should also provide a consistent mechanism to characterize such bias.One intuitive approach is to apply the graph attention mechanism based on the extracted attention features H t : where the symbol indicates the concatenation operation.Due to the universal approximation theorem [63,64], such formulations can approximate any continuous function that measures the connectivity strength.However, a major limitation of this framework is the memory cost.For large-scale graphs, the memory usage to compute the attention score of edge may limit the practical ability of the proposed model.Luckily, the molecular network (graph) takes advantage that each gene node in the graph has a corresponding predefined index (i.e., the gene name), which serves as a canonical node order.Hence, we can directly model the interaction strength in each layer t through a trainable parameter matrix W t .The the edge weight in the subgraph G t+1 is trainable through the equation The overall architecture of the proposed interpretable model takes the hierarchical graph pooling structure [59,62].Figure 2 illustrates the overall architecture, and details are provided in Appendix E. The model stacks multiple SANEpool layers followed by a graph convolution layer to hierarchically extract a key sub-graph from the input graph.In other words, the proposed SANEpool layer is used to downsample the important sub-network (sub-graph).After the downsampling process, another GNN (graph convolution) layer is used to aggregate information based on sub-graph G t+1 (Equation ( 5)) to update the node representation.
where Z t [idx t , :] is the node representation matrix of the downsampled sub-graph in the t-th SANEpool layer.Then, the output of the last graph convolution layer is used for the prediction task.

Readout Mechanism
Inspired by IGMC [65], the proposed model takes the node representation of two drug nodes to make prediction.The graph convolution framework indicates that such node representations encode the enclose rooted subtrees around the drug nodes in the pooled graph, hence representing the relations and interactions between drugs and genes.Ideally, we hope that the readout phase should be invariant to the drug node order, as the same drug pairs always have the same clinical performance regardless their order.Let u 1 , u 2 be the output of two drug nodes; then, the readout layer follows the factorization decoder in decagon [19]: where D ∈ R t L ,t L is the trainable parameter matrix in the decoder, where t L is the dimension of node representations in the last graph convolution layer L. The parameter matrix D models the interaction effects between every two dimensions in drug representations u 1 and u 2 .
It can be shown that the above readout function is invariant to the order of drugs.First, since the vectors/embeddings of two drugs, u 1 and u 2 , are generated by GNN message passing layers, which are invariant to the order of nodes/vertices in the input graphs, u 1 , u 2 will not change if we permute the order of drugs (even the order between two drugs and genes).Second, Equation ( 6) uses a symmetric function to compute the synergy score based on u 1 and u 2 and thus is permutation-invariant to the order of two drugs.

Comparison to Other Graph Pooling Models
Both SANEpool and Sortpool [36] propose to sort nodes according to the structural role (i.e., 'importance') of nodes in the graph.However, DGCNN is inherently flat, while SANEpool aggregate information in a hierarchical way.Thus, SANEpool is capable of capturing more informative global features for the downstream prediction task.On the other hand, SANEpool and Diffpool [37] learn graph representation in a hierarchical way.However, Diffpool focuses on the relational analysis of node clusters, while SANEpool detects the critical sub-network based on the downstram tasks.Hence, SANEpool supports the pathway-based analysis in the drug combination prediction, thus providing interpretable results for healthcare.

Comparison to Other GNNs for Drug Synergy Prediction
The proposed SANEpool model, SDCNet [22], and DeepDDS [23] share the same motivation that utilizes GNNs to capture the useful relational information of drugs and genes in the drug synergy prediction.However, they use GNNs to extract different types of relational information.The proposed SANEpool model encodes the important interactions between drugs and target genes in cell lines; SDCNet learns the cell-line-specific drug interactions, while DeepDDS encodes the molecular graph of each drug, which is generated by RDKit [66] based on the drug's chemical structure.Consequently, compared to SANEpool, SDCNet and DeepDDS rely on more drug profile information.Furthermore, SDCNet is used for the simpler classification task which aims to predict the synergistic effects (0/1 classification) instead of accurately predicting the synergy score.

Experiments
We evaluate the predictive ability of the proposed SANEpool by comparing the accuracy of the estimated synergy score against popular baselines.Furthermore, to illustrate the synergism detected by SANEpool, we also implement visual analytics and statistical analysis to show that the proposed SANEpool can detect significantly different causal gene subnetworks for synergic drug combinations and non-synergic drug combinations in each cell line.Overall, these datasets take input molecular networks/graphs, which consist of drug combinations/pairs and genes in the target cell lines, where gene expressions are used as node features in the input molecular networks/graphs.The objective is to predict the score/synergy score of each drug/drug pair.In all datasets, the edges/interactions between drugs and genes are collected from the Drug-Bank database (version 5.1.5,released 3 January 2020) [67], while the edges/interactions between genes are collected from the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [68] based on the physical signaling interactions from documented medical experiments.The synergy score corresponding to each drug pair is computed as the average combo-score [39] with different doses on a given tumor cell line.The difference between datasets is the source of drug combinations and signaling pathways.

NCI-DCD Dataset
The NCI-DCD ensemble genes from 46 well-known signaling pathways (45 "signaling pathways" + cell cycle) [69] in KEGG.Drug combinations are collected from the DrugBank database [67], whose target genes are included in the aforementioned 46 signaling pathways, and the combo-scores of drug combinations are available from the NCI Almanac dataset.We provide details of the 46 signaling pathways and 21 selected FDA-approved drugs in Appendix B. In summary, NCI-DCD contains 5658 graphs/networks.Each graph/network has 1364 genes and two drugs, while containing about 25,000 edges that connect genes and drugs.

O'Neil-DCD Dataset
In O'Neil-DCD, signaling pathway information is also formulated based on the gene expression data of 1047 cancer cell lines in the Broad Institute Cancer Cell Line Encyclopedia (CCLE) database [70].Drug combinations and their corresponding synergy scores are collected from O'Neil datasets [5], whose target genes are included in KEGG database [68].
In summary, there are in total 4637 graphs/networks, and each contains two drug nodes and 1823 gene nodes.

GDSC-SDD Dataset
In GDSC-SDD, signaling pathway information is formulated based on the gene expression data of 791 cancer cell lines in the Broad Institute Cancer Cell Line Encyclopedia (CCLE) database, while the corresponding drug/cancer-cell-line response data set are available in the Genomics of Drug Sensitivity in Cancer (GDSC) database.In the experiment, there are in total 16,761 graphs/networks, and each contains a drug node and 969 gene nodes.In the dataset, since there is only a single drug node in the input network/graph rather than a pair of drug nodes, we can not use Equation ( 6) as the readout function.Instead, we use a two-layer MLP that takes as input the learnt embedding of the drug node in this setting.
The three benchmark datasets (i.e., NCI-DCD, O'Neil-DCD, GDSC-SDD) contain different numbers of networks/graphs, where each drug or drug pair on each cell line represents one network/graph.In each network/graph, gene nodes use three features as the input/initial node features: gene expression value and two 0/1 indicators to indicate whether the gene is connected to two drugs, while drug nodes set all these values as −1.
Then, when two networks correspond to different cell lines, the same gene will have different gene expression values as the initial node feature in these two networks.Furthermore, each network contains a gene if and only if (1) its gene expression data is available and (2) the gene is available in the KEGG dataset to track the interactions between genes.Then, since NCI-DCD, O'Neil-DCD, and GDSC-SDD have different resources of cell-line-based gene expression data (NCI-DCD dataset uses gene expression data from its website, and the gene expression data were collected from CCLE for the other two datasets) and drug (pair)/synergy data, networks in the different datasets will contain different sets of genes.

Baseline Methods
In popular deep learning models for drug-synergy prediction, we select three popular baselines: DeepSynergy [17], DeepSignalingSynergy [14], and TransSynergy [21].Compared to other methods, DeepSynergy uses additional drug profile features.To make a fair comparison, we mask out additional input features with 0 embeddings.
We also compare SANEpool with six widely adapted GNN baselines.These baselines can be categorized into two types: (1) flat GNNs: Graph Attention Network (GAT) [71], Deep Graph CNN (DGCNN) [36], Graph Isomorphism Network (GIN) [72], and graph convolutional network (GCN) [61] and (2) popular graph pooling models: Diffpool [37] and SAGpool [59].For GCN, GIN, and GAT, we stack three graph convolution layers with 64 output feature channels and concatenate sum-pooled features from the layers to generate the graph representation, which is then passed to an MLP to predict the graph label.
In the SANEpool model, we keep 200 nodes in the last layer and keep 90% edges in each SANEpool layer.The SANEpool model takes two graph convolution layers and two SANEpool layers.To provide robust model performance, we perform five-fold crossvalidation and report the accuracy averaged over five folds and the standard deviation of validation accuracies across the five folds.

Predictive Performance
In the experiment, we demonstrate the effectiveness of SANEpool in predicting the synergy score of drug/drug combinations.Table 1 illustrates the experimental results.The experimental results indicate that SANEpool can accurately predict the synergy score of drug/drug combinations and achieve the state-of-the-art predictive performance among these competitive baselines.
Furthermore, (1) we find that SANEpool significantly improves the performance over flat GNNs (i.e., GIN, GCN, GAT, DAGNN), which indicates that hierarchical graph representation learning technique in molecular networks can provide informative graph embedding with biologically meaning.(2) SANEpool outperforms other hierarchical graph pooling algorithms, and this observation indicates that incorporating edge information in the graph pooling is a potential future direction in the molecular network analysis, where graphs always have thousands of high-centrality nodes.

Interpretability
The interpretability of deep learning models has been a major limiting factor for the use of these models in real-world drug-combination synergy analysis since most usage cases require explanations of the features used in the model.There is a natural trade-off between the interpretability and the accuracy of decision models in application, and hence it is critical to find the balance between them.Currently, there are multiple deep learning models which take massive drug chemical structure information and predict the synergy score in a fully untransparent manner, such as DeepSynergy and MatchMarker.Although they can achieve expressive predictive performance, the lack of interpretability somehow limits the power of these models in real-world applications.Among the selected basslines, DeepSignalingSynergy is constructed based on the standard multiple layer perceptron model, and hence it is inherently not interpretable.Similarly, GCN and GIN follow the basic neighborhood aggregation framework that aggregates information from the neighborhood of each node and then updates the node feature of the node based on the aggregated feature and node feature itself, while such aggregation processes are not interpretable either.In contrast, attention-based deep learning models, such as GAT and TransSynergy, can provide interpretable conclusions.For instance, GAT provides interpretability by measuring the connection strength between genes and graphs in the input biomolecular graphs through the attention mechanism, and we can analyze the effect of drug nodes based the connection strength.Analogous to these interpretable models, in the next section, we will show that the interpretability of the proposed SANEpool model comes from its ability of computing the 'synergic importance' of each gene.The 'synergic importance' is computed as the expectation of its effect on the synergic score of drug combinations.Specifically, the effect can be measured by whether it is used in the synergic score prediction process (0/1 variable, i.e., whether the gene is selected by SANEpool model) or we can multiply the 0/1 variable with the synergy score.In this paper, we use the former definition.Hence, the 'synergic importance' can be used to detect genes with closer correlations to synergic drug combinations in each cell line by selecting genes whose 'synergic importance' is larger than a given threshold.

Statistical Analysis and Visualizations
In the experiment, we implement statistical analysis and visual analytics to reveal the interpretability of the SANEpool model.Here we use NCI-DCD as an example.Previous works [20,40] have emphasized that drug synergy is highly cell-line-specific/contextspecific.Hence, we perform the cell-line-specific analysis.For each cancer cell line, there are multiple drug combinations targeting the cell line, some of which are synergic, while others are not synergic.For each drug combination, SANEpool can select top k genes for the prediction.Hence, we define the synergic/non-synergic importance score of a gene as the proportion of the gene is used by the SANEpool model in the prediction when input drug pairs are synergistic/non-synergistic.For instance, if a gene node is never detected by any synergistic drug combination that targets on a specific cancer cell line, then its synergic importance score is 0 and is never used to predict the synergy score of synergic combinations.Figure 3 compares these scores computed by SANEpool, and it illustrates that the patterns of synergic importance scores and non-synergic importance scores across genes are different in each cell line.
Next, we should decide whether the detected gene sub-networks for synergic drug combinations are significantly different from those of non-synergistic drug combinations.In order to do so, we can compare the distribution of the synergic importance scores on genes and the distribution of the non-synergic importance scores.If these two distributions are significantly different, we can infer that the detected gene sub-networks are also significantly different.Hence, we implement the Kolmogorov-Smirnov test (K-S test) to compare these distributions.In the K-S test, the null hypothesis assumes that two distributions are the same, and it computes a D statistic as well as a p-value corresponding to the D statistic.Then, we reject the null hypothesis if the p-value is less than the significance level (0.05).We provide details of the K-S test and relevant statistics in Appendix C. We implement the cell-line-based test.The cell-line based K-S test results are provided in Table 2, and it shows that the K-S test is significant for each cell line.Consequently, we use the difference in the synergic importance score and non-synergic importance score of each gene to determine whether the gene is selected in the core gene sub-network.That is, the gene is obtained in the core gene sub-network if the computed value is above a given threshold, like 0.1.In addition to the statistical analysis, we also use heat maps to show the difference between detected sub-networks of synergic drug combinations and non-synergic drug combinations.Here, we provide examples on cell line SF-295 and cell line K-562 in Figure 3, and more examples are provided in Appendix A. In the heatmap, the values (synergy > 0, synergy < 0) assigned to each gene are the cell-line-based synergic importance and nonsynergic importance scores, which indicate the proportion that the gene is included in the gene sub-network (detected by SANEpool) of synergic drug combinations (non-synergic drug combinations) targeting the cell line.The synergy value of a gene measures the possibility that the information of the gene, such as the gene expression and gene copy number, contributes to the prediction process of the drug synergy score.Hence, the difference between cell-line-based synergy value (estimated probability that the gene is involved in the detected sub-network of drug combinations with synergy > 0) and cell-linebased non-synergy value (estimated probability that the gene is involved in the detected sub-network of drug combinations with synergy < 0) can reflect the synergic performance of the gene.That is, a larger difference indicates the gene contributes more to the synergic drug combination than non-synergic drug combinations.For instance, for cell line SK-562, the top 10 detected genes are SIN3A, ETS2, WNT10B, SLC8A1, MTOR, KLF2, RGS2, SESN3, NRG1, TNFRSF11A.Furthermore, these heatmaps also show that the difference between the detected genes of synergic drug combinations and of non-synergic drug combinations are significant.
Furthermore, we also visualize the interactions of drugs and genes in the detected core gene sub-network in Figures 4 and 5 to show that synergic drug combinations are more likely to target on the detected core gene sub-networks in each cell line.Figure 4 focuses on specific cell lines, while Figure 5 combines all cell lines and drug combinations.For each cell line, we plot all genes (i.e., red nodes) in the sub-networks (detected by SANEpool) of synergic drug combinations and then randomly sample synergic drug combinations (i.e., purple nodes) and some non-synergic drug combinations (i.e., blue nodes).Figure 4 illustrates that drugs in the non-synergic combinations are very unlikely to target on the core gene sub-network in each example cell line (e.g., SF-295, K-562).Figure 5 indicates that this observation can be extended to other cell lines.

Conclusions
In this paper, we have proposed an interpretable GNN architecture called SANEpool (self-attention-based node and edge pool) to predict the synergy score of drug combinations and to investigate the underlying mechanism of the synergy (MoS) by detecting salient molecular sub-networks.For each cell line and each drug combination, SANEpool can detect a specific sub-network, and SANEpool evaluates the contribution of each gene to synergic drug combinations based on all detected (cell-line-based) sub-networks.Hence, cell-line-specific essential signaling gene targets are identified by SANEpool.Furthermore, our observation also indicates that most synergistic drug combinations inhibit the core signaling network detected by SANEpool.The current work is limited by the number of drug combinations and cell lines.In future work, more drug combination datasets with multi-omic data will be integrated to uncover the mechanism of the synergy of effective drug combinations.

Figure 1 .
Figure 1.Overview of the problem formulation.The objective is to systematically detect gene subnetworks to predict the synergy score of the input drug pairs and to explain the drug combinatorial synergies across a large group of drug combinations and cell lines.

Figure 2 .
Figure 2. The proposed SANEpool layer and the overall architecture.SANEpool layer incorporates node features and graph topologies through a GNN layer, and then the output is used to compute the attention score of nodes and edges, on the basis of which we can detect important nodes and edges through top-k sorting or threshold filtering.The overall model takes a hierarchical graph pooling architecture.

Figure 3 .
Figure 3. Synergic/non-synergic importance scores of all genes in cell line SF-295 and cell line K-562.X-axis is the gene index.The same genes in different cell lines share the same index.

Figure 4 .
Figure 4. Visualization of drug-gene interactions on selected cell lines.In these graphs, red nodes represent genes in the subnetwork of all synergic drug combinations on the cell line, purple nodes are synergic drug pairs, blue nodes are randomly non-synergic drug pairs.

Figure 5 .
Figure 5.The figure describes the interactions of synergic drugs combinations and genes in the detected core gene sub-network.Synergic drug nodes are visualized as purple nodes, while red nodes represent genes in the detected core gene sub-network.
• W t (idx t , idx t ), where • denotes the Hadamard product operation.The advantage of this formulation is discussed in Appendix D.

Table 1 .
Performance evaluation.Best results are in bold.

Table 2 .
Cell -line-based K-S test results.