Chemi-Net: A Molecular Graph Convolutional Network for Accurate Drug Property Prediction

Absorption, distribution, metabolism, and excretion (ADME) studies are critical for drug discovery. Conventionally, these tasks, together with other chemical property predictions, rely on domain-specific feature descriptors, or fingerprints. Following the recent success of neural networks, we developed Chemi-Net, a completely data-driven, domain knowledge-free, deep learning method for ADME property prediction. To compare the relative performance of Chemi-Net with Cubist, one of the popular machine learning programs used by Amgen, a large-scale ADME property prediction study was performed on-site at Amgen. For all 13 data sets, Chemi-Net resulted in higher R2 values compared with the Cubist benchmark. The median R2 increase rate over Cubist was 26.7%. We expect that the significantly increased accuracy of ADME prediction seen with Chemi-Net over Cubist will greatly accelerate drug discovery.


Introduction
The four essential processes of drug absorption, distribution, metabolism, and excretion (ADME) all influence the performance and pharmacological activity of potential drugs. Over the years, the pharmaceutical industry has collected experimental ADME properties for many compounds, and these data have been used to predict the ADME properties of new compounds. As such, ADME property prediction can be particularly useful in the drug discovery process for removing compounds that are more likely to have ADME liabilities during downstream development.
Inspired by the huge success of deep neural networks (DNNs) in computer vision, natural language processing, and voice recognition, and based on their remarkable capability of learning concrete and sometimes implicit features [1], we hypothesized that DNNs could be used in drug ADME property prediction. In this paper, we extend the use of traditional statistical learning methods and construct a multi-layer DNN architecture, named "Chemi-Net", to predict ADME properties of molecular compounds.
Applying DNNs to the prediction of ADME properties has been previously reported by Ma et al. [2], Kearns et al. [3], and Korotcov et al. [4], who all demonstrated accuracy improvements with DNNs over other traditional machine learning methods. However, the core challenge of ADME prediction using DNNs is that, unlike images, which can usually be represented as a fixed-size data grid, molecular conformations are generally represented by a graph structure. This structured format is heterogeneous among molecules, which is a major problem for many learning algorithms that expect homogeneous input features. Several methods have been developed to alleviate this problem. Previous research mainly focused on transforming the graph structure of molecules into a fixed-size set of feature descriptors. These descriptors can then be easily used by existing machine learning algorithms. Another popular method is the use of molecular fingerprints, such as Extended-Connectivity Fingerprints (ECFP) [5]. This method encodes the neighboring environment of heavy atoms in a compound as a hashed integer identifier, with each unique identifier corresponding to a unique compound substructure. Using this method, a compound is described as a fixed-length bit string, with each bit indicating whether a certain substructure is present in the compound. Such a fingerprint-based representation makes learning from graph-structured molecules possible. Neural network-based methods with fingerprint inputs have also been developed following recent advances in deep learning techniques, and have been shown to significantly improve on current Random Forest-based models [2]. However, the fingerprint-based method suffers from a fundamental issue in that the space required for fingerprints can be very large. Hence, the resulting fingerprints are very sparse. The information that fingerprints encode is also noisy. Together, these factors limit the performance of fingerprint-based representations.
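The fixed-length bit-string idea described above can be illustrated with a toy sketch. It is not the ECFP algorithm itself: the integer identifiers below are made up, and folding by a modulo operation is an assumption chosen only to show how hashed substructure identifiers land in a fixed-size vector and how two substructures can collide on the same bit.

```python
def fold_fingerprint(substructure_ids, n_bits=16):
    """Fold integer substructure identifiers into a fixed-length bit list."""
    bits = [0] * n_bits
    for ident in substructure_ids:
        # Different identifiers may map to the same position (a collision).
        bits[ident % n_bits] = 1
    return bits


# Hypothetical identifiers for two molecules of different size.
small_molecule = [101, 2050, 77]
large_molecule = [101, 2050, 77, 93, 4096, 12]

fp_small = fold_fingerprint(small_molecule)
fp_large = fold_fingerprint(large_molecule)

# Both fingerprints have the same length regardless of molecule size.
print(len(fp_small), len(fp_large))
# Six identifiers set only five bits because 77 and 93 collide.
print(sum(fp_small), sum(fp_large))
```

Real fingerprints use far more bits than this toy example, which is where the sparsity discussed above comes from.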
Recently, there has been growing interest in using neural networks to directly obtain a representation of a compound ligand before applying other neural network layers to build the predictive models. These methods transform a molecule into a small and dense feature vector (embedding), which is friendlier to downstream learners. They include a string-based representation of molecules [6], a graph convolution architecture to model circular fingerprints [7], and the Weave module, in which atom and pair features are combined and transformed through convolution-like filters [8].
Studies to date that have applied DNNs to ADME predictions have shown that multi-task deep neural networks (MT-DNNs) have advantages over traditional single-task methods [2,3]. For example, MT-DNNs take advantage of the ability of neural networks to support a combined model that has predictive power for multiple activities and is trained simultaneously with data from different activity sets. The enhanced predictive power of MT-DNNs had not been clearly explained until Xu et al. [9] found that an MT-DNN borrows "signal" from molecules with similar structures in the training sets of the other tasks. They also found that an MT-DNN outperforms the single-task method if the different data sets share certain connections and the activities across different sets have non-random patterns.
The potential application of MT-DNNs in pharmaceutical drug discovery is reviewed by Ramsundar et al. [10]. In this review, the authors confirmed the robustness of MT-DNNs and also suggested that MT-DNNs should be combined with advanced descriptors, for example, descriptors developed by graph convolutional methods, to enhance the performance of an MT-DNN.
Our current application features a molecular graph convolutional network combined with the MT-DNN method to further boost prediction accuracy. To the best of our knowledge, there are no published studies that have used these combined methods. In addition, Chemi-Net implements a novel dynamic batching algorithm and a fine-tuning process to further improve the stability and performance of trained models. In this study, a large-scale ADME prediction test was carried out in collaboration with Amgen. The test involved five different ADME tasks with over 250,000 data points in total. The test was conducted in a restricted environment so that the evaluation was only carried out once on the testing data set. Our findings showed significant performance advantages with Chemi-Net over existing Cubist-based methods.

Results and Discussion
MT-DNN method of Chemi-Net improves predictive accuracy compared to Cubist
Table 1 and Figure 1 show the overall test set prediction accuracy comparison between Chemi-Net and Cubist. The performance of models developed by different algorithms is highly dependent on the size of the data set, the type of endpoint, the type of model, and the molecular descriptors used. For all 13 data sets, Chemi-Net resulted in higher R2 values compared with the Cubist benchmark. With the single-task method, larger and less noisy data sets yielded greater improvements than the smaller and noisier data sets. Additional accuracy improvement was achieved with MT-DNN.
Abbreviations: CYP3A4, cytochrome P450 3A4; HLM, human microsomal clearance; PXR, pregnane X receptor.

Prediction performance and compound similarity
We hypothesized that, as with traditional machine-learning methods, deep learning performance is affected by similarities between the training set and test set. To investigate this further, the similarity of compounds within the training set and the similarity between training and test sets were calculated. Similarity was calculated using molecular fingerprints and the Tanimoto method [11]. The prediction models were challenged by the test sets, which contained newer compounds and novel chemotypes. For all 13 data sets, the average similarity within training sets was 0.878, and the average similarity between training and test sets was only 0.679.
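As a minimal sketch of the Tanimoto calculation mentioned above, the following function computes the coefficient for two fingerprints given as collections of "on" bit positions. The bit positions are hypothetical, and returning 0.0 for two empty fingerprints is a convention assumed here, not something stated in the study.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


# Hypothetical on-bit positions for a training and a test compound.
train_compound = [1, 4, 7, 9]
test_compound = [1, 4, 8, 9, 12]

# 3 shared bits out of 6 distinct bits.
print(tanimoto(train_compound, test_compound))  # 0.5
```

A value of 1.0 means identical fingerprints, so the lower average between training and test sets reported above indicates that the test compounds are structurally less familiar to the models.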

Comparison between Chemi-Net's descriptors and Amgen's traditional property and molecular keys descriptors
Chemi-Net applies molecular graph convolutional networks to generate descriptors on a three-dimensional (3D) level based on simplified molecular-input line-entry system (SMILES) strings.
Over the past 10 years, Amgen has used a set of 800 more "traditional" one-dimensional (1D) and two-dimensional (2D) descriptors based on physical properties, molecular keys, etc. In our current study, we compared the two sets of descriptors by using the same ST-DNN methods in Chemi-Net (Figure 6). Interestingly, the Amgen "traditional" descriptor set and the Chemi-Net descriptor set performed similarly in some data sets (i.e., solubility [HCl and SIF] and PXR [subsets 2, 5, and 6]). For large and relatively high-quality data sets (e.g., HLM, CYP3A4), the Chemi-Net descriptor set performed better than the Amgen descriptor set. In contrast, for small and noisy data sets (e.g., PXR and bioavailability), the Amgen descriptor set performed better.
Abbreviations: CYP3A4, cytochrome P450 3A4; HLM, human microsomal clearance; HCl, hydrochloric acid; PBS, phosphate-buffered saline; PXR, pregnane X receptor.

Conclusion
In this proof-of-concept study, we report for the first time the use of a molecular graph convolutional network combined with the MT-DNN method (Chemi-Net) to predict drug properties in a series of industrial-grade data sets. The major improvements of this method are two-fold. First, instead of relying on preset descriptors (features) as in previously reported studies [2], it uses a graph convolution method to extract features from the SMILES representation of each compound. Second, the multi-task DNN method further improves on the individual models, each of which is limited by the smaller number of data points in an individual data set. Given the clear performance improvement across all assay types, we foresee wider application of our approach in drug discovery tasks.
In the model presented in this paper, the input molecule is represented by a set of distributed representations assigned to its atoms and atom pairs. Each atom and atom pair is assigned a dense feature map, with $A_a$ defined as the feature map of atom $a$, and $P_{a,b}$ defined as the feature map of the atom pair $a$ and $b$. Typical atom features include atom type, atom radius, and whether the atom is in an aromatic ring. Typical atom pair features include the inter-atomic distance and the bond orders between two atoms. The input molecule is then represented by a set of atom features $\{A_1, A_2, \ldots, A_n\}$ and atom pair features $\{P_{1,2}, P_{1,3}, \ldots, P_{n-1,n}\}$.
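The atom and atom-pair representation described above could be held in a structure like the one sketched here. The class name, field layout, and every numeric value are hypothetical and chosen for illustration; treating any two atoms with a stored pair feature as neighbors is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class MoleculeGraph:
    """Atom feature maps plus atom-pair feature maps for one molecule."""
    atom_features: list   # atom_features[a] is the feature vector of atom a
    pair_features: dict   # {(a, b): feature vector}, stored once with a < b

    def pair(self, a, b):
        """Look up a pair feature regardless of argument order."""
        return self.pair_features[(min(a, b), max(a, b))]

    def neighbors(self, a):
        """Atoms that share a stored pair feature with atom a."""
        return sorted(y if x == a else x
                      for (x, y) in self.pair_features if a in (x, y))


# Hypothetical 3-atom fragment (C-C-O); all numbers are illustrative only.
# Atom vector: [atomic number, radius in angstroms, aromatic flag]
# Pair vector: [inter-atomic distance in angstroms, bond order]
mol = MoleculeGraph(
    atom_features=[[6, 0.77, 0], [6, 0.77, 0], [8, 0.66, 0]],
    pair_features={(0, 1): [1.52, 1], (1, 2): [1.43, 1]},
)

print(mol.neighbors(1))   # atoms paired with the central carbon
print(mol.pair(2, 1))     # same entry as pair(1, 2)
```

Storing each pair once keeps the pair features symmetric by construction, which matches the idea that a pair feature belongs to the two atoms jointly rather than to a direction.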

Methods
After the input atom-level and atom pair-level features are assembled, they are combined to form a molecule-shaped graph structure. A series of convolution operators is then applied to the graph; each operator performs a convolution operation that transforms the atom feature maps. To enable position-invariant handling of atom neighbor information, the convolution filters for all atoms share a single set of weights. The output of the convolution layers is a set of representations, one for each atom. The pooling step reduces the potentially variable number of atom feature vectors into a single fixed-size molecule embedding. The molecule embedding is then fed through several fully connected layers to obtain a final predicted ADME property value.
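The pooling step can be sketched as follows. The text does not say which pooling function is used, so concatenating an element-wise max and an element-wise mean is an assumption made here; the point is only that molecules with different atom counts map to embeddings of the same size, independent of atom order.

```python
def pool_atoms(atom_vectors):
    """Reduce a variable number of atom vectors to one fixed-size embedding."""
    dim = len(atom_vectors[0])
    n = len(atom_vectors)
    maxes = [max(v[i] for v in atom_vectors) for i in range(dim)]
    means = [sum(v[i] for v in atom_vectors) / n for i in range(dim)]
    return maxes + means  # length is 2 * dim, whatever the atom count


# Hypothetical atom representations for a 2-atom and a 4-atom molecule.
small = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
large = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 3.0]]

print(pool_atoms(small))
print(len(pool_atoms(small)), len(pool_atoms(large)))
```

The fixed-size output is what allows ordinary fully connected layers to follow the convolution layers.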

Convolution operator
The convolution operator is inspired by the Inception [13] and Weave [8] modules. The overall convolution operator structure is depicted in Figure 8. The inputs of this operator are the feature maps of the atoms and atom pairs. In this operator, the feature map of each atom is updated by first transforming the features of its neighbor atoms and atom pairs, then reducing the potentially variable-sized feature maps to a single feature map by using a commutative reducing operator. Importantly, atom pair features are never changed throughout the process. The method used to update the feature map of each atom is the same for all atoms. The updates are formulated with shared weights to achieve position-invariant behavior. Hence, this process can be viewed as the same convolutional operation seen in convolutional neural networks (CNNs), except that the convolution filter connections are dynamic instead of fixed. This operator is designed so that an arbitrary number of these operators can be stacked. As in DNNs, increasing the number of stacked operators enables more complex structures of the molecule to be learned. A typical computation flow of a convolution filter is shown in Figure 8. The most important aspects of the filter are the transformation and reduction operators.
In the transformation step, the feature maps of the neighbors of an atom are transformed by a feed-forward sub-network. For a neighbor atom $b$ of central atom $a$, the input feature map is the concatenation of the atom feature $A_b$ and the atom pair feature $P_{a,b}$. The bias term is denoted as $B^{t}$, and $W^{t}$ denotes the layer weights.
The input is transformed through one fully connected layer and a non-linearity function $f$:
$$A^{t}_{a,b} = f\left(W^{t}\,\mathrm{concat}\{A_b, P_{a,b}\} + B^{t}\right)$$
After each neighbor atom of $a$ is transformed, these feature maps are aggregated and reduced to a single feature map. In this process, a commutative reduction function is used to preserve the order-invariant nature of the input feature maps. A typical example of such a function is the element-wise sum function $\mathrm{sum}\{\cdot\}$, which for input vectors $x_1, x_2, \ldots, x_n$ produces the output vector $y$ defined as $y = \sum_{i=1}^{n} x_i$. Similarly, we define the operator $\mathrm{max}\{\cdot\}$ for element-wise max and $\mathrm{avg}\{\cdot\}$ for element-wise averaging.
Following these principles, a reduction operator is constructed to improve model quality, in which multiple kinds of reduction operations are performed simultaneously and their outputs are combined, as shown in Figure 9. The reduced feature map is then combined with the input feature map of atom $a$ (Figure 8) to produce the final output. This enables the model to obtain feature maps from different convolution levels, which is more straightforward and easier to optimize than using only the reduced feature map [14]. In our experiments, the non-linearity $f$ is the Leaky ReLU function with negative slope $\alpha = 0.01$:
$$f(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases}$$
For each convolutional layer, a batch normalization operation [15] is applied to all atom embeddings of the entire batch to accelerate the training process.
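The transform-then-reduce flow of one convolution step can be sketched in plain Python. Several details are assumptions made for illustration rather than the paper's exact formulation: the three reductions (sum, max, average) are concatenated, a second fully connected layer maps them back to the atom feature size, and the result is added to the input atom feature. All weights and feature values are made up, and batch normalization is omitted.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU with the negative slope quoted in the text."""
    return x if x >= 0 else slope * x


def dense(weights, bias, vec):
    """Fully connected layer (weights given as rows) followed by Leaky ReLU."""
    return [leaky_relu(sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def reduce_neighbors(vectors):
    """Commutative reductions: element-wise sum, max, and average, concatenated."""
    dim, n = len(vectors[0]), len(vectors)
    total = [sum(v[i] for v in vectors) for i in range(dim)]
    peak = [max(v[i] for v in vectors) for i in range(dim)]
    mean = [t / n for t in total]
    return total + peak + mean


def conv_update(a, atoms, pairs, neighbors, w_t, b_t, w_r, b_r):
    """Update atom a from its neighbors; the same weights serve every atom."""
    transformed = [dense(w_t, b_t, atoms[b] + pairs[frozenset((a, b))])
                   for b in neighbors[a]]
    reduced = dense(w_r, b_r, reduce_neighbors(transformed))
    # Assumed combination with the input feature map: element-wise addition.
    return [x + r for x, r in zip(atoms[a], reduced)]


# Hypothetical toy molecule: 2-dimensional atom features, 1-dimensional pair features.
atoms = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
pairs = {frozenset((0, 1)): [1.5], frozenset((1, 2)): [1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
w_t, b_t = [[1, 0, 0], [0, 1, -1]], [0, 0]                        # 3 inputs -> 2 outputs
w_r, b_r = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]], [0, 0]       # 6 inputs -> 2 outputs

updated = conv_update(1, atoms, pairs, neighbors, w_t, b_t, w_r, b_r)
print(updated)
```

Because every reduction is commutative, listing the neighbors of atom 1 as `[2, 0]` instead of `[0, 2]` gives the same update, and the pair features are only read, never modified.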

Input quantization
The initial inputs, the atom-level features $A_a$ and the pair-level features $P_{a,b}$, contain the entries listed in Table 2 and Table 3.

Multi-task learning
In ADME profiling for drug discovery, it is common to find data sets that address the same domain problem under different conditions, such as different experimental settings. For example, the aqueous equilibrium solubility of ligands in one medium (e.g., HCl) is correlated with their solubility in a different medium (e.g., PBS), although the two are not completely equivalent. A model targeting multiple related tasks is expected to be more powerful than independent models for each task.

Fine-tuning
Due to the noisy nature of stochastic optimization algorithms, the validation and testing accuracy of neural network models varies greatly from epoch to epoch. Hence, to obtain a stable model with consistent predictive power, some form of post-processing and model selection is needed.
In this paper, we provide a fine-tuning process that combines model selection and ensembling to further improve the stability and performance of single-shot models.
The fine-tuning process works as shown in Figure 11. First, models are trained on the input data using multiple network configurations with different layer structures. Then, several of the best-performing models from the trained epochs of these configurations are selected based on their validation accuracy. Finally, the embeddings and prediction results of these models are used by the fine-tuning algorithm to train a fine-tuning model, which ensembles these embeddings and produces a model with improved accuracy.
The output embeddings and prediction scores of each selected model form the input of the fine-tuning model. The ensemble model consists of several multi-layer perceptrons. Order-independent reduction layers are also used to compress information from an arbitrary number of models to a fixed size. After these embeddings and scores are transformed by the neural network, a final ligand embedding is produced. This embedding may be combined with an optional explicit feature vector to include any existing engineered ligand descriptors. The combined embedding is then transformed by a multi-layer perceptron to obtain the final predicted score.
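The selection and order-independent reduction steps can be sketched as below. The checkpoint records, the field names (`val_score`, `embedding`, `prediction`), and the use of mean and max as the reductions are all hypothetical; the multi-layer perceptrons that transform the inputs and produce the final score are left out.

```python
def select_best(checkpoints, k):
    """Keep the k checkpoints with the highest validation score."""
    return sorted(checkpoints, key=lambda c: c["val_score"], reverse=True)[:k]


def reduce_models(vectors):
    """Order-independent reduction: element-wise mean and max, concatenated."""
    dim, n = len(vectors[0]), len(vectors)
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    peak = [max(v[i] for v in vectors) for i in range(dim)]
    return mean + peak


# Hypothetical per-epoch outputs for one ligand from two network configurations.
checkpoints = [
    {"config": "A", "epoch": 5, "val_score": 0.62, "embedding": [1.0, 2.0], "prediction": 0.25},
    {"config": "A", "epoch": 9, "val_score": 0.58, "embedding": [2.0, 4.0], "prediction": 0.75},
    {"config": "B", "epoch": 7, "val_score": 0.66, "embedding": [3.0, 0.0], "prediction": 0.5},
]

best = select_best(checkpoints, k=2)
model_inputs = [c["embedding"] + [c["prediction"]] for c in best]
ensemble_input = reduce_models(model_inputs)

print([(c["config"], c["epoch"]) for c in best])
print(ensemble_input)  # fixed size no matter how many models were selected
```

Since the reduction ignores model order and count, the downstream network sees the same input size whether two or ten checkpoints are selected.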

Benchmark method: Cubist
Cubist is a very useful tool for analyzing large and diverse sets of data, especially data with nonlinear structure-activity relationships (SARs) [16,17]. It is a statistical tool for generating rule-based predictive models and resembles a piecewise linear regression model [18] [22]. Some of the descriptors, such as Kier shape indices, contain implicit 3D information. Explicit 3D molecular descriptors were not routinely used in this study, both to avoid biasing the analysis with predicted conformational effects and to keep calculation fast enough for rapid prediction.

Data sets
A large-scale test was performed on Amgen's internal data sets using five ADME endpoints and a total of 13 data sets selected for building predictive models. The five selected ADME endpoints were human microsomal clearance (HLM), human CYP450 inhibition (CYP3A4), aqueous equilibrium solubility, pregnane X receptor (PXR) induction, and bioavailability. For the CYP3A4 assay, two subsets were studied, which differed slightly in conditions. For the aqueous equilibrium solubility assay, three subsets were studied: hydrochloric acid (HCl), phosphate-buffered saline (PBS), and simulated intestinal fluid (SIF). For the PXR induction assay, six subsets were studied, which differed slightly in conditions. Across all ADME endpoints, the data sets used varied in quality and quantity. Generally speaking, the PXR and bioavailability data sets were noisier than the data sets for the three other ADME endpoints.
The training set and test set were split in an approximate ratio of 80:20 (Table 4). To resemble real-time prediction situations, compounds were ranked by their registration date in chronological order. The newest compounds were assigned to the test set.
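A chronological split of this kind can be sketched as follows. The record layout, the field names `compound_id` and `registered`, and the dates are hypothetical; only the idea of holding out the most recently registered 20% comes from the text.

```python
from datetime import date


def chronological_split(records, test_fraction=0.2):
    """Sort by registration date and hold out the newest records for testing."""
    ordered = sorted(records, key=lambda r: r["registered"])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Ten hypothetical compounds registered in shuffled years 2010 to 2019.
records = [
    {"compound_id": f"CMPD-{i}", "registered": date(2010 + (i * 7) % 10, 1, 1)}
    for i in range(10)
]

train, test = chronological_split(records)
print(len(train), len(test))
print([r["compound_id"] for r in test])  # the two most recent compounds
```

Compared with a random split, this makes the test set harder, because newer compounds tend to include chemotypes that the training data has not covered.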

Model training and test procedure:
The test set was used solely for testing purposes to avoid bias in the training procedure. The Caret package in R was used for the Cubist method. A 10-fold cross-validation was applied to tune parameters (committee member and number of nearest neighbors). A Caret-implemented grid search was then used to select the best parameter set to produce final models, using the lowest root mean-squared error (RMSE) for testing. For Chemi-Net, the input SMILES were first converted to 3D structures using an internal molecular conformation generator. The resultant molecular graphs were then used for training and testing. An RMSE-based loss function was used for training the neural network. A standard neural network procedure using the Adam optimizer [23] was applied. Both the single-task and multi-task models were evaluated. The fine-tuning process was performed on all tests.
The Cubist benchmark calculation was performed in parallel on an internal CPU cluster. The Chemi-Net calculation was carried out on six Amazon Web Services (AWS) EC2 p2.8xlarge GPU instances.
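For reference, the two metrics used in this comparison can be computed as sketched below. Two assumptions apply: R2 is taken as the coefficient of determination (the text does not state whether a squared correlation was used instead), and the percentage improvement is taken as the relative change over the baseline. The numbers are illustrative and are not results from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean-squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def percent_increase(r2_new, r2_base):
    """Relative R2 improvement of a new model over a baseline, in percent."""
    return 100.0 * (r2_new - r2_base) / r2_base


# Illustrative values only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print(rmse(y_true, y_pred))
print(r_squared(y_true, y_pred))
print(percent_increase(0.60, 0.50))
```

Training minimizes RMSE while the comparison is reported in R2, so both are worth tracking on the validation set during model selection.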

Figure 1: Absolute (left panel) and percentage (right panel) R2 improvement over Cubist using

Figure 2. Absolute (left panel) and percentage (right panel) of R2 increase between ST-DNN and

Figure 3. Absolute (left panel) and percentage (right panel) of R2 increase between ST-DNN,

Figure 4 shows the prediction performance in comparison to the similarity between training and

Figure 4. Prediction performance and similarity between training and test sets for all 13 data

Figure 5. Prediction performance of binned test set compounds by similarity for the solubility

Figure 6. Comparison between Chemi-Net molecular graph convolutional network derived

Deep neural network-based model
Conventional fingerprint and pharmacophore methods usually require that explicit features are extracted and trained; hence, the forms of the fingerprints are often limited by human prior knowledge. Encouraged by recently reported studies in which DNNs have been shown to surpass human capability in multiple types of tasks, from pattern recognition to playing the game Go [12], we decided to use a DNN architecture to develop an ADME property prediction system. The overall neural network architecture is shown in Figure 7. This network accepts a molecule input with given 3D coordinates of each atom. It then processes the input with several neural network operations and outputs the ADME properties predicted for the input molecule.

Figure 7. Overall network architecture. The input is quantized as molecule-shaped graph

Figure 9. The reduction step in convolution filter.

As shown in Figure 10, our MT-DNN model extends the single-task model in a joint learning setup. The embedding for each ligand is trained and then used to predict the scores of multiple tasks simultaneously. During training, the loss functions of the individual tasks are summed to obtain the final loss function. Furthermore, the weights of individual tasks can be non-uniform. This is useful for scenarios that favor one task over the others.
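The weighted sum of per-task losses can be sketched as follows. RMSE is used as the per-task loss, in line with the training description; skipping compounds that have no measurement for a task (marked `None`) is an assumption made here, and the task names, weights, and values are hypothetical.

```python
import math


def task_rmse(preds, targets):
    """RMSE over the compounds that have a measured value for this task."""
    pairs = [(p, t) for p, t in zip(preds, targets) if t is not None]
    if not pairs:
        return None
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))


def multitask_loss(preds_by_task, targets_by_task, task_weights):
    """Weighted sum of per-task losses; unlisted tasks default to weight 1.0."""
    total = 0.0
    for task, targets in targets_by_task.items():
        loss = task_rmse(preds_by_task[task], targets)
        if loss is not None:
            total += task_weights.get(task, 1.0) * loss
    return total


# Two hypothetical solubility tasks predicted for the same two ligands.
preds = {"solubility_HCl": [1.0, 3.0], "solubility_PBS": [0.0, 5.0]}
targets = {"solubility_HCl": [2.0, 4.0], "solubility_PBS": [None, 2.0]}
weights = {"solubility_HCl": 1.0, "solubility_PBS": 0.5}

loss = multitask_loss(preds, targets, weights)
print(loss)  # 1.0 * 1.0 + 0.5 * 3.0
```

Lowering the weight of a task reduces its pull on the shared ligand embedding, which is how one task can be favored over the others.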

Figure 10. Joint training model for multi-task learning.

, except that the rules can overlap. Cubist does this by building a model containing one or more rules, where each rule is a conjunction of conditions associated with a linear regression. The predictive accuracy of a rule-based model can be improved by combining it with an instance-based or nearest-neighbor-based model. The latter predicts the target value of a new case by finding a predefined number of most similar cases in the training data and averaging their target values. Cubist then combines the rule-