Graph Neural Networks with Multiple Feature Extraction Paths for Chemical Property Estimation

Feature extraction is essential for estimating the chemical properties of molecules with machine learning. Recently, graph neural networks have attracted attention for extracting features from molecules. However, existing methods focus on only one kind of structural information, such as node relationships. In this paper, we propose a novel graph convolutional neural network that extracts features while simultaneously considering multiple structures. Specifically, we propose feature extraction paths specialized in node, edge, and three-dimensional structures. Moreover, we propose an attention mechanism that aggregates the features extracted by the paths and thereby selects useful features dynamically. The experimental results showed that the proposed method outperformed previous methods.


Introduction
Each molecule has unique chemical properties. Estimating these properties is the first step in drug discovery. Reagent testing is the standard estimation method, but it is time-consuming and requires costly equipment. Machine learning methods have therefore been widely studied to reduce time and cost.
Most machine learning methods transform molecules into feature vectors and estimate chemical properties using a neural network. Molecular structure and chemical properties are highly correlated. For example, molecules with benzene rings have a sweet aroma and are flammable, and molecules with hydroxy groups (OH groups) are readily soluble in water. Therefore, extracting features of molecular structures is essential for estimating chemical properties with machine learning. Since the structure that matters differs from one chemical property to another, a flexible feature extraction method is necessary. A common feature extraction method is Molecular Fingerprints [1][2][3][4], which transform a molecular structure into a binary vector indicating the presence or absence of specific human-designed structures. However, these structures are hard to adapt to the target chemical property, since experts must redesign them by hand.
Recently, feature extraction using graph convolutional neural networks [5] has been attracting attention as a learnable feature extraction method. As shown in Figure 1, a graph represents the molecule using nodes (atoms) and edges (bonds). Each node feature is updated from its own feature and the features of its neighboring nodes. With each additional update, a node feature propagates to more distant nodes. Because the update is performed by a neural network, the feature extraction model can learn the substructures in a molecule that are essential for the chemical property being estimated. Various models using graph convolutional neural networks have been developed [6,7]. The Weave model [6] extracts edge features to consider relationships between nodes. The 3DGCN model [7] uses relative coordinates between nodes to extract features of three-dimensional structures. Graph convolutional neural networks have worked well in classification problems, such as predicting whether a molecule is active or inactive. However, there is room for improvement in regression problems because of their wide estimation range. Furthermore, the relevant substructures can differ between target properties. Thus, it is essential to consider multiple structural features simultaneously.
In this paper, we propose a method for chemical property estimation of molecules using multiple structural features. Specifically, we integrate feature extraction paths that consider nodes, edges, and three-dimensional structures, respectively. For more flexible feature extraction, we utilize an attention mechanism to select useful features dynamically.

Related Work
In the estimation of chemical properties by machine learning, the estimator uses feature vectors extracted from molecules. Molecular Fingerprints are one method for extracting such feature vectors [1][2][3]. This method uses a binary vector to represent the presence or absence of human-designed molecular structures. An improved method is Extended Connectivity Fingerprints, or ECFP [4]. ECFP encodes the presence or absence of subgraphs within a given radius around each atom as a feature vector. However, these methods only consider pre-designed molecular structures and thus cannot adapt the extracted features to the chemical property of the target.
More flexible feature extraction methods have been developed using machine learning. Duvenaud et al. used neural networks to refine the features of ECFP [8]. Recently, the graph convolutional neural network [5] has attracted much attention. Graph convolutional neural networks sequentially update node features using the features of their neighboring nodes. Finally, all the node features are merged into a one-dimensional feature vector, which serves as the feature vector of the molecule. In addition to estimating molecular properties, graph convolutional neural networks are used in a wide range of fields, including language processing [9][10][11], human motion estimation [12][13][14], graph similarity estimation [15,16], and class identification [17,18].
For the estimation of chemical properties, various models exist [19][20][21][22][23][24]. Directed graphs are used to reduce computation when updating node features [22,23]. Edge features are extracted in References [6,25]. Relative coordinates between nodes are used to extract features of three-dimensional structures [7]. Other methods learn the importance of node features [26,27]. However, each of the aforementioned methods specializes in a specific molecular structure, such as edges or three-dimensional structures. In this study, we propose to integrate three feature extraction methods [5][6][7] to simultaneously extract multiple molecular structures. Furthermore, we dynamically select features using an attention mechanism to improve the estimation performance.

Materials and Methods
We propose a graph convolutional neural network that integrates three different feature extraction approaches. Depending on the chemical property being estimated, different features need to be extracted. Therefore, we simultaneously extract node features, edge features, and three-dimensional features. Furthermore, we use attention to calculate the importance of each feature dynamically. As shown in Figure 2, the proposed method first extracts features through multiple paths (node features, edge features, and three-dimensional features) and then forms a molecular feature by aggregating them. Extracting features through multiple paths enables the method to consider various structures of the molecule.
Figure 2. Overview of the proposed method. For simplicity, the flow of the feature update is illustrated for a single attention node (green circle). First, we generate pair features (yellow triangles) representing the relationship between the node and its neighbors. We use bond types and relative coordinates to extract edge relationships and three-dimensional structures. Then, we update the node features using the pair features (orange, yellow, and blue circles). Finally, we aggregate the features of the attention node from each path to obtain the node features that are the output of this layer (purple circles). These steps are repeated for each feature update.

Node Feature Extraction Path
This path extracts node features using relationships between nodes. Let $H_i^t \in \mathbb{R}^{M \times 1}$ represent the $M$-dimensional feature vector of node $i$ at the $t$-th update round. We produce the pair feature $P_{ij} \in \mathbb{R}^{M \times 1}$ between nodes $i$ and $j$ by Equation (1), where $\sigma$ represents the rectified linear unit and the features of the two nodes are combined by concatenation. The weight $W_{np} \in \mathbb{R}^{M \times 2M}$ and the bias $B_{np} \in \mathbb{R}^{M \times 1}$ are learnable parameters. Subsequently, we update the node feature $H_i$ by Equation (2), where $N(i)$ is the set of neighboring nodes of node $i$, the weight is $W_n \in \mathbb{R}^{M \times M}$, and the bias is $B_n \in \mathbb{R}^{M \times 1}$.
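
The update described above can be sketched in a few lines of dependency-free Python. Equations (1) and (2) are not reproduced in this text, so the exact form used here is an assumption chosen only to match the stated shapes: the pair feature is taken as ReLU(W_np [H_i ; H_j] + B_np), and the update as ReLU(W_n Σ_j P_ij + B_n). The function names, the toy graph, and the random placeholder weights are all hypothetical.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, v, b):
    # Computes W v + b for a matrix given as a list of rows.
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def node_path_update(H, neighbors, W_np, B_np, W_n, B_n):
    """One update round of the node path (assumed form, see the lead-in)."""
    H_new = []
    for i, h_i in enumerate(H):
        acc = [0.0] * len(B_np)
        for j in neighbors[i]:
            # Pair feature: ReLU(W_np [H_i ; H_j] + B_np); list "+" concatenates.
            p_ij = relu(affine(W_np, h_i + H[j], B_np))
            acc = [a + p for a, p in zip(acc, p_ij)]
        # Assumed update: ReLU(W_n * (sum of pair features) + B_n).
        H_new.append(relu(affine(W_n, acc, B_n)))
    return H_new

# Toy example: a three-atom chain 0-1-2 with M = 2 and random placeholder weights.
rng = random.Random(0)
M = 2
W_np = [[rng.uniform(-1, 1) for _ in range(2 * M)] for _ in range(M)]
W_n = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(M)]
B_np, B_n = [0.0] * M, [0.0] * M
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H1 = node_path_update(H0, neighbors, W_np, B_np, W_n, B_n)
```

Calling `node_path_update` twice lets information from atom 0 reach atom 2, which is the propagation effect of repeated update rounds.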

Edge Feature Extraction Path
The extraction path of edge features takes into account the edge relationships between nodes. The atoms in a molecule can have various bonds, such as single bonds and double bonds. We incorporate these bond types into the feature extraction to consider the molecular structure's connectivity.
We use five bond types: single bonds, double bonds, triple bonds, aromatic bonds, and self-connections (bonds from an atom to itself). Let $N$ represent the number of atoms. We represent the bonds using the edge parameter $E \in \mathbb{R}^{N \times N}$, which describes the connectivity types. A naive choice for $E$ is to use categorical values, such as 1 for a single bond. Inspired by Reference [25], we instead learn the values that represent the bonds rather than fixing five categorical values. As shown in Figure 3, we create five adjacency matrices, one per bond type, and learn the edge parameters using a convolutional filter with a kernel size of 1 and five channels. We obtain the pair feature $P_{ij}$ by Equation (3) using $E_{ij}$, the $(i, j)$-th element of $E$. Note that $E_{ij}$ is a scalar value. Then, we update the feature $H_i$ by Equation (4). The learnable parameters are the weights $W_{ep} \in \mathbb{R}^{M \times 2M}$ and $W_e \in \mathbb{R}^{M \times M}$ and the biases $B_{ep}, B_e \in \mathbb{R}^{M \times 1}$. We take the molecular bonds into account by multiplying the pair features by $E$.
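
A convolution with kernel size 1 over five channels reduces, at each position (i, j), to a weighted sum of the five adjacency entries. The sketch below illustrates this under two assumptions that the text does not state: the filter has a single output channel and no bias term. The weight values are placeholders standing in for learned parameters, and all names are hypothetical.

```python
BOND_TYPES = ["self", "single", "double", "triple", "aromatic"]

def adjacency_channels(n_atoms, bonds):
    """Build one N x N 0/1 adjacency matrix per bond type."""
    A = {b: [[0.0] * n_atoms for _ in range(n_atoms)] for b in BOND_TYPES}
    for i in range(n_atoms):
        A["self"][i][i] = 1.0
    for i, j, bond in bonds:
        # Bonds are undirected, so each matrix is symmetric.
        A[bond][i][j] = A[bond][j][i] = 1.0
    return A

def edge_parameter(A, weights):
    """1x1 convolution over five channels: E_ij = sum_k w_k * A_k[i][j]."""
    n = len(A["self"])
    return [[sum(weights[b] * A[b][i][j] for b in BOND_TYPES) for j in range(n)]
            for i in range(n)]

# Toy molecule: atom 0 double-bonded to atom 1 and single-bonded to atom 2.
A = adjacency_channels(3, [(0, 1, "double"), (0, 2, "single")])
weights = {"self": 0.5, "single": 1.0, "double": 1.5, "triple": 2.0, "aromatic": 1.2}
E = edge_parameter(A, weights)
```

With this construction, pairs of atoms that share no bond receive E_ij = 0, so their pair features are suppressed when multiplied by E.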

Three-Dimensional Feature Extraction Path
We incorporate three-dimensional structural information into feature updates based on Reference [7]. Let $(x_i, y_i, z_i)$ represent the absolute coordinates of node $i$. We calculate the relative x-coordinate as $R^{(x)}_{ij} = x_i - x_j$. Likewise, we obtain the relative coordinates in y and z as $R^{(y)}_{ij} = y_i - y_j$ and $R^{(z)}_{ij} = z_i - z_j$.
We calculate the pair feature $P_{ij}$ using the relative coordinates $R$ as defined in Equation (5). Then, we obtain the intermediate feature $Q_i$ by accumulating the pair features as in Equation (6). However, $Q_i$ excludes the node's own feature $H_i$ because $R_{ii} = 0$. Therefore, as shown in Equation (7), we propose to concatenate $H_i$ and $Q_i$.
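
As a small illustration of the relative coordinates, the following sketch computes R for a toy set of three atoms. The coordinates and the function name are hypothetical. It also shows the property that motivates the concatenation: the diagonal entries R_ii are zero vectors.

```python
def relative_coordinates(coords):
    """coords: list of (x, y, z) tuples. Returns R with R[i][j] = coords[i] - coords[j]."""
    return [[tuple(a - b for a, b in zip(ci, cj)) for cj in coords] for ci in coords]

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
R = relative_coordinates(coords)
```

Here `R[1][0]` is `(1.0, 0.0, 0.0)`, `R[0][1]` is its negation, and every `R[i][i]` is `(0.0, 0.0, 0.0)`, so any term weighted by R_ii contributes nothing about node i itself.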
Relative coordinates have a drawback: although they are unchanged by translation of the molecule, they change when the molecule is rotated. As a further improvement, using the distances between atoms, which are invariant to both translation and rotation, is promising.
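
The rotation dependence of relative coordinates, and the rotation invariance of interatomic distances, can be checked on a toy example. The two atom positions below are hypothetical, and the rotation is a fixed 90-degree turn about the z-axis chosen so that the arithmetic is exact.

```python
import math

def rotate_z_90(p):
    # Rotation by 90 degrees about the z-axis: (x, y, z) -> (-y, x, z).
    x, y, z = p
    return (-y, x, z)

def relative(a, b):
    return tuple(u - v for u, v in zip(a, b))

a, b = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)
a_rot, b_rot = rotate_z_90(a), rotate_z_90(b)

rel_before = relative(a, b)          # points along x
rel_after = relative(a_rot, b_rot)   # points along y after the rotation
dist_before = math.dist(a, b)
dist_after = math.dist(a_rot, b_rot)
```

The relative coordinate vector changes direction under the rotation, while the distance between the two atoms stays the same.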

Feature Aggregation
We propose to extract more useful features by merging the features extracted through the paths. We integrate the three features using attention to dynamically select important features for each node. The features are integrated by Equation (8), where $H^{node}$, $H^{edge}$, and $H^{3d}$ represent the features extracted by the respective paths, and $\alpha^{node}_i$ represents the attention for $H^{node}_i$ at node $i$, which is defined in Equation (9). We use the softmax function to obtain $\alpha$.
Inspired by Reference [26], we calculate the scores $e^{node}_i$, $e^{edge}_i$, and $e^{3d}_i$ for each feature by Equation (10), using the initial feature $H^{init}_i$ of node $i$ and $H^{p}$, $p \in \{node, edge, 3d\}$.
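
The aggregation for a single node can be sketched as a softmax over the three scores followed by a weighted sum of the path features. Equation (10) is not reproduced in this text, so the scores e are simply taken as inputs here rather than computed. The function names and the toy values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(features, scores):
    """features, scores: dicts keyed by path name, for one node."""
    paths = list(features)
    alpha = dict(zip(paths, softmax([scores[p] for p in paths])))
    dim = len(features[paths[0]])
    # Weighted sum of the path features, one dimension at a time.
    h = [sum(alpha[p] * features[p][d] for p in paths) for d in range(dim)]
    return h, alpha

features = {"node": [3.0, 0.0], "edge": [0.0, 3.0], "3d": [3.0, 3.0]}
h_equal, alpha_equal = attention_aggregate(features, {"node": 0.0, "edge": 0.0, "3d": 0.0})
h_node, alpha_node = attention_aggregate(features, {"node": 10.0, "edge": 0.0, "3d": 0.0})
```

With equal scores the result is the plain average of the three features, while a much larger score for one path makes the output nearly identical to that path's feature. This is the sense in which the weights select features per node.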

Details of the Proposed Model
The structure of the proposed model is illustrated in Figure 4. The proposed model extracts features using the paths and the aggregation, which are composed of graph convolutional neural networks. Then, we sum the node features along each dimension to produce a molecular feature vector. Finally, we estimate chemical properties by applying a fully connected layer. We adopted two-stage training: we first trained each path independently, then fixed the paths and trained the aggregation layer and the fully connected layer. We used the mean squared error (MSE) loss for training. We followed Reference [7] to determine the initial features, resulting in 60-dimensional feature vectors. The batch size was set to 16.
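
The summation readout and the MSE loss mentioned above are simple enough to state directly. The following is a minimal sketch with hypothetical function names and toy values, not the training code of the model.

```python
def sum_readout(node_features):
    """Sum node feature vectors along each dimension to get one molecular vector."""
    return [sum(column) for column in zip(*node_features)]

def mse(pred, target):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

node_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
molecule_vector = sum_readout(node_features)   # [9.0, 12.0]
loss = mse([1.0, 2.0], [1.0, 4.0])             # 2.0
```

Because summation does not depend on the order of the nodes, the molecular vector is the same regardless of how the atoms are numbered.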

Datasets and Metrics
We mainly used two datasets in the experiments: Freesolv and ESOL. Both datasets were compiled in Reference [28] and are widely used to evaluate methods for estimating chemical properties. Freesolv is a dataset for estimating the hydration free energy of molecules and contains 643 molecules. ESOL is a dataset for estimating solubility and contains 1128 molecules. Both Freesolv and ESOL are regression tasks, in which the target value is predicted directly. In addition, we used four datasets for further verification of the proposed method. The datasets are summarized in Table 1. QM8 has four excited-state properties calculated by three different methods, giving 12 properties in total. We randomly split each dataset 8:1:1 into training, validation, and test data. We evaluated the proposed method and the comparison methods over 10 trials and averaged the metrics over the trials. The evaluation metric is the mean absolute error (MAE); a smaller MAE is better.
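
The evaluation protocol can be sketched as a random 8:1:1 index split plus the MAE. How the remainder is assigned when the dataset size is not divisible by ten is not stated in the text; in this sketch it goes to the test set, which is an assumption. The function names and the seed handling are hypothetical.

```python
import random

def split_811(n, seed):
    """Randomly split indices 0..n-1 into train/validation/test at 8:1:1."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 8 // 10
    n_val = n // 10
    # Any remainder from integer division ends up in the test set (assumption).
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def mae(pred, target):
    """Mean absolute error; smaller is better."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

train_idx, val_idx, test_idx = split_811(100, seed=0)
example_mae = mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])   # 0.5
```

Repeating the split with ten different seeds and averaging the test MAE corresponds to the 10-trial average described above.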

Comparison Methods
As comparison methods, we used the graph convolutional neural network (GCN) [5], the Weave model [6], and the 3DGCN [7], which extract node features, edge features, and three-dimensional features, respectively. For a fair comparison, we set the number of update layers to two in both the proposed method and the comparison methods. In addition, summation is used to produce the molecular features, as in the proposed method. The main difference between the comparison methods and the proposed method is the number of feature extraction paths: each comparison method has a single path, whereas the proposed method has multiple paths that consider node, edge, and three-dimensional structures simultaneously.

Main Results
We trained the models until they converged. We stopped training when the loss had not improved for ten successive epochs, where an improvement of less than 0.0001 was counted as no improvement. Figure 5 shows typical training and validation loss curves of the proposed method. The validation loss also converged, indicating that the models converged without overfitting. The proposed model has 143,286 parameters. Compared to the 135 M and 11.4 M parameters of the VGG-16 and ResNet-18 models, this number is small. Therefore, the numbers of data points in Freesolv and ESOL are sufficient for the proposed method.
The numerical results are shown in Table 2. GCN was the best among the comparison methods. Moreover, the proposed method outperformed the comparison methods on all datasets. This suggests that the proposed method learned the essential features. Thus, the results show the effectiveness of extracting multiple features for chemical property regression.
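
The stopping rule can be sketched as follows. The text does not say which loss is monitored or whether improvement is measured against the previous epoch or the best value so far; this sketch assumes a single loss sequence and compares against the best value so far. The function name and the toy loss curve are hypothetical.

```python
def stopping_epoch(losses, patience=10, min_delta=1e-4):
    """Return the epoch index at which training stops, or None if it never does."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses):
        if best - loss >= min_delta:
            # A drop of at least min_delta counts as an improvement.
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Loss improves for three epochs (0, 1, 2), then stays flat for ten epochs.
losses = [1.0, 0.5, 0.4] + [0.4] * 10
stop = stopping_epoch(losses)
```

In this example the tenth epoch without improvement is epoch 12, so that is where `stopping_epoch` reports the stop.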

Results on Quantum Mechanics Datasets
We conducted experiments using the QM7 and QM8 datasets. We trained the models until they converged. The results are shown in Table 3. The proposed method outperformed the comparison methods on ten tasks and was the second-best on the remaining tasks. Thus, we verified the effectiveness of the proposed method on various tasks.
Table 3. Averages of MAE over 10 trials on the quantum mechanics datasets (bold and underline indicate the best and the second-best, respectively).

Evaluation on Aggregation Approaches
We carried out experiments to examine the effectiveness of the feature aggregation. Besides attention, various aggregation approaches are possible, such as concatenation, summation, and maximum. We define them in Equations (11)-(13). Table 4 shows the results. The attention and the concatenation aggregations were the best on Freesolv and ESOL, respectively. All the aggregations achieved accurate estimation results. Thus, the proposed method is capable of aggregating the different features using various approaches.
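
The three alternative aggregations can be sketched for the features of a single node. Equations (11)-(13) are not reproduced in this text, so the element-wise reading of summation and maximum used here is an assumption, and the function names and toy vectors are hypothetical.

```python
def aggregate_concat(h_node, h_edge, h_3d):
    # Concatenation keeps every value and triples the feature dimension.
    return h_node + h_edge + h_3d

def aggregate_sum(h_node, h_edge, h_3d):
    # Element-wise sum keeps the feature dimension unchanged.
    return [a + b + c for a, b, c in zip(h_node, h_edge, h_3d)]

def aggregate_max(h_node, h_edge, h_3d):
    # Element-wise maximum keeps the largest value per dimension.
    return [max(a, b, c) for a, b, c in zip(h_node, h_edge, h_3d)]

h_node, h_edge, h_3d = [1.0, 5.0], [2.0, 0.0], [3.0, 4.0]
concat_out = aggregate_concat(h_node, h_edge, h_3d)   # length 6
sum_out = aggregate_sum(h_node, h_edge, h_3d)         # [6.0, 9.0]
max_out = aggregate_max(h_node, h_edge, h_3d)         # [3.0, 5.0]
```

Unlike attention, none of these three depends on learned per-node weights, and concatenation requires the following layer to accept a vector three times as long.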

Impacts of Feature Extraction Paths
We conducted experiments to clarify the impact of the feature extraction paths on the datasets. We built models with one and two paths. Specifically, the one-path models have a single node, edge, or three-dimensional path. The two-path models combine the node and edge paths, the node and three-dimensional paths, or the edge and three-dimensional paths. We used the attention aggregation in the two-path models.
The results are shown in Table 5. On Freesolv, the two-path models outperformed the one-path models, and the proposed model achieved the best MAE of 0.639. Likewise, the two-path models were superior to the one-path models on ESOL. Overall, using multiple features was important for chemical property estimation.

Results in Classification Tasks
We conducted classification experiments on BACE and BBBP, again training the models until they converged. We used four metrics: accuracy, recall, precision, and F-score. Tables 6 and 7 show the results. The two-path model of the proposed method achieved the best results on BACE for all metrics. In addition, the proposed method was the best in precision and the second-best in the other metrics on BBBP. The ROC curves in Figure 6 also show the strong performance of the proposed method. These results show the effectiveness of the proposed method in classification tasks.
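
For reference, the four metrics can be computed from binary labels as in the sketch below. The text does not specify which F-score is used; this sketch assumes the balanced F1 score. The function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall (assumed F-score variant).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case there are two true positives, one false positive, one false negative, and four true negatives, giving an accuracy of 0.75 and precision, recall, and F1 all equal to 2/3.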

Verification of Edge Parameters
We carried out experiments to verify the effect of the edge parameter E, a learnable parameter in Equation (3). We compared two variants of the edge path model. One used the edge parameters learned by the convolution with a kernel size of 1. The other adopted fixed edge parameters given by categorical values: 1 for the self node, 2 for single bonds, 3 for double bonds, 4 for triple bonds, and 5 for aromatic bonds. We trained each model for 50 epochs. The results are shown in Table 8. The model using the learned edge parameters performed better on all the datasets. Therefore, learning the edge parameters yielded a more effective representation of edge types than the fixed categorical values.
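
The fixed baseline can be sketched by filling E directly with the categorical codes listed above. Assigning 0 to pairs of atoms that share no bond is an assumption of this sketch, and the function name and the toy molecule are hypothetical.

```python
FIXED_CODES = {"self": 1.0, "single": 2.0, "double": 3.0, "triple": 4.0, "aromatic": 5.0}

def fixed_edge_parameter(n_atoms, bonds):
    """Build E from fixed categorical codes; non-bonded pairs get 0 (assumption)."""
    E = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i in range(n_atoms):
        E[i][i] = FIXED_CODES["self"]
    for i, j, bond in bonds:
        E[i][j] = E[j][i] = FIXED_CODES[bond]
    return E

# Toy molecule: atom 0 double-bonded to atom 1 and single-bonded to atom 2.
E_fixed = fixed_edge_parameter(3, [(0, 1, "double"), (0, 2, "single")])
```

Since E_ij multiplies the pair features, these codes fix the relative scale of each bond type in advance (for example, an aromatic bond always weighs five times a self-connection), whereas learned parameters leave that scale to training.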

Effect of the Self Node in Three-Dimensional Features
We evaluated the effect of the self node in the three-dimensional feature. We used the model of the extraction path of three-dimensional features. Then, we compared the model with and without the self node. Specifically, we defined the model without self node as Equation (14) instead of Equation (7).
If we omit the self node, its own feature is not considered when aggregating the pair features and updating the node features. The experimental results are shown in Table 9. Including the self node improved the results. Therefore, we confirmed that incorporating the self node into the three-dimensional features can improve the performance.

Attention Visualization
We conducted experiments to confirm that the proposed method determines the attention values dynamically by visualizing the attention values $\alpha_i$ of each node. According to Equation (9), the attention values $\alpha_i$ over the paths sum to one. Thus, we directly illustrated $\alpha$ using bar charts. The visualization results are shown in Figure 7. Different attention values were assigned to each node, which shows that the proposed method flexibly determined the attention for each node.

Conclusions
In this study, we proposed a method for estimating the chemical properties of molecules. The proposed method uses multiple paths to extract features, each focusing on a specific structure in a molecule: node relationships, edge relationships, or the three-dimensional structure. Furthermore, we proposed to obtain more useful features by aggregating the features from the paths with attention, which selects essential features dynamically. The experimental results showed that the proposed method outperformed comparison methods that focus on only one structure in regression tasks. Therefore, extracting multiple features can improve the performance of chemical property estimation.