Abstract
Inference of chemical compounds with desired properties is important for drug design, chemo-informatics, and bioinformatics, to which various algorithmic and machine learning techniques have been applied. Recently, a novel method has been proposed for this inference problem using both artificial neural networks (ANN) and mixed integer linear programming (MILP). This method consists of the training phase and the inverse prediction phase. In the training phase, an ANN is trained so that the output of the ANN takes a value nearly equal to a given chemical property for each sample. In the inverse prediction phase, a chemical structure is inferred using MILP and enumeration so that the structure can have a desired output value for the trained ANN. However, the framework has been applied only to the case of acyclic and monocyclic chemical compounds so far. In this paper, we significantly extend the framework and present a new method for the inference problem for rank-2 chemical compounds (chemical graphs with cycle index 2). The results of computational experiments using such chemical properties as octanol/water partition coefficient, melting point, and boiling point suggest that the proposed method is much more useful than the previous method.
1. Introduction
Inference of chemical compounds with desired properties is important for computer-aided drug design. Since drug design is one of the major targets of chemo-informatics and bioinformatics, it is also important in these areas. Indeed, this problem has been extensively studied in chemo-informatics under the name of inverse QSAR/QSPR [1,2], where QSAR/QSPR denotes Quantitative Structure Activity/Property Relationships. Since chemical compounds are usually represented as undirected graphs, this problem is important also from graph theoretic and algorithmic viewpoints.
Inverse QSAR/QSPR is often formulated as an optimization problem to find a chemical graph maximizing (or minimizing) an objective function under various constraints, where objective functions reflect certain chemical activities or properties. In many cases, objective functions are derived from a set of training data consisting of known molecules and their activities/properties using statistical and machine learning methods.
In both forward and inverse QSAR/QSPR, chemical graphs are often represented as vectors of real or integer numbers because it is difficult to directly handle graphs using statistical and machine learning methods. Elements of these vectors are called descriptors in QSAR/QSPR studies, and these vectors correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding optimal or nearly optimal graph structures under given objective functions [1,3,4]. In many cases, inference or enumeration of graph structures from a given feature vector is a crucial subtask in these methods. Various methods have been developed for this enumeration problem [5,6,7,8] and the computational complexity of the inference problem has been analyzed [9,10,11]. On the other hand, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S, may exceed [12].
As in many other fields, Artificial Neural Network (ANN) and deep learning technologies have recently been applied to inverse QSAR/QSPR. For example, variational autoencoders [13], recurrent neural networks [14,15], and grammar variational autoencoders [16] have been applied. In these approaches, new chemical graphs are generated by solving a kind of inverse problems on neural networks, where neural networks are trained using known chemical compound/activity pairs. However, the optimality of the solution is not necessarily guaranteed in these approaches. In order to guarantee the optimality, a novel approach has been proposed [17] for ANNs with ReLU activation functions and sigmoid activation functions, using mixed integer linear programming (MILP). In their approach, activation functions on neurons are efficiently encoded as piece-wise linear functions so as to represent ReLU functions exactly and sigmoid functions approximately.
Recently, a new framework has been proposed [18,19,20] by combining two previous approaches: efficient enumeration of tree-like graphs [5], and MILP-based formulation of the inverse problem on ANNs [17]. This combined framework for inverse QSAR/QSPR mainly consists of two phases, one for constructing a prediction function to a chemical property, and the other for constructing graphs based on the inverse of the prediction function. The first phase solves (I) Prediction Problem, where a prediction function on a chemical property is constructed with an ANN using a data set of chemical compounds G and their values of . The second phase solves (II) Inverse Problem, where (II-a) given a target value of the chemical property , a feature vector is inferred from the trained ANN so that is close to and (II-b) then a set of chemical structures such that is enumerated. In (II-a) of the above-mentioned previous methods [18,19,20], an MILP is formulated for acyclic chemical compounds. Their methods were applicable only to acyclic chemical graphs (i.e., tree-structured chemical graphs), where the ratio of acyclic chemical graphs in a major chemical database (PubChem) is 2.91%. Afterward, Ito et al. [21] designed a method of inferring monocyclic chemical graphs (chemical graphs with cycle index or rank 1) by formulating a new MILP and using an efficient algorithm for enumerating monocyclic chemical graphs [22]. This still leaves a big limitation because the ratio of acyclic and monocyclic chemical graphs in the chemical database PubChem is only 16.26%.
To break this limitation, we significantly extend the MILP-based approach for inverse QSAR/QSPR so that “rank-2 chemical compounds” (chemical graphs with cycle index or rank 2) can be efficiently handled, where the ratio of chemical graphs with rank at most 2 in the database PubChem is 44.5%. Note that there are three different topological structures, called polymer-topologies over all rank-2 chemical compounds. In particular, we propose a novel MILP formulation for (II-a) along with a new set of descriptors. One big advantage of this new formulation is that an MILP instance has a solution if and only if there exists a rank-2 chemical graph satisfying given constraints, which is useful to significantly reduce redundant search in (II-b). We conducted computational experiments to infer rank-2 chemical compounds on several chemical properties.
The paper is organized as follows. Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors. Section 2.2 reviews the framework for inferring chemical compounds based on ANNs and MILPs. Section 2.3 introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices, where our MILP requires only variables and constraints when the maximum height of subtrees in G is constant. Section 3 reports the results on some computational experiments conducted for chemical properties such as octanol/water partition coefficient, melting point, and boiling point. Section 4 makes some concluding remarks. Appendix A provides the detail of all variables and constraints in our MILP formulation.
2. Materials and Methods
2.1. Preliminary
This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.
2.1.1. Multigraphs and Graphs
Let and denote the sets of reals and non-negative integers, respectively. For two integers a and b, let denote the set of integers i with .
Multigraphs
A multigraph is defined to be a pair of a vertex set V and an edge set E such that each edge joins two vertices (possibly ) and the vertices u and v are called the end-vertices of the edge e, and let denote the set of the end-vertices of an edge , where an edge e with is called a loop. We denote the vertex and edge sets of a multigraph M by and , respectively. A path with end-vertices u and v is called a -path, and the length of a path is defined to be the number of edges in the path.
Let M be a multigraph. An edge is called multiple (to an edge ) if there is another edge with . For a vertex , the set of neighbors of v in M is denoted by , and the degree of v is defined to be the number of times an edge in is incident to v; i.e., . A multigraph is called simple if it has no loop and there is at most one edge between any two vertices. We observe that the sum of the degrees over all vertices is twice the number of edges in any multigraph M; i.e.,
For a subset X of vertices in M, let denote the multigraph obtained from M by removing the vertices in X and any edge incident to a vertex in X. An operation of subdividing a non-loop edge (resp., loop) with (resp., ) is to replace e with two new edges and incident to a new vertex such that each is incident to . An operation of contracting a vertex u of degree 2 in M is to replace the two edges and incident to u with a single edge removing vertex u, where the resulting edge is a loop when . The rank of a multigraph M is defined to be the minimum number of edges to be removed to make the multigraph acyclic. We call a multigraph M with a rank-k graph. Let denote the set of vertices of degree i in M. The core of M is defined to be an induced subgraph that is obtained from by setting repeatedly until contains at most two vertices or consists of vertices of degree at least 2. The core of a connected multigraph M consists of a single vertex (resp., two vertices) if and only if M is a tree with an even (resp., odd) diameter. A vertex (resp., an edge) in M is called a core vertex (resp., core edge) if it is contained in the core of M and is called a non-core vertex (resp., non-core edge) otherwise. The core size is defined to be the number of core vertices of M, and the core height is defined to be the maximum length of a path between a vertex to a leaf of M without passing through any core edge. The set of non-core edges induces a collection of subtrees, each of which we call a non-core component of M, where each non-core component C contains exactly one core vertex and we regard C as a tree rooted at . Let C be a non-core component of M. The height of a vertex v in C is defined to be the maximum length of a path from v to a leaf u in the descendants of v.
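The rank and the core defined above can be computed directly from an edge list. The following is a minimal sketch, assuming a multigraph given as a list of vertex pairs (loops and multiple edges allowed); the function names `rank` and `core_vertices` are hypothetical. It uses the standard identity that the rank equals the number of edges minus the number of vertices plus the number of connected components, and it removes all degree-1 vertices round by round as in the definition of the core.

```python
from collections import defaultdict


def rank(vertices, edges):
    """Cycle rank: |E| - |V| + (number of connected components)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return len(edges) - len(parent) + components


def core_vertices(vertices, edges):
    """Repeatedly delete all degree-1 vertices until at most two vertices
    remain or every remaining vertex has degree at least 2."""
    alive = set(vertices)
    while len(alive) > 2:
        degree = defaultdict(int)
        for u, v in edges:
            if u in alive and v in alive:
                degree[u] += 1
                degree[v] += 1  # a loop (u == v) contributes 2 to its vertex
        leaves = {v for v in alive if degree[v] == 1}
        if not leaves:
            break
        alive -= leaves
    return alive


# Two triangles sharing vertex 3, with a pendant path 5-6-7.
example_edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 3), (5, 6), (6, 7)]
print(rank(range(1, 8), example_edges))                    # 2
print(sorted(core_vertices(range(1, 8), example_edges)))   # [1, 2, 3, 4, 5]
```

For a path on five vertices the same routine returns the single middle vertex, in line with the statement that a tree with even diameter has a one-vertex core.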
A multigraph is called a polymer topology if it is connected and the degree of every vertex is at least 3. Tezuka and Oike [23] pointed out that a classification of polymer topologies will lay a foundation for elucidation of structural relationships between different macro-chemical molecules and their synthetic pathways. For integers and , let denote the set of all rank-r polymer topologies with maximum degree at most d. Figure 1 illustrates the three rank-2 polymer topologies in .
Figure 1.
An illustration of the three rank-2 polymer topologies .
For a polymer topology M, the least simple graph of M is defined to be a simple graph obtained from M by subdividing each loop in M with two new vertices of degree 2 and subdividing all multiple edges (except for one) between every two adjacent vertices in M. Note that for the rank r of M and the number s of loops in M.
The polymer topology of a multigraph M with is defined to be a multigraph of degree at least 3 that is obtained from the core by contracting all vertices of degree 2. Note that . Figure 2a–c illustrate the least simple graph of each polymer topology , where Figure 2d illustrates a graph that contains all least simple graphs.
Figure 2.
An illustration of the least simple graphs of the rank-2 polymer topologies in Figure 1 and a scheme graph : (a) ; (b) ; (c) ; (d) a scheme graph where each edge is directed from one end-vertex to the other end-vertex with , and , and , and the edges in (resp., and ) are depicted with dashed (resp., dotted and solid) lines.
Graphs
Let be a graph with a set V of vertices and a set E of edges. Define the 1-path connectivity of H to be .
Let H be a rank-2 connected graph such that the maximum degree is at most 4. We see that H contains two vertices and such that either there are three disjoint paths between and or H contains two edge disjoint cycles C and , which are joined with a path between and (possibly ). We introduce the topological parameter of rank-2 connected graph H as follows. When H has three disjoint paths between and , define to be the minimum number of edges along a path between and . When H contains two edge disjoint cycles C and , which are joined with a path P between and (possibly ), define to be .
For positive integers and c with , let denote the rooted tree such that the number of children of the root is a, the number of children of each non-root internal vertex is b and the distance from the root to each leaf is c. In the rooted tree , we denote the vertices by () with a breadth-first-search order, and denote the edge between a vertex with and its parent by . For each vertex in , let denote the set of indices j such that is a child of , and denote the index j such that is the parent of when .
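As an illustration of the breadth-first indexing described above, the following minimal sketch builds the rooted tree with root branching a, inner branching b and depth c, and records the child index sets and the parent index of every vertex. The function name `build_rooted_tree` and the dictionary names are hypothetical; the root is assumed to receive index 1.

```python
def build_rooted_tree(a, b, c):
    """Return (children, parent) for the rooted tree in which the root has
    a children, every other internal vertex has b children, and every leaf
    is at distance c from the root. Vertices are numbered 1, 2, ... in
    breadth-first order, with the root numbered 1."""
    children = {1: []}
    parent = {}
    depth = {1: 0}
    next_index = 2
    current = 1
    # Visiting vertices in index order while handing out fresh indices
    # sequentially yields exactly a breadth-first numbering.
    while current < next_index:
        if depth[current] < c:
            fanout = a if current == 1 else b
            for _ in range(fanout):
                children[current].append(next_index)
                children[next_index] = []
                parent[next_index] = current
                depth[next_index] = depth[current] + 1
                next_index += 1
        current += 1
    return children, parent


children, parent = build_rooted_tree(2, 2, 2)
print(len(children))   # 7 vertices: 1 + 2 * (1 + 2)
print(children[1])     # [2, 3]
print(parent[6])       # 3
```

The number of vertices is 1 + a(1 + b + ... + b^(c-1)), which is what bounds the number of variables needed per attached tree in an MILP formulation.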
2.1.2. Modeling of Chemical Compounds
Chemical Graphs
We represent the graph structure of a chemical compound as a graph with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. Nearly 68.5% (resp., 99%) of the rank-2 chemical graphs with at most 200 non-hydrogen atoms registered in chemical database PubChem have a maximum degree at most 3 (resp., 4) for all non-core vertices in the hydrogen-suppressed model.
Let be a set of labels, each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that does not contain H (hydrogen). Let and denote the mass and valence of a chemical element , respectively. In our model, we use integers , and assume that each chemical element has a unique valence .
We introduce a total order < over the elements in according to their mass values; i.e., we write for chemical elements with . Choose a set of tuples such that . For a tuple , let denote the tuple . Set , and . A pair of two atoms and joined with a bond of multiplicity k is denoted by a tuple , called the adjacency-configuration of the atom pair.
We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage. A chemical graph over and is defined to be a tuple of a graph , a function and a function such that
- (i) H is connected;
- (ii) for each vertex ; and
- (iii) for each edge .
Let denote the set of chemical graphs over and .
Descriptors
In our method, we use only graph-theoretical descriptors for defining a feature vector, which facilitates our designing an algorithm for constructing graphs. Given a chemical graph , we define a feature vector that consists of the following 14 kinds of descriptors:
- : the number of vertices in G;
- : the core size of G;
- : the core height of G;
- : the 1-path connectivity of G;
- (): the number of vertices of degree i in G;
- (: the number of core vertices with chemical element ;
- (: the number of non-core vertices with chemical element ;
- : the average of of atoms in G;
- (): the number of double and triple bonds in core edges;
- (): the number of double and triple bonds in non-core edges;
- (): the number of adjacency-configurations of core edges;
- (): the number of adjacency-configurations of non-core edges;
- : the topological parameter of H; and
- : the number of hydrogen atoms to be included in G; i.e.,.
The number k of descriptors in our feature vector is .
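Several of the descriptors above are simple counts over a labeled graph. The following is a minimal sketch that computes a subset of them (number of vertices, core size, degree counts, element counts, and double/triple bond counts, the last two split into core and non-core parts). It assumes the set of core vertices is already known, treats an edge as a core edge when both end-vertices are core vertices, and uses hypothetical dictionary keys such as `dg_3` or `ce_core_C`, since these names are not fixed by the text. The example graph is a made-up labeled graph, not a compound from the data sets.

```python
from collections import Counter


def simple_descriptors(alpha, beta, core):
    """alpha: vertex -> element symbol (hydrogen-suppressed).
    beta:  (u, v) -> bond multiplicity, one entry per edge.
    core:  set of core vertices."""
    degree = Counter()
    for u, v in beta:
        degree[u] += 1
        degree[v] += 1

    features = {"n": len(alpha), "core_size": len(core)}

    # Number of vertices of degree i, for i = 1, ..., 4.
    for i in range(1, 5):
        features[f"dg_{i}"] = sum(1 for v in alpha if degree[v] == i)

    # Element counts, split into core and non-core vertices.
    for v, element in alpha.items():
        part = "core" if v in core else "noncore"
        key = f"ce_{part}_{element}"
        features[key] = features.get(key, 0) + 1

    # Double and triple bonds, split into core and non-core edges.
    for (u, v), multiplicity in beta.items():
        if multiplicity >= 2:
            part = "core" if u in core and v in core else "noncore"
            key = f"b_{part}_{multiplicity}"
            features[key] = features.get(key, 0) + 1

    return features


# Hypothetical rank-2 labeled graph: core on vertices 1-4, tail 2-5-6.
alpha = {1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "O"}
beta = {(1, 2): 2, (2, 3): 1, (3, 1): 1, (3, 4): 1, (4, 1): 1,
        (2, 5): 1, (5, 6): 2}
print(simple_descriptors(alpha, beta, core={1, 2, 3, 4}))
```

Because every entry is an integer count over vertices or edges, each of these quantities can also be expressed with linear constraints over binary selection variables, which is the property the MILP formulation relies on.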
2.2. A Method for Inferring Chemical Graphs
This section reviews the framework that solves the inverse QSAR/QSPR by using MILPs [18]. For a specified chemical property such as boiling point, we denote by the observed value of the property for a chemical compound G. As Phase 1, we solve (I) Prediction Problem with the following three steps.
Phase 1.
Step 1: Let be a set of chemical graphs. For a specified chemical property , choose a class of graphs such as acyclic graphs or monocyclic graphs. Prepare a data set such that the value of each chemical graph , is available. Set reals so that , . See Figure 3 for an illustration of Step 1.
Figure 3.
An illustration of Step 1: A data set of chemical graphs , in a class of graphs whose values of a chemical property are available.
Step 2: Introduce a feature function for a positive integer k. We call the feature vector of , and call each entry of a vector a descriptor of G. See Figure 4 for an illustration of Step 2.
Figure 4.
An illustration of Step 2: Each chemical graph is mapped to a vector in a feature vector space for some positive integer k.
Step 3: Construct a prediction function with an ANN that, given a vector in , returns a real in the range so that takes a value nearly equal to for many chemical graphs in D. See Figure 5 for an illustration of Step 3.
Figure 5.
An illustration of Step 3: A prediction function from the feature vector space to the range is constructed based on an ANN .
Next we explain how to solve the inverse problem to the prediction in Phase 1 using an MILP formulation. A vector is called admissible if there is a graph such that [18]. Let denote the set of admissible vectors . In this paper, we use the range-based method to define an applicability domain (AD) [24] to our inverse QSAR/QSPR. Set and to be the minimum and maximum values of the j-th descriptor in over all graphs , (where we possibly normalize some descriptors such as , which is normalized with ). Define our AD to be the set of vectors such that for the variable of each j-th descriptor, . As the second phase, we solve (II) Inverse Problem for the inverse QSAR/QSPR by treating the following inference problems.
(II-a) Inference of Vectors
Input: A real .
Output: Vectors and such that and forms a chemical graph with .
(II-b) Inference of Graphs
Input: A vector .
Output: All graphs such that .
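The range-based applicability domain described before the two inference problems amounts to a per-descriptor interval check. A minimal sketch, assuming feature vectors are given as equal-length lists of numbers and ignoring the optional normalization of some descriptors; the names `fit_range_domain` and `in_domain` are hypothetical:

```python
def fit_range_domain(feature_vectors):
    """Return per-descriptor minima and maxima over the data set."""
    columns = list(zip(*feature_vectors))
    lower = [min(column) for column in columns]
    upper = [max(column) for column in columns]
    return lower, upper


def in_domain(x, lower, upper):
    """True if every descriptor of x lies within the observed range."""
    return all(lo <= xj <= hi for xj, lo, hi in zip(x, lower, upper))


data = [[10, 2, 0.5], [14, 3, 0.7], [12, 1, 0.6]]
lower, upper = fit_range_domain(data)
print(lower, upper)                            # [10, 1, 0.5] [14, 3, 0.7]
print(in_domain([11, 2, 0.55], lower, upper))  # True
print(in_domain([15, 2, 0.55], lower, upper))  # False
```

In an MILP the same condition becomes one pair of linear bound constraints per descriptor variable.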
To treat Problem (II-a), we use MILPs for inferring vectors in ANNs [17]. In MILPs, we can easily impose additional linear constraints or fix some variables to specified constants. We include into the MILP a linear constraint such that to obtain the next result.
Theorem 1.
Let be an ANN with a piecewise-linear activation function for an input vector , denote the number of nodes in the architecture and denote the total number of break-points over all activation functions. Then there is an MILP that consists of variable vectors , , and an auxiliary variable vector for some integer and a set of constraints on these variables such that: if and only if there is a vector feasible to .
See Appendix A for the set of constraints to define our AD in the MILP in Theorem 1.
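Theorem 1 rests on expressing each piecewise-linear activation function with linear constraints and integer variables. As a minimal sketch of the idea, the following checks the textbook big-M encoding of a single ReLU neuron y = max(0, x) with one binary variable z. This is a standard technique and is not necessarily the encoding used in [17]; the names `relu_big_m_feasible`, `feasible_outputs` and `big_m` are hypothetical, and `big_m` is assumed to be an upper bound on |x|.

```python
def relu_big_m_feasible(x, y, z, big_m):
    """Check the big-M constraints for y = max(0, x) with binary z.

    z = 1 forces y = x (and hence x >= 0);
    z = 0 forces y = 0 (and hence x <= 0)."""
    return (
        z in (0, 1)
        and y >= 0
        and y >= x
        and y <= x + big_m * (1 - z)
        and y <= big_m * z
    )


def feasible_outputs(x, big_m, candidates):
    """All candidate y values that satisfy the constraints for some z."""
    return sorted(
        {y for y in candidates for z in (0, 1)
         if relu_big_m_feasible(x, y, z, big_m)}
    )


grid = range(-5, 6)
for x in (-2, 0, 3):
    print(x, feasible_outputs(x, big_m=5, candidates=grid))
```

On this integer grid the only feasible output for each input is max(0, x), which is what makes the encoding exact for ReLU; a sigmoid function would instead need a piecewise-linear approximation with several break-points.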
To attain the admissibility of inferred vector , we also introduce a variable vector for some integer q and a set of constraints on x and g such that holds in the following sense: is feasible to the MILP if and only if forms a chemical graph with . Phase 2 consists of the following two steps.
Phase 2.
Step 4: Formulate Problem (II-a) as the above MILP based on and . Find a set of vectors such that for a tolerance set to be a small positive real. See Figure 6 for an illustration of Step 4.
Figure 6.
An illustration of Step 4: Given a target value , solving MILP either delivers a set of vectors such that or detects that no such vector exists.
Step 5: To solve Problem (II-b), enumerate all graphs such that for each vector . See Figure 7 for an illustration of Step 5.
Figure 7.
An illustration of Step 5: For each vector , all chemical graphs such that are generated.
In this paper, we set a graph class to be the set of rank-2 graphs. In Step 4, we solve an MILP that is formulated on a novel idea of representing rank-2 chemical graphs, as will be discussed in Section 2.3.2. In Step 5, we use branch-and-bound algorithms for enumerating rank-2 chemical compounds [25,26].
2.3. Representing Rank-2 Chemical Graphs
This section introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices.
2.3.1. Scheme Graphs and Tree-Extensions
Given positive integers and p, a graph with vertices and p edges can be represented as a subgraph of a complete graph with edges. However, formulating this as an MILP may require preparing variables and constraints. To reduce the number of variables and constraints in an MILP that represents a rank-2 graph, we decompose a rank-2 graph G into the core and non-core of G so that the core is represented by one of the three rank-2 polymer topologies and the non-core is a collection of trees in which the height is bounded by the core height of G. We do not specify how many subtrees will be attached to each edge in the polymer topology in advance, since otherwise we would need a different MILP for each distinct combination of such assignments of subtrees. Instead, we allow each edge in a polymer topology to collect a necessary number of subtrees in our MILP (see the next section for more detail). In this section, we introduce a “scheme graph” to represent three possible rank-2 polymer topologies, an “extension” of the scheme graph to represent the core of a rank-2 graph and a “tree-extension” to represent a combination of the core and non-core of a rank-2 graph, so that any of the three kinds of rank-2 polymer topologies can be selected in a single MILP formulation.
Scheme Graphs
Formally, we define the scheme graph for rank 2 to be a pair of a multigraph K and an ordered partition of the edge set . Figure 2d illustrates the scheme graph . An edge in is called a semi-edge, an edge in is called a virtual edge and an edge in is called a real edge.
Extensions of Scheme Graphs
Based on the scheme graph , we construct the core of a rank-2 graph H as an “extension,” which is defined as follows (see also Figure 8). An extension of the scheme graph is defined to be a simple graph obtained from K by using each real edge , by eliminating or replacing each virtual edge (resp., semi-edge ) with a -path of length at least two (resp., 1) in the core of H, where a -path of length 1 means an edge . Figure 9a illustrates an extension of the scheme graph which is obtained by removing virtual edges and by replacing semi-edge with a path , semi-edge with a path and by using semi-edge and real edges . The extension in Figure 9a is isomorphic to the core of the rank-2 graph H in Figure 9b. Observe that each of the least simple graphs , in Figure 2 is obtained as an extension of the scheme graph in Figure 2d.
Figure 8.
An illustration of a tree-extension, where the vertices in are depicted with gray circles: (a) The structure of the rooted tree rooted at a vertex ; (b) the structure of the rooted tree rooted at a vertex ; (c) the -tree-extension of the scheme graph in Figure 2d for , and .
Figure 9.
(a) An example of an extension of the scheme graph; (b) an example of a rank-2 graph H with , , and , where the labels of some vertices and edges indicate the corresponding vertices and edges in the -tree-extension for , , , and ; (c) a subgraph of -tree-extension isomorphic to the rank-2 graph H in (b).
Tree-Extensions
Let denote the number of vertices in the scheme graph. For non-negative integers a, b and c, we consider a rank-2 graph H such that , and the maximum degree of a non-core vertex is at most c. We define an “-tree-extension” as a minimal supergraph of all such rank-2 graphs H. Formally, the -tree-extension (or a tree-extension) is defined to be the graph obtained by augmenting the graph K as follows:
Figure 8 illustrates the -tree-extension of the scheme graph. We show with an example how a rank-2 graph can be constructed as a subgraph of a tree-extension. Figure 9b illustrates a rank-2 graph H with , , and , where the maximum degree of a non-core vertex is 3. To prepare a tree-extension so that the graph H can be a subgraph of the tree-extension, we set , , and . Figure 9c illustrates a subgraph of the -tree-extension such that is isomorphic to the rank-2 graph H.
2.3.2. MILPs for Rank-2 Chemical Graphs
We present an outline of our MILP in Step 4 of the framework. For integers , , let denote the set of rank-2 graphs H such that the degree of each core vertex is at most 4, the degree of each non-core vertex is at most , , , and . In this paper, we obtain the following result.
Theorem 2.
Let Λ be a set of chemical elements, Γ be a set of adjacency-configurations, where , and . Given integers , , and , there is an MILP that consists of variable vectors and for some integer and a set of constraints on these variables such that: is feasible to if and only if forms a rank-2 chemical graph such that and .
Note that our MILP requires only variables and constraints when the maximum core height of a subtree in the non-core of and are constant. We formulate an MILP in Theorem 2 so that such a graph H is selected as a subgraph of the scheme graph.
We explain the basic idea of our MILP. Define
where and are the numbers of vertices and non-leaf vertices in the rooted tree , respectively. The MILP mainly consists of the following four types of constraints.
- (1) Constraints for selecting a rank-2 graph H as a subgraph of the -tree-extension of the scheme graph ;
- (2) Constraints for assigning chemical elements to vertices and multiplicity to edges to determine a chemical graph ;
- (3) Constraints for computing descriptors from the selected rank-2 chemical graph G; and
- (4) Constraints for reducing the number of rank-2 chemical graphs that are isomorphic to each other but can be represented by the above constraints.
In the constraints of 1, we treat each edge in the tree-extension as a directed edge because describing some condition for H to belong to becomes slightly easier than the case of undirected graphs. More formally we prepare the following.
- (i) In the scheme graph , denote the edges in by , and (where ), and regard each edge as a directed edge from one end-vertex to the other end-vertex with . Let be a binary variable for each edge , .
- (ii) In each tree (resp., ) in the tree-extension, we regard each edge , in the rooted tree , (resp., , in the rooted tree , ) as a directed edge from vertex to vertex (resp., from vertex to vertex ). Let (resp., ) be a binary variable for vertex , (resp., ) and ;
- (iii) In the path consisting of the roots of trees , , we regard each edge , as a directed edge from vertex to vertex ; and
- (iv) We regard each edge for and as two directed edges, one directed from vertex to vertex and the other directed oppositely. Let (resp., ) be a binary variable of directed edge (resp., ).
Based on these, we include constraints with some more additional variables so that a selected subgraph H is a connected rank-2 graph. See constraints Equations (A10) to (A42) in Appendix A for the details.
In the constraints of 2, we prepare an integer variable for each vertex u in the tree-extension that represents the chemical element if u is in a selected graph H (or otherwise) and an integer variable (resp., ) for each edge e (resp., or , , ) in the tree-extension that represents the multiplicity if e is in a selected graph H (or or takes 0 otherwise). This determines a chemical graph . Also we include constraints for a selected chemical graph G to satisfy the valence condition for each edge . See constraints Equations (A43) to (A61) in Appendix A for the details.
In the constraints of 3, we introduce a variable for each descriptor and constraints with some more variables to compute the value of each descriptor in for a selected chemical graph G. See constraints Equations (A62) to (A113) in Appendix A for the details.
With constraints 1 to 3, our MILP formulation already represents a rank-2 chemical graph G and a feature vector so that holds. In the constraints of 4, we include some additional constraints so that the search space required for an MILP solver to solve an instance of our MILP problem is reduced. For this, we consider a graph-isomorphism of rooted subtrees of each tree or and define a canonical form among subtrees that are isomorphic to each other. We try to eliminate a chemical graph G that has a subtree in or that is not a canonical form. See constraints Equations (A114) to (A119) in Appendix A for the details.
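The symmetry-reduction idea can be illustrated outside the MILP with a canonical code for labeled rooted trees. The following is a minimal sketch of a standard sorted-encoding technique (in the spirit of the AHU tree-isomorphism algorithm); it is not the canonical form defined by constraints (A114) to (A119), and the names `canonical_form`, `children` and `label` are hypothetical.

```python
def canonical_form(children, label, root):
    """Nested-tuple code that is identical for isomorphic labeled rooted
    trees: the root label followed by the sorted codes of its subtrees."""
    subtree_codes = sorted(
        canonical_form(children, label, child)
        for child in children.get(root, ())
    )
    return (label[root], tuple(subtree_codes))


# Two drawings of the same rooted tree with the children listed in a
# different order, and a third tree that is genuinely different.
label = {1: "C", 2: "N", 3: "O", 4: "C"}
tree_a = {1: [2, 3], 3: [4]}
tree_b = {1: [3, 2], 3: [4]}
tree_c = {1: [2, 3], 2: [4]}

print(canonical_form(tree_a, label, 1) == canonical_form(tree_b, label, 1))  # True
print(canonical_form(tree_a, label, 1) == canonical_form(tree_c, label, 1))  # False
```

Keeping only the representative whose child order already matches the sorted order is one way to discard duplicates; in an MILP the analogous effect is obtained with ordering constraints between sibling subtrees.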
3. Results
We implemented our method of Steps 1 to 5 for inferring rank-2 chemical graphs and conducted experiments to evaluate the computational efficiency for three chemical properties : octanol/water partition coefficient (Kow), melting point (Mp), and boiling point (Bp). We executed the experiments on a PC with an Intel Core i5 1.6 GHz CPU and 8 GB of RAM running under the Mac OS operating system version 10.14.6. We show 2D drawings of some of the inferred chemical graphs, where ChemDoodle version 10.2.0 is used for constructing the drawings.
Results on Phase 1.
Step 1. We set a graph class to be the set of all rank-2 chemical graphs. For each property {Kow, Mp, Bp}, we selected a set of chemical elements and collected a data set on rank-2 chemical graphs over provided by HSDB from PubChem. To construct the data set, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as or an element in which the valence is different from our setting of valence function .
Table 1 shows the size and range of data sets that we prepared for each chemical property in Step 1, where we denote the following:
Table 1.
Results of Step 1 in Phase 1.
- : one of the chemical properties Kow, Mp and Bp;
- : the size of data set for property ;
- : the set of chemical elements over data set (hydrogen atoms are added at the final stage);
- : the number of tuples in ;
- : the minimum and maximum number of non-hydrogen atoms over data set ;
- , : the minimum and maximum core size and core height over chemical compounds in , respectively;
- : the minimum and maximum values of the topological parameter over data set ; and
- : the minimum and maximum values of in over data set .
Step 2. We used a feature function f that consists of the descriptors defined in Section 2.1.
Step 3. We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs where the tool and activation function are set to be MLPRegressor and ReLU, respectively. We tested several different architectures of ANNs for each chemical property. To evaluate the performance of the resulting prediction function with cross-validation, we partition a given data set into five subsets , randomly, where is used for a training set and is used for a test set in five trials . For a set of observed values and a set of predicted values, we define the coefficient of determination to be , where . Table 2 shows the results on Steps 2 and 3, where
Table 2.
Results of Steps 2 and 3 in Phase 1.
- k: the number of descriptors for the chemical compounds in data set for property ;
- Activation: the choice of activation function;
- Architecture: consists of an input layer with a nodes, a hidden layer with b nodes, and an output layer with a single node, where a is equal to the number of descriptors;
- L-time: the average time (sec.) to construct ANNs for each trial;
- test (ave.): the average of coefficient of determination over the five test sets; and
- test (best): the largest value of coefficient of determination over the five test sets.
For each chemical property , we selected the ANN that attained the best test score among the five ANNs to formulate an MILP in the second phase.
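The evaluation protocol of Step 3 (random five-fold partition, one trial per held-out fold, average and best test score) can be sketched as follows. This is a minimal standard-library illustration, assuming the usual definition of the coefficient of determination as one minus the ratio of the residual sum of squares to the total sum of squares; the names `r2_score`, `five_fold_indices` and `cross_validate` and the `fit` callback are hypothetical, and the toy linear "model" merely stands in for a trained ANN.

```python
import random


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


def five_fold_indices(n, seed=0):
    """Randomly partition indices 0..n-1 into five disjoint folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::5] for i in range(5)]


def cross_validate(xs, ys, fit, seed=0):
    """Train on four folds, test on the fifth, for each of five trials.
    Returns (average test R^2, best test R^2)."""
    scores = []
    for test_idx in five_fold_indices(len(xs), seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        predictions = [model(xs[i]) for i in test_idx]
        scores.append(r2_score([ys[i] for i in test_idx], predictions))
    return sum(scores) / len(scores), max(scores)


# Toy usage: the "learner" ignores its training data and returns y = 2x.
xs = list(range(10))
ys = [2 * x for x in xs]
print(cross_validate(xs, ys, fit=lambda train_x, train_y: (lambda x: 2 * x)))
```

In practice the `fit` callback would wrap the training of a regressor such as scikit-learn's MLPRegressor and return its prediction function.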
Results on Phase 2.
We implemented Steps 4 and 5 in Phase 2 as follows.
Step 4. In this step, we solve the MILP formulated based on the ANN obtained in Phase 1. To solve an MILP in Step 4, we use CPLEX version 12.10. In our experiment, we choose a target value and fix or bound some descriptors in our feature vector as follows:
- Fix variable that represents the polymer parameter to be each integer in ;
- Set to be each of 3 and 4;
- Fix to be some four integers in for and for ;
- Choose three integers from and fix to be each of the three integers;
- Fix to be each of the four integers in .
Based on the above setting, we generated 12 instances for each . We set in Step 4.
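To give some intuition for how a ReLU unit of the trained ANN can enter an MILP, the following sketch checks a standard big-M encoding of y = max(0, x) with one binary variable z. This is an assumption made for illustration and is not necessarily the formulation of [17] that we use; the bounds `lower` and `upper` on the pre-activation value and all function names are hypothetical.

```python
def relu_big_m_feasible(x, y, z, lower, upper, tol=1e-9):
    """Return True if (x, y, z) satisfies a big-M encoding of y = max(0, x).

    lower < 0 < upper are assumed bounds on the pre-activation x;
    z is the binary variable that selects the active linear piece.
    """
    if z not in (0, 1) or not (lower - tol <= x <= upper + tol):
        return False
    return (y >= -tol                            # y >= 0
            and y >= x - tol                     # y >= x
            and y <= upper * z + tol             # z = 0 forces y <= 0
            and y <= x - lower * (1 - z) + tol)  # z = 1 forces y <= x


def feasible_outputs(x, lower, upper, candidates):
    """List the (y, z) pairs among the candidates that satisfy the encoding."""
    return [(y, z) for z in (0, 1) for y in candidates
            if relu_big_m_feasible(x, y, z, lower, upper)]


candidates = [-2.0, 0.0, 1.5, 3.0]
print(feasible_outputs(1.5, -5.0, 5.0, candidates))   # only y = 1.5 is feasible
print(feasible_outputs(-2.0, -5.0, 5.0, candidates))  # only y = 0.0 is feasible
```

For a fixed input, the four linear inequalities leave exactly the ReLU output feasible, which is why a solver such as CPLEX can search over inputs and outputs of the network simultaneously.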
Tables 3–8 show the results of Step 4 for and 4, respectively, where we denote the following:
Table 3.
Results of Steps 4 and 5 with and .
Table 4.
Results of Steps 4 and 5 with and .
Table 5.
Results of Steps 4 and 5 with and .
Table 6.
Results of Steps 4 and 5 with and .
Table 7.
Results of Steps 4 and 5 with and .
Table 8.
Results of Steps 4 and 5 with and .
- : a target value in for a property ;
- : a specified number of vertices in ;
- I: #I means the number of MILP instances in Step 4 (where #I = 12), and means the size of set of vectors generated from all feasible instances among the #I MILP instances in Step 4;
- IP-time: the average time (sec.) to solve one of the #I MILP instances to find a set of vectors .
Figure 10a–c illustrate some rank-2 chemical graphs with constructed from the vector obtained by solving the MILP in Step 4.
Figure 10.
An illustration of inferred rank-2 chemical graphs with : (a) , , , core size = 16, core height = 3, ; (b) , , , core size = 16, core height = 2, ; (c) , , , core size = 17, core height = 4, ; (d) , , , , , core size = 14, core height = 3, .
Figure 11a–c illustrate some rank-2 chemical graphs with constructed from the vector obtained by solving the MILP in Step 4.
Figure 11.
An illustration of inferred rank-2 chemical graphs : (a) , , , core size = 14, core height = 2, ; (b) , , , core size = 16, core height = 2, ; (c) , , , core size = 17, core height = 2, .
Figure 12a–c illustrate some rank-2 chemical graphs with constructed from the vector obtained by solving the MILP in Step 4.
Figure 12.
An illustration of inferred rank-2 chemical graphs : (a) , , , core size = 15, core height = 5, ; (b) , , , core size = 17, core height = 2, ; (c) , , , core size = 17, core height = 3, .
Step 5. In this step, we modified the algorithms proposed by Tamura et al. [25] and Yamashita et al. [26] to enumerate all rank-2 graphs such that for each . We stopped the execution when either the total number of graphs inferred over all vectors exceeded 100 or the execution time exceeded one hour.
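The stopping rule of Step 5 can be sketched as a generic driver loop. The enumeration algorithms of [25,26] are not reproduced here; `toy_enumerator` is a hypothetical stand-in that yields placeholder strings, and the names `enumerate_with_limits`, `max_graphs`, and `max_seconds` are assumptions for illustration.

```python
import time


def enumerate_with_limits(vectors, enumerate_graphs,
                          max_graphs=100, max_seconds=3600.0):
    """Collect graphs over all vectors until a count or time limit is exceeded."""
    start = time.monotonic()
    found = []
    for x in vectors:
        for graph in enumerate_graphs(x):
            found.append((x, graph))
            if len(found) > max_graphs:
                return found, "graph limit exceeded"
            # In this sketch the clock is only checked between two graphs.
            if time.monotonic() - start > max_seconds:
                return found, "time limit exceeded"
    return found, "complete"


def toy_enumerator(x):
    """Hypothetical stand-in: yields x labelled placeholders instead of graphs."""
    for i in range(x):
        yield f"graph-{x}-{i}"


graphs, status = enumerate_with_limits([60, 60, 60], toy_enumerator)
print(len(graphs), status)
```

Because the enumerator is consumed lazily, the loop stops as soon as the total count over all vectors passes the limit, without generating the remaining graphs.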
Tables 3–8 show the results of Step 5 for and 4, respectively, where we denote the following:
- G-time: the running time (sec.) to execute Step 5, where “>1 h” means that the execution time exceeds the limit.
We also conducted some additional experiments to demonstrate that our MILP-based method is flexible in controlling conditions on the inference of chemical graphs. In Step 3, we constructed an ANN for each of the three chemical properties Kow, Mp, and Bp, and formulated the inverse problem of each ANN as an MILP . Since the set of descriptors is common to all three properties Kow, Mp, and Bp, it is possible to infer a rank-2 chemical graph that satisfies a target value for each of the three properties at the same time (if one exists). We specify the size of graph so that , core size := 14, core height := 3, and , and set target values with , and in an MILP that consists of the three MILPs , and . The MILP was solved in 268.11 sec. and we obtained a rank-2 chemical graph illustrated in Figure 10d.
4. Discussion
In this paper, we proposed a new method for the inverse QSAR/QSPR for rank-2 chemical graphs by significantly enhancing the framework due to Azam et al. [18], Zhang et al. [20], and Ito et al. [21], and implemented it for inferring rank-2 chemical graphs using the algorithms for enumerating rank-2 chemical graphs due to Tamura et al. [25] and Yamashita et al. [26]. From the results of our computational experiments, we observe that the proposed method runs efficiently for an instance with non-hydrogen atoms up to Step 4 and an instance with non-hydrogen atoms up to Step 5. Due to this development, the ratio of chemical compounds covered in the PubChem database increased from 16.26% to 44.5%.
It is left as future work to apply our new method for the inverse QSAR/QSPR to a wider class of graphs. The ratio of the number of chemical graphs with rank at most 3 (resp., 4) to the number of all chemical graphs in the PubChem database is 68.8% (resp., 84.7%). Among rank-4 chemical compounds, Remdesivir, an antiviral medication that is being studied as a possible post-infection treatment for COVID-19, has a chemical graph G with , , , and . The number of polymer topologies with rank 3 (resp., 4) such that the maximum degree is at most 4 is 12 (resp., 73). Our MILP formulation can be easily extended to the case of rank 3 or 4 by replacing the current set of constraints for the scheme graph with a set of those for a new scheme graph that is designed for rank-3 or rank-4 polymer topologies.
Author Contributions
Conceptualization, H.N. and T.A.; methodology, H.N.; software, J.Z., C.W., and A.S.; validation, J.Z., C.W., A.S., and H.N.; formal analysis, H.N.; data resources, H.N. and T.A.; writing—original draft preparation, H.N.; writing—review and editing, T.A.; project administration, H.N.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.
Funding
H.N. and T.A. were partially supported by the Japan Society for the Promotion of Science, Japan, under Grant #18H04113.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. All Constraints in an MILP Formulation for Rank-2 Chemical Graphs
To formulate an MILP that represents a chemical graph , we distinguish a tuple from a tuple . For a tuple , let denote the tuple . Let . We call a tuple proper if
where the latter is assumed because otherwise G must consist of two atoms of . Assume that each tuple is proper. Let be a fictitious chemical element that represents null, call a tuple with fictitious, and define to be the set of all fictitious tuples; i.e., . To represent chemical elements in an MILP, we encode these elements into some integers denoted by . Assume that, for each element , is a positive integer and that .
Appendix A.1. Applicability Domain
We use the range-based method to define an applicability domain for our method. For this, we find the range (the minimum and maximum) of each descriptor over all relevant chemical compounds and represent each range as a set of linear constraints in the constraint set of our MILP formulation. Recall that stands for a set of chemical graphs used for constructing a prediction function. However, the number of examples in may not be large enough to capture general features of the structure of chemical graphs. For this reason, we also use a data set drawn from the whole set of chemical graphs in a database. Let denote the set of chemical graphs such that for each integer . Formally, the set of variables and constraints is given as follows.
AD constraints in :
constants:
Integers and ; An integer ;
An integer ;
variables for descriptors in x:
A real variable : represents ;
: represents the number of vertices of degree i in H;
: represents ;
, : represents the number of vertices of chemical element a in the core of H;
, : represents the number of vertices of chemical element a in the non-core of H;
, : represents the number of k-bonds in the core of H;
, : represents the number of k-bonds in the non-core of H;
, : represents the number of core edges in H that are assigned tuple ;
, : represents the number of non-core edges in H that are assigned tuple ;
constraints:
In the following, we derive an MILP that satisfies the condition in Theorem 2. Let , , and be given integers. We describe the set with several sets of constraints.
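The range-based applicability domain of this subsection can be illustrated outside the MILP as follows. This is a minimal sketch, assuming that each compound is already given as a numeric feature vector; the function names `descriptor_ranges` and `in_domain` and the toy data are hypothetical.

```python
def descriptor_ranges(feature_vectors):
    """Return the (minimum, maximum) of each descriptor over a data set."""
    columns = list(zip(*feature_vectors))
    return [(min(column), max(column)) for column in columns]


def in_domain(x, ranges):
    """Check that every descriptor of x lies within its observed range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(x, ranges))


# Toy data set with three descriptors per compound (illustrative values only).
data = [
    [10, 0.5, 3],
    [14, 0.7, 2],
    [12, 0.6, 4],
]
ranges = descriptor_ranges(data)
print(ranges)
print(in_domain([11, 0.55, 3], ranges))
print(in_domain([20, 0.55, 3], ranges))
```

In the MILP, each pair in `ranges` becomes two linear inequalities on the corresponding descriptor variable, so that an inferred feature vector cannot leave the region covered by the data.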
Appendix A.2. Construction of Scheme Graph and Tree-Extension
We infer a subgraph H such that the maximum degree is , , and . For this, we first construct the -tree-extension of the scheme graph . We use the following notations: For and , let (resp., ) denote the set of indices i of edges such that the tail (resp., head) of is . Let , , and .
As described in Section 2.3.1, some edge may be replaced with a subpath of , which consists of the roots of trees . We assign color i to the vertices in such a subpath by setting a variable of each vertex to be i. For each edge , we prepare a binary variable to denote that edge is used (resp., not used) in a selected graph H when (resp., ). We also include constraints necessary for the variables to satisfy a degree condition at each of the vertices , and , .
constants:
Integers , , , and ;
, : a lower bound on the out-degree of vertex in H;
, : a lower bound on the in-degree of vertex in H;
, : an upper bound on the out-degree of vertex in H;
, : an upper bound on the in-degree of vertex in H;
variables:
, : represents edge (, ) ( ⇔ edge is used in H);
, , : (resp., ) represents direction (resp., ), where (resp., ) ⇔ edge is used in H and direction (resp., ) is assigned to edge ;
, : represents the color assigned to vertex ( ⇔ vertex is assigned color c);
, : the number of vertices with color c;
, : the out-degree of vertex in the core of H;
, : the in-degree of vertex in the core of H;
, , ( ⇔ );
constraints:
Appendix A.3. Specification for Chemical Graphs with Rank 2
To generate any of the three rank-2 polymer topologies in , we use the scheme graph , in Figure 2d, where , , , and . Recall that each color is assigned to edge . We impose some more constraints on the degree of each of the vertices , and , so that the core of a selected graph H satisfies one of the three least simple graphs in Figure 2a–c. We also let a variable mean the topological parameter of a selected subgraph H.
constants:
, ,
, , , , , ,
, , , , , ,
, , , , , ,
, , , , , ,
, , , ,
, , , ,
, , , ,
, , , ,
variables:
: The topology-parameter for rank 2;
constraints:
Appendix A.4. Selecting A Subgraph
We prepare a binary variable (resp., ) for each vertex in tree (resp., in tree ). We include constraints so that the path is partitioned into subpaths , , where possibly some is empty, and the resulting subgraph H becomes a connected rank-2 graph with , , and .
constants:
Integers , ;
Prepare the set of the indices of children of a vertex , the index of the parent of a non-root vertex , and the set of indices i such that the height of a vertex is h in the rooted tree ;
variables:
, , : represents vertex ( ⇔ vertex is used in H and edge is used in H);
, , : represents vertex ( ⇔ vertex is used in H and edge is used in H);
, : represents edge , where and are fictitious edges ( ⇔ edge is used in H);
constraints:
Appendix A.5. Assigning Multiplicity
We prepare an integer variable or for each edge e in the -tree-extension of the scheme graph to denote the multiplicity of e in a selected graph H and include necessary constraints for the variables to satisfy in H.
variables:
, : represents the multiplicity of edge , where if edge is not in H;
, , : with (resp., ) represents the multiplicity of edge (resp., );
, : represents the multiplicity of edge ;
, , : represents the multiplicity of edge ;
constraints:
Appendix A.6. Assigning Chemical Elements and Valence Condition
We include constraints so that each vertex v in a selected graph H satisfies the valence condition; i.e., . With these constraints, a rank-2 chemical graph on a selected subgraph H will be constructed.
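As a plain-Python illustration of the valence condition, the sketch below assumes a hydrogen-suppressed representation in which the sum of bond multiplicities at a vertex must not exceed the valence of its element, and the remaining valence is filled with hydrogen atoms. The valence table, the dictionary-based graph encoding, and the function names are assumptions made for this example.

```python
# Common valences of a few elements (an assumption for this example).
VALENCE = {"C": 4, "N": 3, "O": 2}


def bond_sum(vertex, bonds):
    """Sum of the multiplicities of the bonds incident to a vertex."""
    return sum(k for (u, v), k in bonds.items() if vertex in (u, v))


def satisfies_valence(atoms, bonds):
    """True if no vertex uses more bonds than its element's valence allows."""
    return all(bond_sum(v, bonds) <= VALENCE[a] for v, a in atoms.items())


def implicit_hydrogens(atoms, bonds):
    """Number of hydrogen atoms needed to saturate every vertex."""
    return sum(VALENCE[a] - bond_sum(v, bonds) for v, a in atoms.items())


# Acetic acid without hydrogens: C1-C2, C2=O3, C2-O4.
atoms = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
print(satisfies_valence(atoms, bonds), implicit_hydrogens(atoms, bonds))
```

Raising the multiplicity of the bond (2, 4) to 2 would give vertex 2 a bond sum of 5, which violates the condition for carbon.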
constants:
A set of chemical elements, where denotes null;
A coding , such that ; , ; and if ;
Let and denote and , respectively;
A valence function: ;
variables:
, , :
with (resp., ) represents (resp., );
, , , :
⇔ for and for ;
, , , :
⇔ the multiplicity of edge in H is k;
, , , :
⇔ the multiplicity of edge , (or , ) in H is k;
, , :
⇔ the multiplicity of edge in H is k;
, , , :
⇔ the multiplicity of edge in H is k;
constraints:
Appendix A.7. Descriptors for Mass, the Numbers of Elements and Bonds
We include constraints to compute descriptors , (, () and according to the definitions in Section 2.1.2.
constants:
A function ; Let denote the observed mass of a chemical element , and
define ;
variables:
, ;
, ;
;
, ;
, ;
: the number of hydrogen atoms to be included in G;
constraints:
Appendix A.8. Descriptor for the Number of Specified Degree
We include constraints to compute descriptors () according to the definitions in Section 2.1.2. We also add constraints so that the maximum degree of a non-core vertex in H is at most 3 (resp., equal to 4) when (resp., ).
variables:
, , :
represents for or for ;
, , , :
⇔ ;
, ;
constraints:
Appendix A.9. Descriptor for the Number of Adjacency-Configurations
We include constraints to compute descriptors and () according to the definitions in Section 2.1.2.
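For a concrete graph, the adjacency-configuration counts can be computed directly, as in the following sketch. It assumes that an adjacency-configuration is a tuple (a, b, k) consisting of the elements of the two end-vertices of an edge and its bond multiplicity, and it merges (a, b, k) with (b, a, k) by sorting the elements. It also ignores the distinction between core and non-core edges; the graph encoding and function name are hypothetical.

```python
from collections import Counter


def adjacency_configurations(atoms, bonds):
    """Count edges by (element, element, multiplicity), ignoring orientation."""
    counts = Counter()
    for (u, v), k in bonds.items():
        a, b = sorted((atoms[u], atoms[v]))
        counts[(a, b, k)] += 1
    return counts


# Acetic acid without hydrogens: C1-C2, C2=O3, C2-O4.
atoms = {1: "C", 2: "C", 3: "O", 4: "O"}
bonds = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
ac = adjacency_configurations(atoms, bonds)
print(dict(ac))
```

In the MILP the same counts are obtained from binary variables that record which tuple is assigned to each edge; the sketch only shows the quantity being computed.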
constants:
A set of proper tuples ;
The set ;
variables:
, , :
⇔ edge is assigned tuple ; i.e., ;
, , :
⇔ edge is assigned tuple ; i.e., ;
, , , :
⇔ edge , (or , ) is assigned tuple ; i.e.,
;
, , , :
⇔ edge is assigned tuple ; i.e., ;
, ;
, ;
constraints:
Appendix A.10. Descriptor for 1-Path Connectivity
We include constraints to compute descriptor according to the definition.
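The descriptor can be evaluated on a given graph as sketched below, assuming the usual definition of the 1-path connectivity as the sum of 1/sqrt(deg(u) deg(v)) over all edges uv of the hydrogen-suppressed graph. The edge-list encoding and the function name are assumptions for this example.

```python
import math


def one_path_connectivity(edges):
    """Sum of 1 / sqrt(deg(u) * deg(v)) over all edges (u, v)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(1.0 / math.sqrt(degree[u] * degree[v]) for u, v in edges)


path3 = [(1, 2), (2, 3)]                       # propane skeleton
ring6 = [(i, i % 6 + 1) for i in range(1, 7)]  # six-membered ring
print(one_path_connectivity(path3))
print(one_path_connectivity(ring6))
```

Since terms such as 1/sqrt(2) are irrational, an MILP can only match this value up to a tolerance, which is consistent with the tolerance used in the constraints of this subsection.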
variables:
A real variable ;
, , , :
⇔ and ,
where is in H if and only if ;
, , : ⇔
and where is in H if and only if ;
, , , : ⇔
and for
(or and for ),
where edge or is in H if and only if ;
, , , , :
⇔ and ,
where is in H if and only if ;
constraints:
where a tolerance is set to be .
Appendix A.11. Constraints for Left-Heavy Trees
To reduce the number of rank-2 chemical graphs G that are isomorphic to each other, we include in some additional constraints so that each subtree selected from tree or satisfies the following property:
for any two siblings and , in , the number of descendants of is not smaller than that of .
For this, we define to be the number of descendants of a vertex (or ) in a selected graph H and , , . We include constraints that compute the values of recursively.
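The left-heavy property and the recursive computation of descendant counts can be sketched as follows. The rooted tree is assumed to be given as a dictionary that maps each vertex to the ordered list of its children (left to right); this encoding and the function names are hypothetical.

```python
def descendant_counts(children, root):
    """Compute the number of descendants of every vertex below the root."""
    counts = {}

    def visit(v):
        counts[v] = sum(1 + visit(c) for c in children.get(v, []))
        return counts[v]

    visit(root)
    return counts


def is_left_heavy(children, root):
    """True if each sibling has at least as many descendants as its right neighbor."""
    counts = descendant_counts(children, root)
    return all(counts[a] >= counts[b]
               for v in counts
               for a, b in zip(children.get(v, []), children.get(v, [])[1:]))


tree = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
mirrored = {"r": ["b", "a"], "a": ["c", "d"], "b": ["e"]}
print(descendant_counts(tree, "r"))
print(is_left_heavy(tree, "r"), is_left_heavy(mirrored, "r"))
```

The two trees are isomorphic as rooted trees, but only the first ordering is left-heavy, which is how the additional constraints reduce the number of isomorphic solutions.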
variables:
, , : the number of descendants of vertex in tree for and vertex in tree for ;
constraints:
References
- Miyao, T.; Kaneko, H.; Funatsu, K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 2016, 56, 286–299.
- Skvortsova, M.I.; Baskin, I.I.; Slovokhotova, O.L.; Palyulin, V.A.; Zefirov, N.S. Inverse problem in QSAR/QSPR studies for the case of topological indices characterizing molecular shape (Kier indices). J. Chem. Inf. Comput. Sci. 1993, 33, 630–634.
- Ikebata, H.; Hongo, K.; Isomura, T.; Maezono, R.; Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017, 31, 379–391.
- Rupakheti, C.; Virshup, A.; Yang, W.; Beratan, D.N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 2015, 55, 529–537.
- Fujiwara, H.; Wang, J.; Zhao, L.; Nagamochi, H.; Akutsu, T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 2008, 48, 1345–1357.
- Kerber, A.; Laue, R.; Grüner, T.; Meringer, M. MOLGEN 4.0. Match Commun. Math. Comput. Chem. 1998, 37, 205–208.
- Li, J.; Nagamochi, H.; Akutsu, T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 15, 633–646.
- Reymond, J.L. The chemical space project. Accounts Chem. Res. 2015, 48, 722–730.
- Akutsu, T.; Fukagawa, D.; Jansson, J.; Sadakane, K. Inferring a Graph From Path Frequency. Discret. Appl. Math. 2012, 160, 1416–1428.
- Nagamochi, H. A detachment algorithm for inferring a graph from path frequency. Algorithmica 2009, 53, 207–224.
- Fazekas, S.Z.; Ito, H.; Okuno, Y.; Seki, S.; Taneishi, K. On computational complexity of graph inference from counting. Nat. Comput. 2013, 12, 589–603.
- Bohacek, R.S.; McMartin, C.; Guida, W.C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996, 16, 3–50.
- Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276.
- Segler, M.H.S.; Kogej, T.; Tyrchan, C.; Waller, M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2017, 4, 120–131.
- Yang, X.; Zhang, J.; Yoshizoe, K.; Terayama, K.; Tsuda, K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017, 18, 972–976.
- Kusner, M.J.; Paige, B.; Hernández-Lobato, J.M. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1945–1954.
- Akutsu, T.; Nagamochi, H. A Mixed Integer Linear Programming Formulation to Artificial Neural Networks. In Proceedings of the 2nd International Conference on Information Science and Systems, Tokyo, Japan, 16–19 March 2019; pp. 215–220.
- Azam, N.A.; Chiewvanichakorn, R.; Zhang, F.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies, Valletta, Malta, 24–26 February 2020; Volume 3, pp. 101–108.
- Chiewvanichakorn, R.; Wang, C.; Zhang, Z.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the ICBBB2020, Kyoto, Japan, 19–22 January 2020.
- Zhang, F.; Zhu, J.; Chiewvanichakorn, R.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees. In Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan, 22–25 September 2020.
- Ito, R.; Azam, N.A.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming, 2020. In Proceedings of the BIOCOMP 2020, Las Vegas, NV, USA, 27–30 July 2020.
- Suzuki, M.; Nagamochi, H.; Akutsu, T. Efficient enumeration of monocyclic chemical graphs with given path frequencies. J. Cheminform. 2014, 6, 31.
- Tezuka, Y.; Oike, H. Topological polymer chemistry. Prog. Polym. Sci. 2002, 27, 1069–1122.
- Netzeva, T.I.; Worth, A.P.; Aldenberg, T.; Benigni, R.; Cronin, M.T.; Gramatica, P.; Jaworska, J.S.; Kahn, S.; Klopman, G.; Marchant, C.A.; et al. Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships: The report and recommendations of ECVAM workshop 52. Altern. Lab. Anim. 2005, 33, 155–173.
- Tamura, Y.; Nishiyama, Y.; Wang, C.; Sun, Y.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. Enumerating chemical graphs with mono-block 2-augmented tree structure from given upper and lower bounds on path frequencies. arXiv 2020, arXiv:2004.06367.
- Yamashita, K.; Masui, R.; Zhou, X.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. Enumerating chemical graphs with two disjoint cycles satisfying given path frequency specifications. arXiv 2020, arXiv:2004.08381.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).