A novel method for inference of chemical compounds of cycle index two with desired properties based on artificial neural networks and integer programming

: Inference of chemical compounds with desired properties is important for drug design, chemo-informatics, and bioinformatics, to which various algorithmic and machine learning techniques have been applied. Recently, a novel method has been proposed for this inference problem using both artiﬁcial neural networks (ANN) and mixed integer linear programming (MILP). This method consists of the training phase and the inverse prediction phase. In the training phase, an ANN is trained so that the output of the ANN takes a value nearly equal to a given chemical property for each sample. In the inverse prediction phase, a chemical structure is inferred using MILP and enumeration so that the structure can have a desired output value for the trained ANN. However, the framework has been applied only to the case of acyclic and monocyclic chemical compounds so far. In this paper, we signiﬁcantly extend the framework and present a new method for the inference problem for rank-2 chemical compounds (chemical graphs with cycle index 2). The results of computational experiments using such chemical properties as octanol/water partition coefﬁcient, melting point, and boiling point suggest that the proposed method is much more useful than the previous method.


Introduction
Inference of chemical compounds with desired properties is important for computer-aided drug design.Since drug design is one of the major targets of chemo-informatics and bioinformatics, it is also important in these areas.Indeed, this problem has been extensively studied in chemo-informatics under the name of inverse QSAR/QSPR [1,2], where QSAR/QSPR denotes Quantitative Structure Activity/Property Relationships.Since chemical compounds are usually represented as undirected graphs, this problem is important also from graph theoretic and algorithmic viewpoints.
Inverse QSAR/QSPR is often formulated as an optimization problem to find a chemical graph maximizing (or minimizing) an objective function under various constraints, where objective functions reflect certain chemical activities or properties.In many cases, objective functions are derived from a set of training data consisting of known molecules and their activities/properties using statistical and machine learning methods.
In both forward and inverse QSAR/QSPR, chemical graphs are often represented as vectors of real or integer numbers because it is difficult to directly handle graphs using statistical and machine learning methods.Elements of these vectors are called descriptors in QSAR/QSPR studies, and these vectors correspond to feature vectors in machine learning.Using these chemical descriptors, various heuristic and statistical methods have been developed for finding optimal or nearly optimal graph structures under given objective functions [1,3,4].In many cases, inference or enumeration of graph structures from a given feature vector is a crucial subtask in these methods.Various methods have been developed for this enumeration problem [5][6][7][8] and the computational complexity of the inference problem has been analyzed [9][10][11].On the other hand, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S, may exceed 10 60 [12].
As in many other fields, Artificial Neural Network (ANN) and deep learning technologies have recently been applied to inverse QSAR/QSPR.For example, variational autoencoders [13], recurrent neural networks [14,15], and grammar variational autoencoders [16] have been applied.In these approaches, new chemical graphs are generated by solving a kind of inverse problems on neural networks, where neural networks are trained using known chemical compound/activity pairs.However, the optimality of the solution is not necessarily guaranteed in these approaches.In order to guarantee the optimality, a novel approach has been proposed [17] for ANNs with ReLU activation functions and sigmoid activation functions, using mixed integer linear programming (MILP).In their approach, activation functions on neurons are efficiently encoded as piece-wise linear functions so as to represent ReLU functions exactly and sigmoid functions approximately.
Recently, a new framework has been proposed [18][19][20] by combining two previous approaches; efficient enumeration of tree-like graphs [5], and MILP-based formulation of the inverse problem on ANNs [17].This combined framework for inverse QSAR/QSPR mainly consists of two phases, one for constructing a prediction function to a chemical property, and the other for constructing graphs based on the inverse of the prediction function.The first phase solves (I) PREDICTION PROBLEM, where a prediction function ψ N on a chemical property π is constructed with an ANN N using a data set of chemical compounds G and their values a(G) of π.The second phase solves (II) INVERSE PROBLEM, where (II-a) given a target value y * of the chemical property π, a feature vector x * is inferred from the trained ANN N so that ψ N (x * ) is close to y * and (II-b) then a set of chemical structures G * such that f (G * ) = x * is enumerated.In (II-b) of the above-mentioned previous methods [18][19][20], an MILP is formulated for acyclic chemical compounds.Their methods were applicable only to acyclic chemical graphs (i.e., tree-structured chemical graphs), where the ratio of acyclic chemical graphs in a major chemical database (PubChem) is 2.91%.Afterward, Ito et al. [21] designed a method of inferring monocyclic chemical graphs (chemical graphs with cycle index or rank 1) by formulating a new MILP and using an efficient algorithm for enumerating monocyclic chemical graphs [22].This still leaves a big limitation because the ratio of acyclic and monocyclic chemical graphs in the chemical database PubChem is only 16.26%.
To break this limitation, we significantly extend the MILP-based approach for inverse QSAR/QSPR so that "rank-2 chemical compounds" (chemical graphs with cycle index or rank 2) can be efficiently handled, where the ratio of chemical graphs with rank at most 2 in the database PubChem is 44.5%.Note that there are three different topological structures, called polymer-topologies over all rank-2 chemical compounds.In particular, we propose a novel MILP formulation for (II-a) along with a new set of descriptors.One big advantage of this new formulation is that an MILP instance has a solution if and only if there exists a rank-2 chemical graph satisfying given constraints, which is useful to significantly reduce redundant search in (II-b).We conducted computational experiments to infer rank-2 chemical compounds on several chemical properties.
The paper is organized as follows.Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors.Section 2.2 reviews the framework for inferring chemical compounds based on ANNs and MILPs.Section 2.3 introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices, where our MILP requires only O(n) variables and constraints when the maximum height of subtrees in G is constant.Section 3 reports the results on some computational experiments conducted for chemical properties such as octanol/water partition coefficient, melting point, and boiling point.Section 4 makes some concluding remarks.Appendix A provides the detail of all variables and constraints in our MILP formulation.

Preliminary
This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.

Multigraphs and Graphs
Let R and Z denote the sets of reals and non-negative integers, respectively.For two integers a and b, let [a, b] denote the set of integers i with a ≤ i ≤ b.

Multigraphs
A multigraph is defined to be a pair (V, E) of a vertex set V and an edge set E such that each edge e ∈ E joins two vertices u, v ∈ V (possibly u = v) and the vertices u and v are called the end-vertices of the edge e, and let V(e) denote the set of the end-vertices of an edge e ∈ E, where an edge e with |V(e)| = 1 is called a loop.We denote the vertex and edge sets of a multigraph M by V(M) and E(M), respectively.A path with end-vertices u and v is called a u, v-path, and the length of a path is defined to be the number of edges in the path.
Let M be a multigraph.An edge e ∈ E(M) is called multiple (to an edge e ∈ E(M)) if there is another edge e ∈ E(M) with V(e) = V(e ).For a vertex v ∈ V(M), the set of neighbors of v in M is denoted by N M (v), and the degree deg M (v) of v is defined to be the number of times an edge in E(M) is incident to v; i.e., deg |V(e)| = 1}|.A multigraph is called simple if it has no loop and there is at most one edge between any two vertices.We observe that the sum of the degrees over all vertices is twice the number of edges in any multigraph M; i.e., For a subset X of vertices in M, let M − X denote the multigraph obtained from M by removing the vertices in X and any edge incident to a vertex in X.An operation of subdividing a non-loop edge (resp., loop) e ∈ E(M) with V(e) = {v 1 , v 2 } (resp., V(e) = {v 1 = v 2 }) is to replace e with two new edges e 1 and e 2 incident to a new vertex v e such that each e i is incident to v i .An operation of contracting a vertex u of degree 2 in M is to replace the two edges uv and uv incident to u with a single edge vv removing vertex u, where the resulting edge is a loop when v = v .The rank r(M) of a multigraph M is defined to be the minimum number of edges to be removed to make the multigraph acyclic.We call a multigraph M with r(M) = k a rank-k graph.Let V deg,i (M) denote the set of vertices of degree i in M. The core Cr(M) of M is defined to be an induced subgraph M * that is obtained from M := M by setting M := M − V deg,1 (M ) repeatedly until M * contains at most two vertices or consists of vertices of degree at least 2. The core M * of a connected multigraph M consists of a single vertex (resp., two vertices) if and only if M is a tree with an even (resp., odd) diameter.A vertex (resp., an edge) in M is called a core vertex (resp., core edge) if it is contained in the core of M and is called a non-core vertex (resp., non-core edge) otherwise.The core size cs(M) is defined to be the number of core vertices of M, and the core height ch(M) is defined to be the maximum length of a path between a vertex v ∈ V(M * ) to a leaf of M without passing through any core edge.The set of non-core edges induces a collection of subtrees, each of which we call a non-core component of M, where each non-core component C contains exactly one core vertex v C and we regard C as a tree rooted at v C .Let C be a non-core component of M. The height height(v) of a vertex v in C is defined to be the maximum length of a path from v to a leaf u in the descendants of v.
A multigraph is called a polymer topology if it is connected and the degree of every vertex is at least 3. Tezuka and Oike [23] pointed out that a classification of polymer topologies will lay a foundation for elucidation of structural relationships between different macro-chemical molecules and their synthetic pathways.For integers r ≥ 0 and d ≥ 3, let P T (r, d) denote the set of all rank-r polymer topologies with maximum degree at most d. Figure 1 illustrates the three rank-2 polymer topologies in P T (2,4).For a polymer topology M, the least simple graph S(M) of M is defined to be a simple graph obtained from M by subdividing each loop in M with two new vertices of degree 2 and subdividing all multiple edges (except for one) between every two adjacent vertices in M. Note that |V(S(M))| = |V(M)| + r + s for the rank r of M and the number s of loops in M.
The polymer topology Pt(M) of a multigraph M with r(M) ≥ 2 is defined to be a multigraph M of degree at least 3 that is obtained from the core Cr(M) by contracting all vertices of degree 2. Note that r(Pt(M)) = r(M).Figure 2a-c illustrate the least simple graph S(M) of each polymer topology M ∈ P T (2,4), where Figure 2d illustrates a graph that contains all least simple graphs.
An illustration of the least simple graphs of the rank-2 polymer topologies M 1 , M 2 , M 3 ∈ P T (2,4) in Figure 1 and a where each edge u i u j is directed from one end-vertex u i to the other end-vertex u j with i < j, and ), a 7 = (u 3 , u 4 )}, and the edges in E 1 (resp., E 2 and E 3 ) are depicted with dashed (resp., dotted and solid) lines.

Graphs
Let H = (V, E) be a graph with a set V of vertices and a set E of edges.Define the 1-path connectivity κ 1 (H) of H to be Let H be a rank-2 connected graph such that the maximum degree is at most 4. We see that H contains two vertices v a and v b such that either there are three disjoint paths between v a and v b or H contains two edge disjoint cycles C and C , which are joined with a path between v a and v b (possibly v a = v b ).We introduce the topological parameter θ(H) of rank-2 connected graph H as follows.When H has three disjoint paths between v a and v b , define θ(H) to be the minimum number of edges along a path between v a and v b .When H contains two edge disjoint cycles C and C , which are joined with a path P between v a and v b (possibly v a = v b ), define θ(H) to be −|E(P)|.
For positive integers a, b and c with b ≥ 2, let T(a, b, c) denote the rooted tree such that the number of children of the root is a, the number of children of each non-root internal vertex is b and the distance from the root to each leaf is c.In the rooted tree T(a, b, c), we denote the vertices by v 1 , v 2 , . . ., v n (n = a(b c − 1)/(b−1) + 1) with a breadth-first-search order, and denote the edge between a vertex v i with i ∈ [2, n] and its parent by e i .For each vertex v i in T(a, b, c), let Cld(i) denote the set of indices j such that v j is a child of v i , and prt(i) denote the index j such that v j is the parent of v i when i ∈ [2, n].

Modeling of Chemical Compounds Chemical Graphs
We represent the graph structure of a chemical compound as a graph with labels on vertices and multiplicity on edges in a hydrogen-suppressed model.Nearly 68.5% (resp., 99%) of the rank-2 chemical graphs with at most 200 non-hydrogen atoms registered in chemical database PubChem have a maximum degree at most 3 (resp., 4) for all non-core vertices in the hydrogen-suppressed model.
Let Λ be a set of labels, each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that Λ does not contain H (hydrogen). Let mass(a) and val(a) denote the mass and valence of a chemical element a ∈ Λ, respectively.In our model, we use integers mass * (a) = 10 • mass(a) , a ∈ Λ and assume that each chemical element a ∈ Λ has a unique valence val(a) ∈ [1,4].
We introduce a total order < over the elements in Λ according to their mass values; i.e., we write a < b for chemical elements a, b ∈ Λ with mass(a) < mass(b).Choose a set A pair of two atoms a and b joined with a bond of multiplicity k is denoted by a tuple γ = (a, b, k) ∈ Γ, called the adjacency-configuration of the atom pair.
We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage.A chemical graph over Λ and Γ is defined to be a tuple Let G(Λ, Γ) denote the set of chemical graphs over Λ and Γ.

Descriptors
In our method, we use only graph-theoretical descriptors for defining a feature vector, which facilitates our designing an algorithm for constructing graphs.Given a chemical graph G = (H, α, β), we define a feature vector f (G) that consists of the following 14 kinds of descriptors: - The number k of descriptors in our feature vector

A Method for Inferring Chemical Graphs
This section reviews the framework that solves the inverse QSAR/QSPR by using MILPs [18].For a specified chemical property π such as boiling point, we denote by a(G) the observed value of the property π for a chemical compound G.As the Phase 1, we solve (I) PREDICTION PROBLEM with the following three steps.

Phase 1.
Step 1: Let DB be a set of chemical graphs.For a specified chemical property π, choose a class G of graphs such as acyclic graphs or monocyclic graphs.Prepare a data set Step 2: Introduce a feature function f : G → R k for a positive integer k.We call f (G) the feature vector of G ∈ G, and call each entry of a vector f (G) a descriptor of G. See Figure 4 for an illustration of Step 2. Step 3: Construct a prediction function ψ N with an ANN N that, given a vector in R k , returns a real in the range [a, a] so that ψ N ( f (G)) takes a value nearly equal to a(G) for many chemical graphs in D.

R k
See Figure 5 for an illustration of Step 3. Next we explain how to solve the inverse problem to the prediction in Phase 1 using an MILP formulation.A vector x ∈ R k is called admissible if there is a graph G ∈ G such that f (G) = x [18].Let A denote the set of admissible vectors x ∈ R k .In this paper, we use the range-based method to define an applicability domain (AD) [24] to our inverse QSAR/QSPR.Set x j and x j to be the minimum and maximum values of the j-th descriptor x j in f (G i ) over all graphs G i , i = 1, 2, . . ., m (where we possibly normalize some descriptors such as ce co a (G), which is normalized with ce co a (G)/n(H)).Define our AD D to be the set of vectors x ∈ R k such that x j ≤ x j ≤ x j for the variable x j of each j-th descriptor, j = 1, 2, . . ., k.As the second phase, we solve (II) INVERSE PROBLEM for the inverse QSAR/QSPR by treating the following inference problems.

(II-a) Inference of Vectors
To treat Problem (II-a), we use MILPs for inferring vectors in ANNs [17].In MILPs, we can easily impose additional linear constraints or fix some variables to specified constants.We include into the MILP a linear constraint such that x ∈ D to obtain the next result.
Theorem 1.Let N be an ANN with a piecewise-linear activation function for an input vector x ∈ R k , n A denote the number of nodes in the architecture and n B denote the total number of break-points over all activation functions.Then there is an MILP M(x, y; C 1 ) that consists of variable vectors x ∈ D (⊆ R k ), y ∈ R, and an auxiliary variable vector z ∈ R p for some integer p = O(n A + n B ) and a set C 1 of O(n A + n B ) constraints on these variables such that: ψ N (x * ) = y * if and only if there is a vector (x * , y * ) feasible to M(x, y; C 1 ).
See Appendix A for the set of constraints to define our AD D in the MILP M(x, y; C 1 ) in Theorem 1.
To attain the admissibility of inferred vector x * , we also introduce a variable vector g ∈ R q for some integer q and a set C 2 of constraints on x and g such that x * ∈ A holds in the following sense: The Phase 2 consists of the next two steps.

Phase 2.
Step 4: Formulate Problem (II-a) as the above MILP M(x, y, g; C 1 , C 2 ) based on G and N .Find a set detects that no such vector x exists.
Step 5: To solve Problem (II-b), enumerate all graphs G * ∈ G such that f (G * ) = x * for each vector x * ∈ F * .See Figure 7 for an illustration of Step 5.In this paper, we set a graph class G to be the set of rank-2 graphs.In Step 4, we solve an MILP M(x, g; C 2 ) that is formulated on a novel idea of representing rank-2 chemical graphs, as will be discussed in Section 2.3.2.In Step 5, we use branch-and-bound algorithms for enumerating rank-2 chemical compounds [25,26].

Representing Rank-2 Chemical Graphs
This section introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices.

Scheme Graphs and Tree-Extensions
Given positive integers n * and p, a graph with n * vertices and p edges can be represented as a subgraph of a complete graph K n * with n * (n * − 1)/2 edges.However, formulating this as an MILP may require to prepare Ω((n * ) 2 ) variables and constraints.To reduce the number of variables and constraints in an MILP that represents a rank-2 graph, we decompose a rank-2 graph G into the core and non-core of G so that the core is represented by one of the three rank-2 polymer topologies and the non-core is a collection of trees in which the height is bounded by the core height of G.We do not specify how many subtrees will be attached to each edge in the polymer topology in advance, since otherwise we would need a different MILP for a distinct combination of such assignments of subtrees.Instead we allow each edge in a polymer topology to collect a necessary number of subtrees in our MILP (see the next section for more detail).In this section, we introduce a "scheme graph" to represent three possible rank-2 polymer topologies, an "extension" of the scheme graph to represent the core of a rank-2 graph and a "tree-extension" to represent a combination of the core and non-core of a rank-2 graph, so that any of the three kinds of rank-2 polymer topologies can be selected in a single MILP formulation.

Scheme Graphs
Formally, we define the scheme graph for rank 2 to be a pair (K, E ) of a multigraph K and an ordered partition E = (E 1 , E 2 , E 3 ) of the edge set E(K). Figure 2d illustrates the scheme graph ).An edge in E 1 is called a semi-edge, an edge in E 2 is called a virtual edge and an edge in E 3 is called a real edge.

Extensions of Scheme Graphs
Based on the scheme graph (K, E ), we construct the core of a rank-2 graph H as an "extension," which is defined as follows (see also Figure 8).The extension H core in Figure 9a An extension of the scheme graph (K, E ) is defined to be a simple graph obtained from K by using each real edge e = uv ∈ E 3 , by eliminating or replacing each virtual edge e = uv ∈ E 2 (resp., semi-edge e = uv ∈ E 1 ) with a u, v-path of length at least two (resp., 1) in the core of H, where a u, v-path of length 1 means an edge uv. Figure 9a illustrates an extension H core of the scheme graph (K, E ) which is obtained by removing virtual edges a 4 , a 5 ∈ E 2 and by replacing semi-edge a 1 ∈ E 1 with a path (u 1,1 , v 1,1 , v 2,1 , u 4,1 ), semi-edge a 2 ∈ E 1 with a path (u 2,1 , v 3,1 , v 4,1 , v 5,1 , u 3,1 ) and by using semi-edge a 3 ∈ E 1 and real edges a 6 , a 7 ∈ E 3 .The extension H core in Figure 9a is isomorphic to the core of the rank-2 graph H in Figure 9b.Observe that each of the least simple graphs S(M i ), i = 1, 2, 3 in Figure 2 is obtained as an extension of the scheme graph (K, E ) in Figure 2d.

Tree-Extensions
Let s * = |V(K)| = 4 denote the number of vertices in the scheme graph.For non-negative integers a, b and c, we consider a rank-2 graph H such that cs(H) = s * + a = 4 + a, ch(H) = b and the maximum degree of a core vertex is at most c.We define an "(a, b, c)-tree-extension" as a minimal supergraph of all such rank-2 graphs H. Formally, the (a, b, c)-tree-extension (or a tree-extension) is defined to be the graph obtained by augmenting the graph K as follows: For each s ∈ [1, s * ], let the root of rooted tree S s be equal to the vertex u s and denote by u s,i the copy of the i-th vertex of T(c − 2, c − 1, b) in S s (see Figure 8a).(ii) Create a new path (v 1,1 , v 2,1 , . . ., v a,1 ) with a vertices, where the edge between v t,1 and v t+1,1 is denoted by e t+1 (see Figure 8c).For each t ∈ [1, a], create a copy T t of the rooted tree T(c − 2, c − 1, b), let the root of rooted tree T t be equal to the vertex v 1,1 and denote by v t,i the copy of the i-th vertex of T(c − 2, c − 1, b) in T t (see Figure 8b).(iii) For every pair (s, t) with s ∈ [1, s * ] and t ∈ [1, a], join vertices u s,1 and v t,1 with an edge u s,1 v t,1 (see Figure 8c).
Figure 8 illustrates the (3, 2, 4)-tree-extension of the scheme graph.We show how a rank-2 graph can be constructed as a subgraph of a tree-extension with some example.Figure 9b illustrates a rank-2 graph H with n(H) = 21, cs(H) = 9, ch(H) = 2 and θ(H) = 1, where the maximum degree of a non-core vertex is 3. To prepare a tree-extension so that the graph H can be a subgraph of the tree-extension, we set cs * := cs(H), a := t * := cs * − s * = 5, b := ch * := ch(H) = 2 and c := d max := 3. Figure 9c illustrates a subgraph H of the (t * = 5, ch * = 2, d max = 3)-tree-extension such that H is isomorphic to the rank-2 graph H.

MILPs for Rank-2 Chemical Graphs
We present an outline of our MILP M(x, g; C 2 ) in Step 4 of the framework.For integers d max , n * , cs * , ch * , θ * ∈ Z, let H(d max , n * , cs * , ch * , θ * ) denote the set of rank-2 graphs H such that the degree of each core vertex is at most 4, the degree of each non-core vertex is at most d max , n(H) = n * , cs(H) = cs * , ch(H) = ch * and θ(H) = θ * .In this paper, we obtain the following result.Note that our MILP requires only O(n * ) variables and constraints when the maximum core height of a subtree in the non-core of G * and |Γ| are constant.We formulate an MILP in Theorem 2 so that such a graph H is selected as a subgraph of the scheme graph.
We explain the basic idea of our MILP.Define where n tree and n in are the numbers of vertices and non-leaf vertices in the rooted tree T(d max − 2, d max − 1, ch * ), respectively.The MILP mainly consists of the following three types of constraints.

1.
Constraints for selecting a rank-2 graph H as a subgraph of the (t * , ch * , d max )-tree-extension of the scheme graph (K, E ); 2.
Constraints for assigning chemical elements to vertices and multiplicity to edges to determine a chemical graph G = (H, α, β); 3.
Constraints for computing descriptors from the selected rank-2 chemical graph G; and 4.
Constraints for reducing the number of rank-2 chemical graphs that are isomorphic to each other but can be represented by the above constraints.
In the constraints of 1, we treat each edge in the tree-extension as a directed edge because describing some condition for H to belong to H(d max , n * , cs * , ch * , θ * ) becomes slightly easier than the case of undirected graphs.More formally we prepare the following.

(i)
In the scheme graph (K, E ), denote the edges in Based on these, we include constraints with some more additional variables so that a selected subgraph H is a connected rank-2 graph.See constraints Equations (A10) to (A42) in Appendix A for the details.
In the constraints of 3, we introduce a variable for each descriptor and constraints with some more variables to compute the value of each descriptor in f (G) for a selected chemical graph G. See constraints Equations (A62) to (A113) in Appendix A for the details.
With constraints 1 to 3, our MILP formulation already represents a rank-2 chemical graph G and a feature vector x ∈ R k so that x = f (G) holds.In the constraints of 4, we include some additional constraints so that the search space required for an MILP solver to solve an instance of our MILP problem is reduced.For this, we consider a graph-isomorphism of rooted subtrees of each tree S s or T s and define a canonical form among subtrees that are isomorphic to each other.We try to eliminate a chemical graph G that has a subtree in S s or T s that is not a canonical form.See constraints Equations (A114) to (A119) in Appendix A for the details.

Results
We implemented our method of Steps 1 to 5 for inferring rank-2 chemical graphs and conducted experiments to evaluate the computational efficiency for three chemical properties π: octanol/water partition coefficient (KOW), melting point (MP), and boiling point (BP).We executed the experiments on a PC with Intel Core i5 1.6 GHz CPU and 8GB of RAM running under the Mac OS operating system version 10.14.6.We show 2D drawings of some of the inferred chemical graphs, where ChemDoodle version 10.2.0 is used for constructing the drawings.

Results on Phase 1.
Step 1.We set a graph class G to be the set of all rank-2 chemical graphs.For each property π ∈ {KOW, MP, BP}, we select a set Λ of chemical elements and collected a data set D π on rank-2 chemical graphs over Λ provided by HSDB from PubChem.To construct the data set, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as N + or an element a ∈ Λ in which the valence is different from our setting of valence function val.
Table 1 shows the size and range of data sets that we prepared for each chemical property in Step 1, where we denote the following:  Step 2. We used a feature function f that consists of the descriptors defined in Section 2.1.
Step 3. We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs N where the tool and activation function are set to be MLPRegressor and ReLU, respectively.We tested several different architectures of ANNs for each chemical property.To evaluate the performance of the resulting prediction function ψ N with cross-validation, we partition a given data set D π into five subsets D π is used for a test set in five trials i ∈ [1,5].For a set {y 1 , y 2 , . . ., y N } of observed values and a set {ψ 1 , ψ 2 , . . ., ψ N } of predicted values, we define the coefficient of determination to be R 2  For each chemical property π, we selected the ANN N that attained the best test R 2 score among the five ANNs to formulate an MILP M(x, y, z; C 1 ) in the second phase.We implemented Steps 4 and 5 in Phase 2 as follows.
Step 4. In this step, we solve the MILP M(x, y, g; C 1 , C 2 ) formulated based on the ANN N obtained in Phase 1.To solve an MILP in Step 4, we use CPLEX version 12.10.In our experiment, we choose a target value y * ∈ [a, a] and fix or bound some descriptors in our feature vector as follows: -Fix variable θ that represents the polymer parameter θ(H) to be each integer in {−2, 0, 2}; -Set d max to be each of 3 and 4; -Fix n * to be some four integers in {15, 19, 20, 25, 30} for θ ∈ {−2, 0} and {15, 19, 20, 22, 25} for θ = 2; -Choose three integers from [7,16] and fix cs * to be each of the three integers; -Fix ch * to be each of the four integers in [2,5].
Based on the above setting, we generated 12 instances for each n * .We set ε = 0.02 in Step 4. Tables 3  Figure 11a-c illustrate some rank-2 chemical graphs G * with θ(G * ) = 0 constructed from the vector g * obtained by solving the MILP in Step 4.

Appendix A.2. Construction of Scheme Graph and Tree-Extension
We infer a subgraph H such that the maximum degree is d max ∈ {3, 4}, n(H) = n * , cs(H) = cs * and ch(H) = ch * .For this, we first construct the (t * , ch * , d max )-tree-extension of the scheme graph (K = (V K = {u 1 , . . ., u s * }, E K = {a 1 , a 2 , . . ., a m }), E = (E 1 , E 2 , E 3 )).We use the following notations: For j ∈ [1, 3] and s ∈ [1, s * ], let E + j (s) (resp., E − j (s)) denote the set of indices i of edges a i ∈ E i such that the tail (resp., head) of a i is u s,1 .Let E + j,k (s) E + j (s) ∪ E + k (s), E − j,k (s) E − j (s) ∪ E − k (s), E j (s) E + j (s) ∪ E − j (s) and E j,k (s) E j (s) ∪ E k (s).As described in Section 2.3.1, some edge a(i) ∈ E 1 ∪ E 2 may be replaced with a subpath P i of (v 1,1 , v 1,2 , . . ., v t * ,1 ), which consists of the roots of trees T 1 , T 2 , . . ., T t * .We assign color i to the vertices in such a subpath P i by setting a variable χ(t) of each vertex v t,1 ∈ V(P i ) to be i.For each edge u s,1 v t,1 , we prepare a binary variable e(s, t) to denote that edge u s,1 v t,1 is used (resp., not used) in a selected graph H when e(s, t) = 1 (resp., e(s, t) = 0).We also include constraints necessary for the variables to satisfy a degree condition at each of the vertices u s,1 , s ∈ [1, s * ] and v t,1 , t ∈ [1, t * ]. variables: a(i) ∈ {0, 1}, i ∈ E 1 ∪ E 3 : a(i) represents edge a i ∈ E 1 ∪ E 3 (a(i) = 1, i ∈ E 1 ) (a(i) = 1 ⇔ edge a i is used in H);
]): the number of vertices of degree i in G; -ce co a (G) (a ∈ Λ): the number of core vertices with chemical element a ∈ Λ; -ce nc a (G) (a ∈ Λ): the number of non-core vertices with chemical element a ∈ Λ; -ms(G): the average of mass * of atoms in G; -b co k (G) (k ∈ [2, 3]): the number of double and triple bonds in core edges; -b nc k (G) (k ∈ [2, 3]): the number of double and triple bonds in non-core edges; -ac co γ (G) (γ = (a, b, k) ∈ Γ): the number of adjacency-configurations (a, b, k) of core edges; -ac nc γ (G) (γ = (a, b, k) ∈ Γ): the number of adjacency-configurations (a, b, k) of non-core edges; θ(H): the topological parameter of H; and n H (G): the number of hydrogen atoms to be included in G; i.e., n H (G) = ∑ a∈Λ val(a)(ce co a

Figure 3 .
Figure 3.An illustration of Step 1: A data set D π of chemical graphs G i , i = 1, 2, . . ., m in a class G of graphs whose values a(G i ) ∈ [a, a] of a chemical property π are available.

Figure 4 .
Figure 4.An illustration of Step 2: Each chemical graph G ∈ G is mapped to a vector f (G) in a feature vector space R k for some positive integer k.

Figure 5 .
Figure 5.An illustration of Step 3: A prediction function ψ N from the feature vector space R k to the range [a, a] is constructed based on an ANN N .
for a tolerance ε set to be a small positive real.See Figure6for an illustration of Step 4.

2 Figure 7 .
Figure 7.An illustration of Step 5: For each vector x * ∈ F * , all chemical graphs G * ∈ G such that f (G * ) = x * are generated.

Figure 8 .
Figure 8.An illustration of a tree-extension, where the vertices in V(K) are depicted with gray circles: (a) The structure of the rooted tree S s rooted at a vertex u s,1 ; (b) the structure of the rooted tree T t rooted at a vertex v t,1 ; (c) the (a, b, c)-tree-extension of the scheme graph in Figure 2d for a = t * = 3, b = ch * = 2 and c = d max = 4.

- π :
one of the chemical properties KOW, MP and BP; -|D π |: the size of data set D π for property π; -Λ: the set of chemical elements over data set D π (hydrogen atoms are added at the final stage); -|Γ|: the number of tuples in Γ; -[n, n]: the minimum and maximum number n(G) of non-hydrogen atoms over data set D π ; -[cs, cs], [ch, ch]: the minimum and maximum core size and core height over chemical compounds in D π , respectively; -[θ, θ]: the minimum and maximum values of the topological parameter θ(G) over data set D π ; and -[a, a]: the minimum and maximum values of a(G) in π over data set D π .
∈ [1, 5] randomly, where D π \ D (i) π is used for a training set and D (i) -8 show the results of Step 4 for d max = 3 and 4, respectively, where we denote the following: -y * π : a target value in [a, a] for a property π; -n * : a specified number of vertices in [n, n]; -|F * |/#I: #I means the number of MILP instances in Step 4 (where #I = 12), and |F * | means the size of set F * of vectors x * generated from all feasible instances among the #I MILP instances in Step 4; -IP-time: the average time (sec.) to solve one of the #I MILP instances to find a set F * of vectors x * .

Figure 10a -
Figure 10a-c illustrate some rank-2 chemical graphs G * with θ(G * ) = −2 constructed from the vector g * obtained by solving the MILP in Step 4.Figure11a-c illustrate some rank-2 chemical graphs G * with θ(G * ) = 0 constructed from the vector g * obtained by solving the MILP in Step 4.

Figure 12a -
Figure 12a-c illustrate some rank-2 chemical graphs G * with θ(G * ) = 2 constructed from the vector g * obtained by solving the MILP in Step 4.

Table 1 .
Results of Step 1 in Phase 1.

Table 2 .
Results of Steps 2 and 3 in Phase 1.
A Self-archived copy in Kyoto University Research Information Repository https://repository.kulib.kyoto-u.ac.jp