Article

A Novel Method for Inference of Chemical Compounds of Cycle Index Two with Desired Properties Based on Artificial Neural Networks and Integer Programming

1
Department of Applied Mathematics and Physics, Kyoto University, Kyoto 606-8501, Japan
2
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Algorithms 2020, 13(5), 124; https://doi.org/10.3390/a13050124
Submission received: 22 April 2020 / Revised: 13 May 2020 / Accepted: 13 May 2020 / Published: 18 May 2020
(This article belongs to the Special Issue 2020 Selected Papers from Algorithms Editorial Board Members)

Abstract

Inference of chemical compounds with desired properties is important for drug design, chemo-informatics, and bioinformatics, to which various algorithmic and machine learning techniques have been applied. Recently, a novel method has been proposed for this inference problem using both artificial neural networks (ANN) and mixed integer linear programming (MILP). This method consists of the training phase and the inverse prediction phase. In the training phase, an ANN is trained so that the output of the ANN takes a value nearly equal to a given chemical property for each sample. In the inverse prediction phase, a chemical structure is inferred using MILP and enumeration so that the structure can have a desired output value for the trained ANN. However, the framework has been applied only to the case of acyclic and monocyclic chemical compounds so far. In this paper, we significantly extend the framework and present a new method for the inference problem for rank-2 chemical compounds (chemical graphs with cycle index 2). The results of computational experiments using such chemical properties as octanol/water partition coefficient, melting point, and boiling point suggest that the proposed method is much more useful than the previous method.

1. Introduction

Inference of chemical compounds with desired properties is important for computer-aided drug design. Since drug design is one of the major targets of chemo-informatics and bioinformatics, it is also important in these areas. Indeed, this problem has been extensively studied in chemo-informatics under the name of inverse QSAR/QSPR [1,2], where QSAR/QSPR denotes Quantitative Structure Activity/Property Relationships. Since chemical compounds are usually represented as undirected graphs, this problem is important also from graph theoretic and algorithmic viewpoints.
Inverse QSAR/QSPR is often formulated as an optimization problem to find a chemical graph maximizing (or minimizing) an objective function under various constraints, where objective functions reflect certain chemical activities or properties. In many cases, objective functions are derived from a set of training data consisting of known molecules and their activities/properties using statistical and machine learning methods.
In both forward and inverse QSAR/QSPR, chemical graphs are often represented as vectors of real or integer numbers because it is difficult to directly handle graphs using statistical and machine learning methods. Elements of these vectors are called descriptors in QSAR/QSPR studies, and these vectors correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding optimal or nearly optimal graph structures under given objective functions [1,3,4]. In many cases, inference or enumeration of graph structures from a given feature vector is a crucial subtask in these methods. Various methods have been developed for this enumeration problem [5,6,7,8] and the computational complexity of the inference problem has been analyzed [9,10,11]. On the other hand, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S may exceed $10^{60}$ [12].
As in many other fields, Artificial Neural Network (ANN) and deep learning technologies have recently been applied to inverse QSAR/QSPR. For example, variational autoencoders [13], recurrent neural networks [14,15], and grammar variational autoencoders [16] have been applied. In these approaches, new chemical graphs are generated by solving a kind of inverse problem on neural networks, where the neural networks are trained using known chemical compound/activity pairs. However, the optimality of the solution is not necessarily guaranteed in these approaches. In order to guarantee optimality, a novel approach has been proposed [17] for ANNs with ReLU activation functions and sigmoid activation functions, using mixed integer linear programming (MILP). In that approach, activation functions on neurons are efficiently encoded as piecewise-linear functions so as to represent ReLU functions exactly and sigmoid functions approximately.
Recently, a new framework has been proposed [18,19,20] by combining two previous approaches: efficient enumeration of tree-like graphs [5], and MILP-based formulation of the inverse problem on ANNs [17]. This combined framework for inverse QSAR/QSPR mainly consists of two phases, one for constructing a prediction function for a chemical property, and the other for constructing graphs based on the inverse of the prediction function. The first phase solves (I) Prediction Problem, where a prediction function $\psi_{\mathcal{N}}$ on a chemical property $\pi$ is constructed with an ANN $\mathcal{N}$ using a data set of chemical compounds $G$ and their values $a(G)$ of $\pi$. The second phase solves (II) Inverse Problem, where (II-a) given a target value $y^*$ of the chemical property $\pi$, a feature vector $x^*$ is inferred from the trained ANN $\mathcal{N}$ so that $\psi_{\mathcal{N}}(x^*)$ is close to $y^*$ and (II-b) then a set of chemical structures $G^*$ such that $f(G^*) = x^*$ is enumerated. In (II-b) of the above-mentioned previous methods [18,19,20], an MILP is formulated for acyclic chemical compounds. Their methods were applicable only to acyclic chemical graphs (i.e., tree-structured chemical graphs), where the ratio of acyclic chemical graphs in a major chemical database (PubChem) is 2.91%. Afterward, Ito et al. [21] designed a method of inferring monocyclic chemical graphs (chemical graphs with cycle index or rank 1) by formulating a new MILP and using an efficient algorithm for enumerating monocyclic chemical graphs [22]. This still leaves a big limitation because the ratio of acyclic and monocyclic chemical graphs in the chemical database PubChem is only 16.26%.
To break this limitation, we significantly extend the MILP-based approach for inverse QSAR/QSPR so that “rank-2 chemical compounds” (chemical graphs with cycle index or rank 2) can be efficiently handled, where the ratio of chemical graphs with rank at most 2 in the database PubChem is 44.5%. Note that there are three different topological structures, called polymer-topologies over all rank-2 chemical compounds. In particular, we propose a novel MILP formulation for (II-a) along with a new set of descriptors. One big advantage of this new formulation is that an MILP instance has a solution if and only if there exists a rank-2 chemical graph satisfying given constraints, which is useful to significantly reduce redundant search in (II-b). We conducted computational experiments to infer rank-2 chemical compounds on several chemical properties.
The paper is organized as follows. Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors. Section 2.2 reviews the framework for inferring chemical compounds based on ANNs and MILPs. Section 2.3 introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph $G$ of $n$ vertices, where our MILP requires only $O(n)$ variables and constraints when the maximum height of subtrees in $G$ is constant. Section 3 reports the results of some computational experiments conducted for chemical properties such as octanol/water partition coefficient, melting point, and boiling point. Section 4 makes some concluding remarks. Appendix A provides the details of all variables and constraints in our MILP formulation.

2. Materials and Methods

2.1. Preliminary

This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.

2.1.1. Multigraphs and Graphs

Let $\mathbb{R}$ and $\mathbb{Z}$ denote the sets of reals and non-negative integers, respectively. For two integers $a$ and $b$, let $[a, b]$ denote the set of integers $i$ with $a \le i \le b$.

Multigraphs

A multigraph is defined to be a pair $(V, E)$ of a vertex set $V$ and an edge set $E$ such that each edge $e \in E$ joins two vertices $u, v \in V$ (possibly $u = v$); the vertices $u$ and $v$ are called the end-vertices of the edge $e$. Let $V(e)$ denote the set of the end-vertices of an edge $e \in E$, where an edge $e$ with $|V(e)| = 1$ is called a loop. We denote the vertex and edge sets of a multigraph $M$ by $V(M)$ and $E(M)$, respectively. A path with end-vertices $u$ and $v$ is called a $u,v$-path, and the length of a path is defined to be the number of edges in the path.
Let $M$ be a multigraph. An edge $e \in E(M)$ is called multiple (to an edge $e' \in E(M)$) if there is another edge $e' \in E(M)$ with $V(e) = V(e')$. For a vertex $v \in V(M)$, the set of neighbors of $v$ in $M$ is denoted by $N_M(v)$, and the degree $\deg_M(v)$ of $v$ is defined to be the number of times an edge in $E(M)$ is incident to $v$; i.e., $\deg_M(v) = |\{e \in E(M) \mid v \in V(e), |V(e)| = 2\}| + 2\,|\{e \in E(M) \mid v \in V(e), |V(e)| = 1\}|$. A multigraph is called simple if it has no loop and there is at most one edge between any two vertices. We observe that the sum of the degrees over all vertices is twice the number of edges in any multigraph $M$; i.e.,
$$2|E(M)| = \sum_{v \in V(M)} \deg_M(v).$$
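As a quick check of the degree convention (a loop counts twice) and of the identity above, the following minimal Python sketch computes degrees for a small multigraph. The function name `degrees` and the example edge list are hypothetical and chosen only for illustration.

```python
from collections import Counter

def degrees(edges):
    """Return deg_M(v) for a multigraph given as a list of (u, v) pairs.

    A pair with u == v is a loop and contributes 2 to the degree of u.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1  # for a loop this adds the second unit to the same vertex
    return dict(deg)

# Hypothetical multigraph: a loop at 'a', two parallel edges a-b, one edge b-c.
edges = [("a", "a"), ("a", "b"), ("a", "b"), ("b", "c")]
deg = degrees(edges)

# The sum of the degrees equals twice the number of edges.
assert sum(deg.values()) == 2 * len(edges)
```

Representing the multigraph as a plain edge list keeps parallel edges and loops explicit, which a set-based adjacency structure would silently merge.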
For a subset $X$ of vertices in $M$, let $M - X$ denote the multigraph obtained from $M$ by removing the vertices in $X$ and any edge incident to a vertex in $X$. An operation of subdividing a non-loop edge (resp., loop) $e \in E(M)$ with $V(e) = \{v_1, v_2\}$ (resp., $V(e) = \{v_1 = v_2\}$) is to replace $e$ with two new edges $e_1$ and $e_2$ incident to a new vertex $v_e$ such that each $e_i$ is incident to $v_i$. An operation of contracting a vertex $u$ of degree 2 in $M$ is to replace the two edges $uv$ and $uv'$ incident to $u$ with a single edge $vv'$, removing vertex $u$, where the resulting edge is a loop when $v = v'$.
The rank $r(M)$ of a multigraph $M$ is defined to be the minimum number of edges to be removed to make the multigraph acyclic. We call a multigraph $M$ with $r(M) = k$ a rank-$k$ graph. Let $V_{\deg,i}(M)$ denote the set of vertices of degree $i$ in $M$.
The core $\mathrm{Cr}(M)$ of $M$ is defined to be an induced subgraph $M^*$ that is obtained from $M^* := M$ by setting $M^* := M^* - V_{\deg,1}(M^*)$ repeatedly until $M^*$ contains at most two vertices or consists of vertices of degree at least 2. The core $M^*$ of a connected multigraph $M$ consists of a single vertex (resp., two vertices) if and only if $M$ is a tree with an even (resp., odd) diameter. A vertex (resp., an edge) in $M$ is called a core vertex (resp., core edge) if it is contained in the core of $M$ and is called a non-core vertex (resp., non-core edge) otherwise. The core size $\mathrm{cs}(M)$ is defined to be the number of core vertices of $M$, and the core height $\mathrm{ch}(M)$ is defined to be the maximum length of a path from a vertex $v \in V(M^*)$ to a leaf of $M$ without passing through any core edge. The set of non-core edges induces a collection of subtrees, each of which we call a non-core component of $M$, where each non-core component $C$ contains exactly one core vertex $v_C$ and we regard $C$ as a tree rooted at $v_C$. Let $C$ be a non-core component of $M$. The height $\mathrm{height}(v)$ of a vertex $v$ in $C$ is defined to be the maximum length of a path from $v$ to a leaf $u$ in the descendants of $v$.
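The rank, core, core size and core height can be computed directly from these definitions. The sketch below is a minimal illustration restricted to connected simple graphs, where the rank equals $|E| - |V| + 1$; the helper names (`make_adj`, `rank`, `core`, `core_height`) and the example graph are assumptions made for this illustration and are not taken from the paper.

```python
from collections import deque

def make_adj(edges):
    """Adjacency sets of a simple graph given as a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def rank(adj):
    """Cycle rank |E| - |V| + 1 of a connected simple graph."""
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    return n_edges - len(adj) + 1

def core(adj):
    """Vertices left after repeatedly deleting all current degree-1 vertices."""
    alive = set(adj)
    while len(alive) > 2:
        leaves = {v for v in alive if len(adj[v] & alive) == 1}
        if not leaves:
            break  # every remaining vertex has degree at least 2
        alive -= leaves
    return alive

def core_height(adj, core_vertices):
    """Largest distance from the core to any vertex (multi-source BFS)."""
    dist = {v: 0 for v in core_vertices}
    queue = deque(core_vertices)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

# Hypothetical rank-2 graph: a 4-cycle 1-2-3-4 with chord 1-3,
# plus a pendant path 2-5-6 hanging off the core.
adj = make_adj([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3), (2, 5), (5, 6)])
core_vertices = core(adj)
cs = len(core_vertices)
ch = core_height(adj, core_vertices)
```

Because every non-core component is a tree attached to exactly one core vertex, the breadth-first distance to the nearest core vertex coincides with the depth inside that component, so the maximum distance is the core height.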
A multigraph is called a polymer topology if it is connected and the degree of every vertex is at least 3. Tezuka and Oike [23] pointed out that a classification of polymer topologies will lay a foundation for elucidation of structural relationships between different macro-chemical molecules and their synthetic pathways. For integers $r \ge 0$ and $d \ge 3$, let $\mathcal{PT}(r, d)$ denote the set of all rank-$r$ polymer topologies with maximum degree at most $d$. Figure 1 illustrates the three rank-2 polymer topologies in $\mathcal{PT}(2, 4)$.
For a polymer topology $M$, the least simple graph $S(M)$ of $M$ is defined to be a simple graph obtained from $M$ by subdividing each loop in $M$ with two new vertices of degree 2 and subdividing all multiple edges (except for one) between every two adjacent vertices in $M$. Note that $|V(S(M))| = |V(M)| + r + s$ for the rank $r$ of $M$ and the number $s$ of loops in $M$.
The polymer topology $\mathrm{Pt}(M)$ of a multigraph $M$ with $r(M) \ge 2$ is defined to be a multigraph $M'$ of degree at least 3 that is obtained from the core $\mathrm{Cr}(M)$ by contracting all vertices of degree 2. Note that $r(\mathrm{Pt}(M)) = r(M)$. Figure 2a–c illustrate the least simple graph $S(M)$ of each polymer topology $M \in \mathcal{PT}(2, 4)$, where Figure 2d illustrates a graph that contains all least simple graphs.

Graphs

Let $H = (V, E)$ be a graph with a set $V$ of vertices and a set $E$ of edges. Define the 1-path connectivity $\kappa_1(H)$ of $H$ to be $\sum_{uv \in E} 1/\sqrt{\deg_H(u)\deg_H(v)}$.
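A minimal Python sketch of the 1-path connectivity follows. It assumes the usual form of this index, with a square root over the product of the two end-vertex degrees; the function name and the two example graphs are hypothetical.

```python
from math import sqrt

def path_connectivity_1(edges):
    """Sum of 1 / sqrt(deg(u) * deg(v)) over the edges uv of a simple graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(1.0 / sqrt(deg[u] * deg[v]) for u, v in edges)

# Path on three vertices: both edges join a degree-1 and a degree-2 vertex.
kappa_path = path_connectivity_1([(1, 2), (2, 3)])

# Cycle on four vertices: every edge joins two degree-2 vertices.
kappa_cycle = path_connectivity_1([(1, 2), (2, 3), (3, 4), (4, 1)])
```

For the path each edge contributes $1/\sqrt{2}$, and for the cycle each edge contributes $1/2$, which makes the two values easy to verify by hand.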
Let $H$ be a rank-2 connected graph such that the maximum degree is at most 4. We see that $H$ contains two vertices $v_a$ and $v_b$ such that either there are three disjoint paths between $v_a$ and $v_b$ or $H$ contains two edge-disjoint cycles $C$ and $C'$, which are joined with a path between $v_a$ and $v_b$ (possibly $v_a = v_b$). We introduce the topological parameter $\theta(H)$ of a rank-2 connected graph $H$ as follows. When $H$ has three disjoint paths between $v_a$ and $v_b$, define $\theta(H)$ to be the minimum number of edges along a path between $v_a$ and $v_b$. When $H$ contains two edge-disjoint cycles $C$ and $C'$, which are joined with a path $P$ between $v_a$ and $v_b$ (possibly $v_a = v_b$), define $\theta(H)$ to be $|E(P)|$.
For positive integers $a$, $b$ and $c$ with $b \ge 2$, let $T(a, b, c)$ denote the rooted tree such that the number of children of the root is $a$, the number of children of each non-root internal vertex is $b$ and the distance from the root to each leaf is $c$. In the rooted tree $T(a, b, c)$, we denote the vertices by $v_1, v_2, \ldots, v_n$ ($n = a(b^c - 1)/(b - 1) + 1$) with a breadth-first-search order, and denote the edge between a vertex $v_i$ with $i \in [2, n]$ and its parent by $e_i$. For each vertex $v_i$ in $T(a, b, c)$, let $\mathrm{Cld}(i)$ denote the set of indices $j$ such that $v_j$ is a child of $v_i$, and $\mathrm{prt}(i)$ denote the index $j$ such that $v_j$ is the parent of $v_i$ when $i \in [2, n]$.
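The indexing of $T(a, b, c)$ can be made concrete with a short Python sketch that numbers the vertices level by level, which is a breadth-first-search order, and records the parent and children of each index. The function name `rooted_tree` and the returned dictionaries `prt` and `cld` are illustrative names mirroring $\mathrm{prt}(i)$ and $\mathrm{Cld}(i)$.

```python
def rooted_tree(a, b, c):
    """Return (n, prt, cld) for T(a, b, c) with vertices 1..n in BFS order."""
    prt = {}        # prt[i]: index of the parent of v_i, defined for i >= 2
    cld = {1: []}   # cld[i]: indices of the children of v_i
    level = [1]     # indices of the vertices at the current depth
    n = 1
    for depth in range(c):
        fanout = a if depth == 0 else b  # the root has a children, others b
        next_level = []
        for i in level:
            for _ in range(fanout):
                n += 1
                prt[n] = i
                cld[i].append(n)
                cld[n] = []
                next_level.append(n)
        level = next_level
    return n, prt, cld

# T(2, 3, 2): a root with 2 children, each of which has 3 leaf children.
n, prt, cld = rooted_tree(2, 3, 2)
```

The vertex count returned by the construction can be compared against the closed form $n = a(b^c - 1)/(b - 1) + 1$, which gives 9 for $T(2, 3, 2)$.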

2.1.2. Modeling of Chemical Compounds

Chemical Graphs

We represent the graph structure of a chemical compound as a graph with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. Nearly 68.5% (resp., 99%) of the rank-2 chemical graphs with at most 200 non-hydrogen atoms registered in chemical database PubChem have a maximum degree at most 3 (resp., 4) for all non-core vertices in the hydrogen-suppressed model.
Let $\Lambda$ be a set of labels, each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that $\Lambda$ does not contain H (hydrogen). Let $\mathrm{mass}(a)$ and $\mathrm{val}(a)$ denote the mass and valence of a chemical element $a \in \Lambda$, respectively. In our model, we use integers $\mathrm{mass}^*(a) = \lfloor 10 \cdot \mathrm{mass}(a) \rfloor$, $a \in \Lambda$, and assume that each chemical element $a \in \Lambda$ has a unique valence $\mathrm{val}(a) \in [1, 4]$.
We introduce a total order $<$ over the elements in $\Lambda$ according to their mass values; i.e., we write $a < b$ for chemical elements $a, b \in \Lambda$ with $\mathrm{mass}(a) < \mathrm{mass}(b)$. Choose a set $\Gamma_<$ of tuples $\gamma = (a, b, k) \in \Lambda \times \Lambda \times [1, 3]$ such that $a < b$. For a tuple $\gamma = (a, b, k) \in \Lambda \times \Lambda \times [1, 3]$, let $\overline{\gamma}$ denote the tuple $(b, a, k)$. Set $\Gamma_> = \{\overline{\gamma} \mid \gamma \in \Gamma_<\}$, $\Gamma_= = \{(a, a, k) \mid a \in \Lambda, k \in [1, 3]\}$ and $\Gamma = \Gamma_< \cup \Gamma_=$. A pair of two atoms $a$ and $b$ joined with a bond of multiplicity $k$ is denoted by a tuple $\gamma = (a, b, k) \in \Gamma$, called the adjacency-configuration of the atom pair.
We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage. A chemical graph over $\Lambda$ and $\Gamma$ is defined to be a tuple $G = (H, \alpha, \beta)$ of a graph $H = (V, E)$, a function $\alpha : V \to \Lambda$ and a function $\beta : E \to [1, 3]$ such that
(i) $H$ is connected;
(ii) $\sum_{uv \in E} \beta(uv) \le \mathrm{val}(\alpha(u))$ for each vertex $u \in V$; and
(iii) $(\alpha(u), \alpha(v), \beta(uv)) \in \Gamma$ for each edge $uv \in E$.
Let $\mathcal{G}(\Lambda, \Gamma)$ denote the set of chemical graphs over $\Lambda$ and $\Gamma$.
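Conditions (i) to (iii) translate into a short validity check. The following Python sketch is only an illustration under stated assumptions: the valence table `VAL`, the dictionary encoding of $\alpha$ and $\beta$, and the rule that a configuration is accepted in either orientation (because edges are undirected) are choices made here, not taken from the paper.

```python
VAL = {"C": 4, "N": 3, "O": 2}  # valences assumed for this illustration

def is_chemical_graph(alpha, beta, gamma, val=VAL):
    """Check conditions (i)-(iii).

    alpha: dict vertex -> element; beta: dict (u, v) -> multiplicity in 1..3;
    gamma: set of allowed adjacency-configurations (a, b, k).
    """
    vertices = set(alpha)
    adj = {v: set() for v in vertices}
    load = {v: 0 for v in vertices}  # sum of bond multiplicities at each vertex
    for (u, v), k in beta.items():
        adj[u].add(v)
        adj[v].add(u)
        load[u] += k
        load[v] += k
        a, b = alpha[u], alpha[v]
        if (a, b, k) not in gamma and (b, a, k) not in gamma:
            return False  # condition (iii) fails
    if any(load[v] > val[alpha[v]] for v in vertices):
        return False      # condition (ii) fails
    # Condition (i): connectivity by depth-first search.
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in adj[u] - seen:
            seen.add(w)
            stack.append(w)
    return seen == vertices

# Hydrogen-suppressed acetic acid: C1-C2, C2=O3, C2-O4.
alpha = {1: "C", 2: "C", 3: "O", 4: "O"}
beta = {(1, 2): 1, (2, 3): 2, (2, 4): 1}
gamma = {("C", "C", 1), ("C", "O", 1), ("C", "O", 2)}
ok = is_chemical_graph(alpha, beta, gamma)
```

Raising the multiplicity of the bond between vertices 2 and 4 to 2 would give the central carbon a bond sum of 5, which violates condition (ii).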

Descriptors

In our method, we use only graph-theoretical descriptors for defining a feature vector, which facilitates our designing an algorithm for constructing graphs. Given a chemical graph $G = (H, \alpha, \beta)$, we define a feature vector $f(G)$ that consists of the following 14 kinds of descriptors:
- $n(G)$: the number of vertices in $G$;
- $\mathrm{cs}(G)$: the core size of $G$;
- $\mathrm{ch}(G)$: the core height of $G$;
- $\kappa_1(G)$: the 1-path connectivity of $G$;
- $\mathrm{dg}_i(G)$ ($i \in [1, 4]$): the number of vertices of degree $i$ in $G$;
- $\mathrm{ce}_a^{\mathrm{co}}(G)$ ($a \in \Lambda$): the number of core vertices with chemical element $a \in \Lambda$;
- $\mathrm{ce}_a^{\mathrm{nc}}(G)$ ($a \in \Lambda$): the number of non-core vertices with chemical element $a \in \Lambda$;
- $\overline{\mathrm{ms}}(G)$: the average of $\mathrm{mass}^*$ of atoms in $G$;
- $\mathrm{b}_k^{\mathrm{co}}(G)$ ($k \in [2, 3]$): the number of double and triple bonds in core edges;
- $\mathrm{b}_k^{\mathrm{nc}}(G)$ ($k \in [2, 3]$): the number of double and triple bonds in non-core edges;
- $\mathrm{ac}_\gamma^{\mathrm{co}}(G)$ ($\gamma = (a, b, k) \in \Gamma$): the number of adjacency-configurations $(a, b, k)$ of core edges;
- $\mathrm{ac}_\gamma^{\mathrm{nc}}(G)$ ($\gamma = (a, b, k) \in \Gamma$): the number of adjacency-configurations $(a, b, k)$ of non-core edges;
- $\theta(H)$: the topological parameter of $H$; and
- $n_{\mathrm{H}}(G)$: the number of hydrogen atoms to be included in $G$; i.e.,
$$n_{\mathrm{H}}(G) = \sum_{a \in \Lambda} \mathrm{val}(a)\bigl(\mathrm{ce}_a^{\mathrm{co}}(G) + \mathrm{ce}_a^{\mathrm{nc}}(G)\bigr) - 2\bigl(n(G) + 1 + \mathrm{b}_2^{\mathrm{co}}(G) + \mathrm{b}_2^{\mathrm{nc}}(G) + 2\,\mathrm{b}_3^{\mathrm{co}}(G) + 2\,\mathrm{b}_3^{\mathrm{nc}}(G)\bigr).$$
The number $k$ of descriptors in our feature vector $x = f(G)$ is $k = 2|\Lambda| + 2|\Gamma| + 15$.
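The hydrogen count and the descriptor count are simple enough to check numerically. The sketch below assumes a rank-2 graph, which has $n(G) + 1$ edges, so that the sum of bond multiplicities is $n(G) + 1$ plus one for each double bond and two for each triple bond. The function names and the two test molecules (decalin and naphthalene, both with ten core carbons and no non-core vertices) are choices made for illustration.

```python
def hydrogen_count(val, ce_co, ce_nc, n, b2_co=0, b2_nc=0, b3_co=0, b3_nc=0):
    """Number of hydrogens of a rank-2 chemical graph with n vertices."""
    total_valence = sum(
        val[a] * (ce_co.get(a, 0) + ce_nc.get(a, 0)) for a in val
    )
    # A rank-2 graph has n + 1 edges; double/triple bonds add 1/2 extra units.
    bond_sum = (n + 1) + (b2_co + b2_nc) + 2 * (b3_co + b3_nc)
    return total_valence - 2 * bond_sum

def num_descriptors(n_elements, n_configs):
    """k = 2|Lambda| + 2|Gamma| + 15."""
    return 2 * n_elements + 2 * n_configs + 15

val = {"C": 4, "N": 3, "O": 2}

# Decalin: ten core carbons, two fused rings, single bonds only.
h_decalin = hydrogen_count(val, {"C": 10}, {}, n=10)

# Naphthalene in a Kekule form: the same skeleton with five double bonds.
h_naphthalene = hydrogen_count(val, {"C": 10}, {}, n=10, b2_co=5)
```

The constant 15 in $k$ accounts for the descriptors that do not depend on $\Lambda$ or $\Gamma$: $n$, $\mathrm{cs}$, $\mathrm{ch}$, $\kappa_1$, four degree counts, the average mass, four bond counts, $\theta$ and $n_{\mathrm{H}}$.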

2.2. A Method for Inferring Chemical Graphs

This section reviews the framework that solves the inverse QSAR/QSPR by using MILPs [18]. For a specified chemical property $\pi$ such as boiling point, we denote by $a(G)$ the observed value of the property $\pi$ for a chemical compound $G$. In Phase 1, we solve (I) Prediction Problem with the following three steps.
Phase 1.
Step 1: Let $\mathrm{DB}$ be a set of chemical graphs. For a specified chemical property $\pi$, choose a class $\mathcal{G}$ of graphs such as acyclic graphs or monocyclic graphs. Prepare a data set $D_\pi = \{G_i \mid i = 1, 2, \ldots, m\} \subseteq \mathcal{G} \cap \mathrm{DB}$ such that the value $a(G_i)$ of each chemical graph $G_i$, $i = 1, 2, \ldots, m$, is available. Set reals $\underline{a}, \overline{a} \in \mathbb{R}$ so that $\underline{a} \le a(G_i) \le \overline{a}$, $i = 1, 2, \ldots, m$. See Figure 3 for an illustration of Step 1.
Step 2: Introduce a feature function $f : \mathcal{G} \to \mathbb{R}^k$ for a positive integer $k$. We call $f(G)$ the feature vector of $G \in \mathcal{G}$, and call each entry of a vector $f(G)$ a descriptor of $G$. See Figure 4 for an illustration of Step 2.
Step 3: Construct a prediction function $\psi_{\mathcal{N}}$ with an ANN $\mathcal{N}$ that, given a vector in $\mathbb{R}^k$, returns a real in the range $[\underline{a}, \overline{a}]$ so that $\psi_{\mathcal{N}}(f(G))$ takes a value nearly equal to $a(G)$ for many chemical graphs in $D_\pi$. See Figure 5 for an illustration of Step 3.
Next we explain how to solve the inverse problem to the prediction in Phase 1 using an MILP formulation. A vector $x \in \mathbb{R}^k$ is called admissible if there is a graph $G \in \mathcal{G}$ such that $f(G) = x$ [18]. Let $\mathcal{A}$ denote the set of admissible vectors $x \in \mathbb{R}^k$. In this paper, we use the range-based method to define an applicability domain (AD) [24] for our inverse QSAR/QSPR. Set $\underline{x_j}$ and $\overline{x_j}$ to be the minimum and maximum values of the $j$-th descriptor $x_j$ in $f(G_i)$ over all graphs $G_i$, $i = 1, 2, \ldots, m$ (where we possibly normalize some descriptors such as $\mathrm{ce}_a^{\mathrm{co}}(G)$, which is normalized with $\mathrm{ce}_a^{\mathrm{co}}(G)/n(H)$). Define our AD $\mathcal{D}$ to be the set of vectors $x \in \mathbb{R}^k$ such that $\underline{x_j} \le x_j \le \overline{x_j}$ for the variable $x_j$ of each $j$-th descriptor, $j = 1, 2, \ldots, k$. As the second phase, we solve (II) Inverse Problem for the inverse QSAR/QSPR by treating the following inference problems.
(II-a) Inference of Vectors
Input: A real $y^* \in [\underline{a}, \overline{a}]$.
Output: Vectors $x^* \in \mathcal{A} \cap \mathcal{D}$ and $g^* \in \mathbb{R}^h$ such that $\psi_{\mathcal{N}}(x^*) = y^*$ and $g^*$ forms a chemical graph $G^* \in \mathcal{G}$ with $f(G^*) = x^*$.
(II-b) Inference of Graphs
Input: A vector $x^* \in \mathcal{A} \cap \mathcal{D}$.
Output: All graphs $G^* \in \mathcal{G}$ such that $f(G^*) = x^*$.
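The range-based applicability domain used in these problem statements amounts to a per-descriptor bounding box over the training feature vectors. A minimal Python sketch, with hypothetical function names and made-up two-dimensional feature vectors, is:

```python
def applicability_domain(feature_vectors):
    """Per-descriptor minimum and maximum over the training feature vectors."""
    lower = [min(col) for col in zip(*feature_vectors)]
    upper = [max(col) for col in zip(*feature_vectors)]
    return lower, upper

def in_domain(x, lower, upper):
    """True if every descriptor of x lies within its training range."""
    return all(lo <= xj <= hi for lo, xj, hi in zip(lower, x, upper))

# Made-up training vectors with two descriptors each.
training = [[3, 0.5], [5, 0.2], [4, 0.9]]
lower, upper = applicability_domain(training)

inside = in_domain([4, 0.3], lower, upper)
outside = in_domain([6, 0.3], lower, upper)
```

In an MILP the same domain is imposed by one pair of linear bound constraints per descriptor variable, which is why it adds no integer variables.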
To treat Problem (II-a), we use MILPs for inferring vectors in ANNs [17]. In MILPs, we can easily impose additional linear constraints or fix some variables to specified constants. We include into the MILP a linear constraint such that $x \in \mathcal{D}$ to obtain the next result.
Theorem 1.
Let $\mathcal{N}$ be an ANN with a piecewise-linear activation function for an input vector $x \in \mathbb{R}^k$, $n_A$ denote the number of nodes in the architecture and $n_B$ denote the total number of break-points over all activation functions. Then there is an MILP $\mathcal{M}(x, y; \mathcal{C}_1)$ that consists of variable vectors $x \in \mathcal{D}$ $(\subseteq \mathbb{R}^k)$, $y \in \mathbb{R}$, and an auxiliary variable vector $z \in \mathbb{R}^p$ for some integer $p = O(n_A + n_B)$ and a set $\mathcal{C}_1$ of $O(n_A + n_B)$ constraints on these variables such that: $\psi_{\mathcal{N}}(x^*) = y^*$ if and only if there is a vector $(x^*, y^*)$ feasible to $\mathcal{M}(x, y; \mathcal{C}_1)$.
See Appendix A for the set of constraints to define our AD $\mathcal{D}$ in the MILP $\mathcal{M}(x, y; \mathcal{C}_1)$ in Theorem 1.
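To give a feeling for how a piecewise-linear activation can be expressed with linear constraints and a binary variable, the sketch below checks a textbook big-M encoding of a single ReLU unit by brute force on an integer grid. This is an illustrative assumption and not necessarily the encoding used in [17]; the bound `M`, the variable names and the grid are hypothetical.

```python
def relu_bigm_feasible(s, y, z, M):
    """Linear constraints tying output y to pre-activation s via binary z.

    Assumes |s| <= M. With z = 1 the constraints force y = s (and s >= 0);
    with z = 0 they force y = 0 (and s <= 0).
    """
    return y >= 0 and y >= s and y <= M * z and y <= s + M * (1 - z)

def feasible_outputs(s, M, candidates):
    """All candidate outputs y that are feasible for some binary z."""
    return sorted(
        {y for y in candidates for z in (0, 1) if relu_bigm_feasible(s, y, z, M)}
    )

M = 5
grid = range(-M, M + 1)

# For every pre-activation on the grid, the only feasible output is max(0, s).
encoding_is_exact = all(
    feasible_outputs(s, M, grid) == [max(0, s)] for s in grid
)
```

One binary variable and four linear inequalities per unit is consistent in spirit with the $O(n_A + n_B)$ size stated in Theorem 1, since a ReLU has a single break-point.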
To attain the admissibility of the inferred vector $x^*$, we also introduce a variable vector $g \in \mathbb{R}^q$ for some integer $q$ and a set $\mathcal{C}_2$ of constraints on $x$ and $g$ such that $x^* \in \mathcal{A}$ holds in the following sense: $(x^*, g^*)$ is feasible to the MILP $\mathcal{M}(x, g; \mathcal{C}_2)$ if and only if $g^*$ forms a chemical graph $G^* \in \mathcal{G}$ with $f(G^*) = x^*$. Phase 2 consists of the next two steps.
Phase 2.
Step 4: Formulate Problem (II-a) as the above MILP $\mathcal{M}(x, y, g; \mathcal{C}_1, \mathcal{C}_2)$ based on $\mathcal{G}$ and $\mathcal{N}$. Find a set $F^*$ of vectors $x^* \in \mathcal{A} \cap \mathcal{D}$ such that $(1 - \varepsilon) y^* \le \psi_{\mathcal{N}}(x^*) \le (1 + \varepsilon) y^*$ for a tolerance $\varepsilon$ set to be a small positive real. See Figure 6 for an illustration of Step 4.
Step 5: To solve Problem (II-b), enumerate all graphs $G^* \in \mathcal{G}$ such that $f(G^*) = x^*$ for each vector $x^* \in F^*$. See Figure 7 for an illustration of Step 5.
In this paper, we set a graph class $\mathcal{G}$ to be the set of rank-2 graphs. In Step 4, we solve an MILP $\mathcal{M}(x, g; \mathcal{C}_2)$ that is formulated on a novel idea of representing rank-2 chemical graphs, as will be discussed in Section 2.3.2. In Step 5, we use branch-and-bound algorithms for enumerating rank-2 chemical compounds [25,26].

2.3. Representing Rank-2 Chemical Graphs

This section introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices.

2.3.1. Scheme Graphs and Tree-Extensions

Given positive integers $n^*$ and $p$, a graph with $n^*$ vertices and $p$ edges can be represented as a subgraph of a complete graph $K_{n^*}$ with $n^*(n^* - 1)/2$ edges. However, formulating this as an MILP may require preparing $\Omega((n^*)^2)$ variables and constraints. To reduce the number of variables and constraints in an MILP that represents a rank-2 graph, we decompose a rank-2 graph $G$ into the core and non-core of $G$ so that the core is represented by one of the three rank-2 polymer topologies and the non-core is a collection of trees in which the height is bounded by the core height of $G$. We do not specify in advance how many subtrees will be attached to each edge in the polymer topology, since otherwise we would need a different MILP for each distinct combination of such assignments of subtrees. Instead we allow each edge in a polymer topology to collect a necessary number of subtrees in our MILP (see the next section for more detail). In this section, we introduce a “scheme graph” to represent three possible rank-2 polymer topologies, an “extension” of the scheme graph to represent the core of a rank-2 graph and a “tree-extension” to represent a combination of the core and non-core of a rank-2 graph, so that any of the three kinds of rank-2 polymer topologies can be selected in a single MILP formulation.

Scheme Graphs

Formally, we define the scheme graph for rank 2 to be a pair $(K, \mathcal{E})$ of a multigraph $K$ and an ordered partition $\mathcal{E} = (E_1, E_2, E_3)$ of the edge set $E(K)$. Figure 2d illustrates the scheme graph $(K = (\{u_1, u_2, u_3, u_4\}, E), \mathcal{E} = (E_1, E_2, E_3))$. An edge in $E_1$ is called a semi-edge, an edge in $E_2$ is called a virtual edge and an edge in $E_3$ is called a real edge.

Extensions of Scheme Graphs

Based on the scheme graph $(K, \mathcal{E})$, we construct the core of a rank-2 graph $H$ as an “extension,” which is defined as follows (see also Figure 8). An extension of the scheme graph $(K, \mathcal{E})$ is defined to be a simple graph obtained from $K$ by using each real edge $e = uv \in E_3$, and by eliminating or replacing each virtual edge $e = uv \in E_2$ (resp., replacing each semi-edge $e = uv \in E_1$) with a $u,v$-path of length at least two (resp., 1) in the core of $H$, where a $u,v$-path of length 1 means an edge $uv$. Figure 9a illustrates an extension $H^{\mathrm{core}}$ of the scheme graph $(K, \mathcal{E})$ which is obtained by removing virtual edges $a_4, a_5 \in E_2$, by replacing semi-edge $a_1 \in E_1$ with a path $(u_{1,1}, v_{1,1}, v_{2,1}, u_{4,1})$ and semi-edge $a_2 \in E_1$ with a path $(u_{2,1}, v_{3,1}, v_{4,1}, v_{5,1}, u_{3,1})$, and by using semi-edge $a_3 \in E_1$ and real edges $a_6, a_7 \in E_3$. The extension $H^{\mathrm{core}}$ in Figure 9a is isomorphic to the core of the rank-2 graph $H$ in Figure 9b. Observe that each of the least simple graphs $S(M_i)$, $i = 1, 2, 3$, in Figure 2 is obtained as an extension of the scheme graph $(K, \mathcal{E})$ in Figure 2d.

Tree-Extensions

Let $s^* = |V(K)| = 4$ denote the number of vertices in the scheme graph. For non-negative integers $a$, $b$ and $c$, we consider a rank-2 graph $H$ such that $\mathrm{cs}(H) = s^* + a = 4 + a$, $\mathrm{ch}(H) = b$ and the maximum degree of a core vertex is at most $c$. We define an “$(a, b, c)$-tree-extension” as a minimal supergraph of all such rank-2 graphs $H$. Formally, the $(a, b, c)$-tree-extension (or a tree-extension) is defined to be the graph obtained by augmenting the graph $K$ as follows:
(i) For each vertex $u_s \in V(K)$, $s \in [1, s^*]$, create a copy $S_s$ of the rooted tree $T(c - 2, c - 1, b)$. For each $s \in [1, s^*]$, let the root of rooted tree $S_s$ be equal to the vertex $u_s$ and denote by $u_{s,i}$ the copy of the $i$-th vertex of $T(c - 2, c - 1, b)$ in $S_s$ (see Figure 8a).
(ii) Create a new path $(v_{1,1}, v_{2,1}, \ldots, v_{a,1})$ with $a$ vertices, where the edge between $v_{t,1}$ and $v_{t+1,1}$ is denoted by $e_{t+1}$ (see Figure 8c). For each $t \in [1, a]$, create a copy $T_t$ of the rooted tree $T(c - 2, c - 1, b)$, let the root of rooted tree $T_t$ be equal to the vertex $v_{t,1}$ and denote by $v_{t,i}$ the copy of the $i$-th vertex of $T(c - 2, c - 1, b)$ in $T_t$ (see Figure 8b).
(iii) For every pair $(s, t)$ with $s \in [1, s^*]$ and $t \in [1, a]$, join vertices $u_{s,1}$ and $v_{t,1}$ with an edge $u_{s,1} v_{t,1}$ (see Figure 8c).
Figure 8 illustrates the $(3, 2, 4)$-tree-extension of the scheme graph. We show with an example how a rank-2 graph can be constructed as a subgraph of a tree-extension. Figure 9b illustrates a rank-2 graph $H$ with $n(H) = 21$, $\mathrm{cs}(H) = 9$, $\mathrm{ch}(H) = 2$ and $\theta(H) = 1$, where the maximum degree of a non-core vertex is 3. To prepare a tree-extension so that the graph $H$ can be a subgraph of the tree-extension, we set $\mathrm{cs}^* := \mathrm{cs}(H)$, $a := t^* := \mathrm{cs}^* - s^* = 5$, $b := \mathrm{ch}^* := \mathrm{ch}(H) = 2$ and $c := d_{\max} := 3$. Figure 9c illustrates a subgraph $H'$ of the $(t^* = 5, \mathrm{ch}^* = 2, d_{\max} = 3)$-tree-extension such that $H'$ is isomorphic to the rank-2 graph $H$.

2.3.2. MILPs for Rank-2 Chemical Graphs

We present an outline of our MILP $\mathcal{M}(x, g; \mathcal{C}_2)$ in Step 4 of the framework. For integers $d_{\max}, n^*, \mathrm{cs}^*, \mathrm{ch}^*, \theta^* \in \mathbb{Z}$, let $\mathcal{H}(d_{\max}, n^*, \mathrm{cs}^*, \mathrm{ch}^*, \theta^*)$ denote the set of rank-2 graphs $H$ such that the degree of each core vertex is at most 4, the degree of each non-core vertex is at most $d_{\max}$, $n(H) = n^*$, $\mathrm{cs}(H) = \mathrm{cs}^*$, $\mathrm{ch}(H) = \mathrm{ch}^*$ and $\theta(H) = \theta^*$. In this paper, we obtain the following result.
Theorem 2.
Let $\Lambda$ be a set of chemical elements, $\Gamma$ be a set of adjacency-configurations, where $|\Lambda| \le |\Gamma|$, and $k = 2|\Lambda| + 2|\Gamma| + 15$. Given integers $d_{\max} \in \{3, 4\}$, $n^* \ge 3$, $\mathrm{cs}^* \ge 3$, $\mathrm{ch}^* \ge 0$ and $\theta^*$, there is an MILP $\mathcal{M}(x, g; \mathcal{C}_2)$ that consists of variable vectors $x \in \mathbb{R}^k$ and $g \in \mathbb{R}^q$ for some integer $q = O(|\Gamma| \cdot \mathrm{cs}^* \cdot (d_{\max} - 1)^{\mathrm{ch}^*})$ and a set $\mathcal{C}_2$ of $O(|\Gamma| + \mathrm{cs}^* \cdot (d_{\max} - 1)^{\mathrm{ch}^*})$ constraints on these variables such that: $(x^*, g^*)$ is feasible to $\mathcal{M}(x, g; \mathcal{C}_2)$ if and only if $g^*$ forms a rank-2 chemical graph $G^* = (H, \alpha, \beta) \in \mathcal{G}(\Lambda, \Gamma)$ such that $H \in \mathcal{H}(d_{\max}, n^*, \mathrm{cs}^*, \mathrm{ch}^*, \theta^*)$ and $f(G^*) = x^*$.
Note that our MILP requires only $O(n^*)$ variables and constraints when the maximum core height of a subtree in the non-core of $G^*$ and $|\Gamma|$ are constant. We formulate an MILP in Theorem 2 so that such a graph $H$ is selected as a subgraph of the scheme graph.
We explain the basic idea of our MILP. Define
$$t^* \triangleq \mathrm{cs}^* - s^*,$$
$$c^* \triangleq |E_1 \cup E_2| \quad \text{for } (K, \mathcal{E} = (E_1, E_2, E_3)),$$
$$n_{\mathrm{tree}} \triangleq 1 + 2\bigl((d_{\max} - 1)^{\mathrm{ch}^*} - 1\bigr)/(d_{\max} - 2) \quad \text{and} \quad n_{\mathrm{in}} \triangleq 1 + 2\bigl((d_{\max} - 1)^{\mathrm{ch}^* - 1} - 1\bigr)/(d_{\max} - 2),$$
where $n_{\mathrm{tree}}$ and $n_{\mathrm{in}}$ are the numbers of vertices and non-leaf vertices in the rooted tree $T(d_{\max} - 2, d_{\max} - 1, \mathrm{ch}^*)$, respectively. The MILP mainly consists of the following four types of constraints.
1. Constraints for selecting a rank-2 graph $H$ as a subgraph of the $(t^*, \mathrm{ch}^*, d_{\max})$-tree-extension of the scheme graph $(K, \mathcal{E})$;
2. Constraints for assigning chemical elements to vertices and multiplicity to edges to determine a chemical graph $G = (H, \alpha, \beta)$;
3. Constraints for computing descriptors from the selected rank-2 chemical graph $G$; and
4. Constraints for reducing the number of rank-2 chemical graphs that are isomorphic to each other but can be represented by the above constraints.
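The closed forms $n_{\mathrm{tree}} = 1 + 2((d_{\max} - 1)^{\mathrm{ch}^*} - 1)/(d_{\max} - 2)$ and $n_{\mathrm{in}} = 1 + 2((d_{\max} - 1)^{\mathrm{ch}^* - 1} - 1)/(d_{\max} - 2)$ can be cross-checked against a level-by-level count. The Python sketch below assumes a rooted tree whose root has two children and whose other internal vertices have $d_{\max} - 1$ children, which is our reading of what the closed forms count (it coincides with $T(d_{\max} - 2, d_{\max} - 1, \mathrm{ch}^*)$ when $d_{\max} = 4$). The function names are hypothetical, and the sketch requires $\mathrm{ch}^* \ge 1$.

```python
def n_tree(d_max, ch):
    """Closed form for the number of vertices (d_max in {3, 4}, ch >= 1)."""
    return 1 + 2 * ((d_max - 1) ** ch - 1) // (d_max - 2)

def n_in(d_max, ch):
    """Closed form for the number of non-leaf vertices (ch >= 1)."""
    return 1 + 2 * ((d_max - 1) ** (ch - 1) - 1) // (d_max - 2)

def count_by_levels(d_max, ch):
    """Count vertices depth by depth: 1 root, then 2 * (d_max - 1)^(k - 1)."""
    sizes = [1] + [2 * (d_max - 1) ** (k - 1) for k in range(1, ch + 1)]
    total = sum(sizes)
    internal = sum(sizes[:-1])  # everything above the deepest level
    return total, internal

# Compare the closed forms with the direct count for small parameters.
agree = all(
    (n_tree(d, ch), n_in(d, ch)) == count_by_levels(d, ch)
    for d in (3, 4)
    for ch in range(1, 6)
)
```

Since the tree-extension attaches one such tree to each of the $\mathrm{cs}^* = s^* + t^*$ roots, these counts are what drives the $\mathrm{cs}^* \cdot (d_{\max} - 1)^{\mathrm{ch}^*}$ factor in Theorem 2.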
In the constraints of 1, we treat each edge in the tree-extension as a directed edge because describing some condition for H to belong to H ( d max , n * , cs * , ch * , θ * ) becomes slightly easier than the case of undirected graphs. More formally we prepare the following.
(i) In the scheme graph $(K, \mathcal{E})$, denote the edges in $E_1 \cup E_2 \cup E_3$ by $E_1 = \{a_1, a_2, \ldots, a_{|E_1|}\}$, $E_2 = \{a_{|E_1|+1}, \ldots, a_{c^*}\}$ and $E_3 = \{a_{c^*+1}, \ldots, a_m\}$ (where $c^* = |E_1 \cup E_2|$), and regard each edge $a_i = u_{s,1}u_{s',1} \in E_1 \cup E_2 \cup E_3$ as a directed edge from one end-vertex $u_{s,1}$ to the other end-vertex $u_{s',1}$ with $s < s'$. Let $a(i)$ be a binary variable for each edge $a_i$, $i \in [1, m]$;
(ii) In each tree $S_s$ (resp., $T_t$) in the tree-extension, we regard each edge $e'_{s,i}$, $i \ge 2$, in the rooted tree $S_s$, $s \in [1, s^*]$ (resp., $e_{t,i}$, $i \ge 2$, in the rooted tree $T_t$, $t \in [1, t^*]$) as a directed edge from vertex $u_{s,\mathrm{prt}(i)}$ to vertex $u_{s,i}$ (resp., from vertex $v_{t,\mathrm{prt}(i)}$ to vertex $v_{t,i}$). Let $u(s,i)$ (resp., $v(t,i)$) be a binary variable for vertex $u_{s,i}$, $s \in [1, s^*]$ (resp., $v_{t,i}$, $t \in [1, t^*]$) and $i \in [1, n_{\mathrm{tree}}]$;
(iii) In the path $P_{t^*}$ consisting of the roots of trees $T_t$, $t \in [1, t^*]$, we regard each edge $e_t$, $t \in [2, t^*]$, as a directed edge from vertex $v_{t-1,1}$ to vertex $v_{t,1}$; and
(iv) We regard each edge $u_{s,1}v_{t,1}$ for $s \in [1, s^*]$ and $t \in [1, t^*]$ as two directed edges, one directed from vertex $u_{s,1}$ to vertex $v_{t,1}$ and the other directed oppositely. Let $e(s,t)$ (resp., $e(t,s)$) be a binary variable of directed edge $(u_{s,1}, v_{t,1})$ (resp., $(v_{t,1}, u_{s,1})$).
Based on these, we include constraints with some additional variables so that a selected subgraph $H$ is a connected rank-2 graph. See Equations (A10) to (A42) in Appendix A for the details.
In the constraints of 2, we prepare an integer variable $\tilde{\alpha}(u)$ for each vertex $u$ in the tree-extension that represents the chemical element $\alpha(u) \in \Lambda$ if $u$ is in a selected graph $H$ (or $\tilde{\alpha}(u) = 0$ otherwise), and an integer variable $\tilde{\beta}(e) \in [0,3]$ (resp., $\hat{\beta}(e) \in [0,3]$) for each edge $e$ (resp., $e = e(s,t)$ or $e(t,s)$, $s \in [1,s^*]$, $t \in [1,t^*]$) in the tree-extension that represents the multiplicity $\beta(e) \in [1,3]$ if $e$ is in a selected graph $H$ (or $\tilde{\beta}(e)$ or $\hat{\beta}(e)$ takes 0 otherwise). This determines a chemical graph $G = (H, \alpha, \beta)$. We also include constraints for a selected chemical graph $G$ to satisfy the valence condition $(\alpha(u), \alpha(v), \beta(uv)) \in \Gamma$ for each edge $uv \in E$. See Equations (A43) to (A61) in Appendix A for the details.
In the constraints of 3, we introduce a variable for each descriptor and constraints with some additional variables to compute the value of each descriptor in $f(G)$ for a selected chemical graph $G$. See Equations (A62) to (A113) in Appendix A for the details.
With constraints 1 to 3, our MILP formulation already represents a rank-2 chemical graph $G$ and a feature vector $x \in \mathbb{R}^k$ so that $x = f(G)$ holds. In the constraints of 4, we include some additional constraints that reduce the search space an MILP solver must explore to solve an instance of our MILP problem. For this, we consider a graph-isomorphism of rooted subtrees of each tree $S_s$ or $T_t$ and define a canonical form among subtrees that are isomorphic to each other. We try to eliminate any chemical graph $G$ that has a subtree in $S_s$ or $T_t$ that is not in canonical form. See Equations (A114) to (A119) in Appendix A for the details.

3. Results

We implemented our method of Steps 1 to 5 for inferring rank-2 chemical graphs and conducted experiments to evaluate the computational efficiency for three chemical properties $\pi$: octanol/water partition coefficient (Kow), melting point (Mp), and boiling point (Bp). We executed the experiments on a PC with an Intel Core i5 1.6 GHz CPU and 8 GB of RAM running Mac OS version 10.14.6. We show 2D drawings of some of the inferred chemical graphs, where ChemDoodle version 10.2.0 is used for constructing the drawings.
Results on Phase 1.
Step 1. We set a graph class $\mathcal{G}$ to be the set of all rank-2 chemical graphs. For each property $\pi \in \{$Kow, Mp, Bp$\}$, we selected a set $\Lambda$ of chemical elements and collected a data set $D_\pi$ of rank-2 chemical graphs over $\Lambda$ provided by HSDB from PubChem. To construct the data set, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as $\mathrm{N}^+$ or an element $a \in \Lambda$ whose valence is different from our setting of the valence function $\mathrm{val}$.
Table 1 shows the size and range of data sets that we prepared for each chemical property in Step 1, where we denote the following:
- $\pi$: one of the chemical properties Kow, Mp and Bp;
- $|D_\pi|$: the size of data set $D_\pi$ for property $\pi$;
- $\Lambda$: the set of chemical elements over data set $D_\pi$ (hydrogen atoms are added at the final stage);
- $|\Gamma|$: the number of tuples in $\Gamma$;
- $[\underline{n}, \overline{n}]$: the minimum and maximum number $n(G)$ of non-hydrogen atoms over data set $D_\pi$;
- $[\underline{\mathrm{cs}}, \overline{\mathrm{cs}}]$, $[\underline{\mathrm{ch}}, \overline{\mathrm{ch}}]$: the minimum and maximum core size and core height over chemical compounds in $D_\pi$, respectively;
- $[\underline{\theta}, \overline{\theta}]$: the minimum and maximum values of the topological parameter $\theta(G)$ over data set $D_\pi$; and
- $[\underline{a}, \overline{a}]$: the minimum and maximum values of $a(G)$ in $\pi$ over data set $D_\pi$.
Step 2. We used a feature function f that consists of the descriptors defined in Section 2.1.
Step 3. We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs $\mathcal{N}$, where the tool and activation function are set to be MLPRegressor and ReLU, respectively. We tested several different architectures of ANNs for each chemical property. To evaluate the performance of the resulting prediction function $\psi_{\mathcal{N}}$ with cross-validation, we partitioned a given data set $D_\pi$ into five subsets $D_\pi^{(i)}$, $i \in [1,5]$, randomly, where $D_\pi \setminus D_\pi^{(i)}$ is used as a training set and $D_\pi^{(i)}$ is used as a test set in five trials $i \in [1,5]$. For a set $\{y_1, y_2, \ldots, y_N\}$ of observed values and a set $\{\psi_1, \psi_2, \ldots, \psi_N\}$ of predicted values, we define the coefficient of determination to be $R^2 \triangleq 1 - \frac{\sum_{j \in [1,N]} (y_j - \psi_j)^2}{\sum_{j \in [1,N]} (y_j - \overline{y})^2}$, where $\overline{y} = \frac{1}{N} \sum_{j \in [1,N]} y_j$. Table 2 shows the results on Steps 2 and 3, where
- $k$: the number of descriptors for the chemical compounds in data set $D_\pi$ for property $\pi$;
- Activation: the choice of activation function;
- Architecture: $(a, b, 1)$ consists of an input layer with $a$ nodes, a hidden layer with $b$ nodes, and an output layer with a single node, where $a$ is equal to the number of descriptors;
- L-time: the average time (sec.) to construct ANNs for each trial;
- test $R^2$ (ave.): the average of the coefficient of determination over the five test sets; and
- test $R^2$ (best): the largest value of the coefficient of determination over the five test sets.
For each chemical property $\pi$, we selected the ANN $\mathcal{N}$ that attained the best test $R^2$ score among the five ANNs to formulate an MILP $\mathcal{M}(x, y, z; \mathcal{C}_1)$ in the second phase.
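The evaluation protocol of Step 3 can be sketched in plain Python. The snippet below is only an illustration under stated assumptions: the function names `r2_score` and `five_fold_indices`, the seed, and the toy values are hypothetical and are not taken from the authors' implementation, which relies on scikit-learn. It computes the coefficient of determination exactly as defined above and builds a random five-way partition of sample indices.

```python
import random

def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

def five_fold_indices(n_samples, seed=0):
    """Randomly partition the indices 0..n_samples-1 into five folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::5] for i in range(5)]

# Toy values (hypothetical): observed property values and model predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
score = r2_score(observed, predicted)

# Each fold serves once as the test set; the other four form the training set.
folds = five_fold_indices(23)
splits = []
for i, test_fold in enumerate(folds):
    train = [j for k, fold in enumerate(folds) if k != i for j in fold]
    splits.append((train, test_fold))
```

In an actual run, a regressor would be fitted on each `train` list and scored with `r2_score` on the matching test fold, giving the five test scores whose average and maximum are reported in Table 2.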
Results on Phase 2.
We implemented Steps 4 and 5 in Phase 2 as follows.
Step 4. In this step, we solve the MILP $\mathcal{M}(x, y, g; \mathcal{C}_1, \mathcal{C}_2)$ formulated based on the ANN $\mathcal{N}$ obtained in Phase 1. To solve an MILP in Step 4, we use CPLEX version 12.10. In our experiment, we choose a target value $y^* \in [\underline{a}, \overline{a}]$ and fix or bound some descriptors in our feature vector as follows:
- Fix the variable $\theta$ that represents the polymer parameter $\theta(H)$ to be each integer in $\{-2, 0, 2\}$;
- Set $d_{\max}$ to be each of 3 and 4;
- Fix $n^*$ to be some four integers in $\{15, 19, 20, 25, 30\}$ for $\theta \in \{-2, 0\}$ and in $\{15, 19, 20, 22, 25\}$ for $\theta = 2$;
- Choose three integers from $[7, 16]$ and fix $\mathrm{cs}^*$ to be each of the three integers;
- Fix $\mathrm{ch}^*$ to be each of the four integers in $[2, 5]$.
Based on the above setting, we generated 12 instances for each n * . We set ε = 0.02 in Step 4.
Tables 3–8 show the results of Step 4 for $d_{\max} = 3$ and 4, respectively, where we denote the following:
- $y_\pi^*$: a target value in $[\underline{a}, \overline{a}]$ for a property $\pi$;
- $n^*$: a specified number of vertices in $[\underline{n}, \overline{n}]$;
- $|F^*|$/#I: #I means the number of MILP instances in Step 4 (where #I = 12), and $|F^*|$ means the size of the set $F^*$ of vectors $x^*$ generated from all feasible instances among the #I MILP instances in Step 4;
- IP-time: the average time (sec.) to solve one of the #I MILP instances to find a set $F^*$ of vectors $x^*$.
Figure 10a–c illustrate some rank-2 chemical graphs $G^*$ with $\theta(G^*) = -2$ constructed from the vector $g^*$ obtained by solving the MILP in Step 4.
Figure 11a–c illustrate some rank-2 chemical graphs $G^*$ with $\theta(G^*) = 0$ constructed from the vector $g^*$ obtained by solving the MILP in Step 4.
Figure 12a–c illustrate some rank-2 chemical graphs $G^*$ with $\theta(G^*) = 2$ constructed from the vector $g^*$ obtained by solving the MILP in Step 4.
Step 5. In this step, we modified the algorithms proposed by Tamura et al. [25] and Yamashita et al. [26] to enumerate all rank-2 graphs $G^* \in \mathcal{G}$ such that $f(G^*) = x^*$ for each $x^* \in F^*$. We stop the execution when either the total number of graphs inferred over all vectors $x^* \in F^*$ exceeds 100 or the execution time exceeds one hour.
Tables 3–8 show the results on Step 5 for $d_{\max} = 3$ and 4, respectively, where we denote the following:
- #$G^*$: the number of all (or up to 100) rank-2 chemical graphs $G^*$ that are computed under the 1 h time limit in Step 5, where $f(G^*) = x^*$ for some $x^* \in F^*$. (Note that $|F^*|$ such graphs $G^*$ have been found in Step 4, and Figures 10, 11 and 12 illustrate some of these graphs $G^*$.);
- G-time: the running time (sec.) to execute Step 5, where “>1 h” means that the execution time exceeds the limit.
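The stopping rule of Step 5 (a cap on the number of graphs and a wall-clock limit) can be sketched as a generic collection loop. This is a minimal illustration, not the authors' enumeration code: `collect_until_limit` and the dummy generators are hypothetical names, and the cap is applied as "stop once 100 graphs are collected", which is one reading of the rule stated above.

```python
import time

def collect_until_limit(generators, max_graphs=100, time_limit_sec=3600.0):
    """Drain graph generators until the graph cap or the time limit is hit."""
    start = time.monotonic()
    found = []
    for gen in generators:
        for graph in gen:
            found.append(graph)
            if len(found) >= max_graphs:
                return found, "graph limit"
            if time.monotonic() - start > time_limit_sec:
                return found, "time limit"
    return found, "exhausted"

# Hypothetical stand-ins for the enumerators attached to two vectors x*.
dummy = [(f"G{v}_{i}" for i in range(80)) for v in range(2)]
graphs, reason = collect_until_limit(dummy)

# A small enumerator that finishes before either limit applies.
few, few_reason = collect_until_limit([iter(range(3))])
```

Using generators keeps the enumeration lazy, so graphs beyond the cap are never constructed.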
We also conducted some additional experiments to demonstrate that our MILP-based method is flexible to control conditions on the inference of chemical graphs. In Step 3, we constructed an ANN N π for each of the three chemical properties π { Kow, Mp, Bp}, and formulated the inverse problem of each ANN N π as an MILP M π . Since the set of descriptors is common to all three properties Kow, Mp, and Bp, it is possible to infer a rank-2 chemical graph G * that satisfies a target value y π * for each of the three properties at the same time (if one exists). We specify the size of graph so that n : = 22 , core size := 14, core height := 3, θ : = 2 and d max : = 3 , and set target values with y K ow * : = 5 , y Mp * : = 150 and y Bp * : = 250 in an MILP that consists of the three MILPs M K ow , M Mp and M Bp . The MILP was solved in 268.11 (sec) and we obtained a rank-2 chemical graph G * illustrated in Figure 10d.

4. Discussion

In this paper, we proposed a new method for the inverse QSAR/QSPR to rank-2 chemical graphs by significantly enhancing the framework due to Azam et al. [18], Zhang et al. [20], and Ito et al. [21], and implemented it for inferring rank-2 chemical graphs using the algorithms for enumerating rank-2 chemical graphs due to Tamura et al. [25] and Yamashita et al. [26]. From the results of the computational experiments, we observe that the proposed method runs efficiently for an instance with $n^* \le 30$ non-hydrogen atoms up to Step 4 and for an instance with $n^* \le 15$ non-hydrogen atoms up to Step 5. Due to this development, the ratio of chemical compounds covered in the PubChem database increased from 16.26% to 44.5%. It is left as future work to apply our new method for the inverse QSAR/QSPR to a wider class of graphs. The ratio of the number of chemical graphs with rank at most 3 (resp., 4) to the number of all chemical graphs in the PubChem database is 68.8% (resp., 84.7%). Among rank-4 chemical compounds, Remdesivir $\mathrm{C_{27}H_{35}N_6O_8P}$, an antiviral medication that is being studied as a possible post-infection treatment for COVID-19, has a chemical graph $G$ with $r(G) = 4$, $n(G) = 42$, $\mathrm{cs}(G) = 24$, and $\mathrm{ch}(G) = 8$. The number of polymer topologies with rank 3 (resp., 4) such that the maximum degree is at most 4 is 12 (resp., 73). Our MILP formulation can be easily extended to the case of rank 3 or 4 by replacing the current set of constraints for the scheme graph with a set of those for a new scheme graph that is designed for rank-3 or rank-4 polymer topologies.

Author Contributions

Conceptualization, H.N. and T.A.; methodology, H.N.; software, J.Z., C.W., and A.S.; validation, J.Z., C.W., A.S., and H.N.; formal analysis, H.N.; data resources, H.N. and T.A.; writing—original draft preparation, H.N.; writing—review and editing, T.A.; project administration, H.N.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

H.N. and T.A. were partially supported by the Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. All Constraints in an MILP Formulation for Rank-2 Chemical Graphs

To formulate an MILP that represents a chemical graph $G = (H, \alpha, \beta)$, we distinguish a tuple $(a, b, k)$ from a tuple $(b, a, k)$. For a tuple $\gamma = (a, b, k) \in \Lambda \times \Lambda \times \{1, 2, 3\}$, let $\overline{\gamma}$ denote the tuple $(b, a, k)$. Let $\Gamma_< \triangleq \{\overline{\gamma} \mid \gamma \in \Gamma_>\}$. We call a tuple $\gamma = (a, b, k) \in \Lambda \times \Lambda \times \{1, 2, 3\}$ proper if
$k \le \min\{\mathrm{val}(a), \mathrm{val}(b)\}$ and $k \le \max\{\mathrm{val}(a), \mathrm{val}(b)\} - 1$,
where the latter is assumed because otherwise $G$ must consist of two atoms of $a = b$. Assume that each tuple $\gamma \in \Gamma$ is proper. Let $\epsilon$ be a fictitious chemical element that represents null, call a tuple $(a, b, 0)$ with $a, b \in \Lambda \cup \{\epsilon\}$ fictitious, and define $\Gamma_0$ to be the set of all fictitious tuples; i.e., $\Gamma_0 = \{(a, b, 0) \mid a, b \in \Lambda \cup \{\epsilon\}\}$. To represent chemical elements $e \in \Lambda \cup \{\epsilon\} \cup \Gamma$ in an MILP, we encode these elements $e$ into some integers denoted by $[e]$. Assume that, for each element $a \in \Lambda$, $[a]$ is a positive integer and that $[\epsilon] = 0$.
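The properness test and the integer coding can be illustrated with a few lines of Python. This is a sketch under assumptions: the valence table `VAL`, the coding `CODE`, and the helper names are hypothetical choices made for illustration, and any injective coding with $[\epsilon] = 0$ would serve equally well.

```python
# Assumed valences for a small element set (illustrative only).
VAL = {"C": 4, "N": 3, "O": 2}

# Hypothetical coding [a]: the null element maps to 0, elements to distinct
# positive integers.
CODE = {"eps": 0, "C": 1, "N": 2, "O": 3}

def is_proper(a, b, k, val=VAL):
    """Check k <= min(val(a), val(b)) and k <= max(val(a), val(b)) - 1."""
    return 1 <= k <= min(val[a], val[b]) and k <= max(val[a], val[b]) - 1

def reverse(gamma):
    """Return the reversed tuple (b, a, k) of gamma = (a, b, k)."""
    a, b, k = gamma
    return (b, a, k)

carbonyl_ok = is_proper("C", "O", 2)   # C=O can still bond to other atoms
dioxygen_ok = is_proper("O", "O", 2)   # O=O saturates both atoms
```

The second condition is what rejects `("O", "O", 2)`: a double bond between two oxygen atoms uses up both valences, so the graph could contain nothing else.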

Appendix A.1. Applicability Domain

We use the range-based method to define an applicability domain for our method. For this, we find the range (the minimum and maximum) of each descriptor over all relevant chemical compounds and represent each range as a set of linear constraints in the constraint set $\mathcal{C}_1$ of our MILP formulation. Recall that $D_\pi$ stands for a set of chemical graphs used for constructing a prediction function. However, the number of examples in $D_\pi$ may not be large enough to capture general features of the structure of chemical graphs. For this reason, we also use some data from the whole set DB of chemical graphs in a database. Let $\mathrm{DB}_{\mathcal{G}}^{(i)}$ denote the set of chemical graphs $G \in \mathrm{DB} \cap \mathcal{G}$ such that $n(G) = i$ for each integer $i \ge 1$. Formally, the set of variables and constraints is given as follows.
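AD constraints in $\mathcal{C}_1$:
constants:
Integers $\mathrm{cs}^* \ge 3$ and $\mathrm{ch}^* \ge 1$; an integer $d_{\max} \in \{3, 4\}$;
An integer $n^* \in [\mathrm{cs}^* + 1, \mathrm{cs}^* \cdot (d_{\max} - 1)^{\mathrm{ch}^*}]$;
variables for descriptors in $x$:
A real variable $\kappa_1 \ge 0$: $\kappa_1$ represents $\kappa_1(H)$;
$\mathrm{dg}(i) \in [0, n^*]$ ($i \in [1, 4]$): $\mathrm{dg}(i)$ represents the number of vertices of degree $i$ in $H$;
$\mathrm{Mass} \in \mathbb{Z}$: $\mathrm{Mass}$ represents $\sum_{v \in V} \mathrm{mass}^*(\alpha(v))$;
$\mathrm{ce}^{\mathrm{co}}(a) \in [0, n^*]$, $a \in \Lambda$: $\mathrm{ce}^{\mathrm{co}}(a)$ represents the number of vertices of chemical element $a$ in the core of $H$;
$\mathrm{ce}^{\mathrm{nc}}(a) \in [0, n^*]$, $a \in \Lambda$: $\mathrm{ce}^{\mathrm{nc}}(a)$ represents the number of vertices of chemical element $a$ in the non-core of $H$;
$\mathrm{b}^{\mathrm{co}}(k) \in [0, 2n^*]$, $k \in [1, 3]$: $\mathrm{b}^{\mathrm{co}}(k)$ represents the number of $k$-bonds in the core of $H$;
$\mathrm{b}^{\mathrm{nc}}(k) \in [0, 2n^*]$, $k \in [1, 3]$: $\mathrm{b}^{\mathrm{nc}}(k)$ represents the number of $k$-bonds in the non-core of $H$;
$\mathrm{ac}^{\mathrm{co}}(\gamma) \in [0, n^*]$, $\gamma \in \Gamma_< \cup \Gamma_=$: $\mathrm{ac}^{\mathrm{co}}(\gamma)$ represents the number of core edges in $H$ that are assigned tuple $\gamma \in \Gamma_<$;
$\mathrm{ac}^{\mathrm{nc}}(\gamma) \in [0, n^*]$, $\gamma \in \Gamma_< \cup \Gamma_=$: $\mathrm{ac}^{\mathrm{nc}}(\gamma)$ represents the number of non-core edges in $H$ that are assigned tuple $\gamma \in \Gamma_<$;
constraints:
$n^* \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\kappa_1(G)}{n(G)} \le \kappa_1 \le n^* \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\kappa_1(G)}{n(G)}$,
$n^* \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{dg}_i(G)}{n(G)} \le \mathrm{dg}(i) \le n^* \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{dg}_i(G)}{n(G)}$, $i \in [1, 4]$,
$n^* \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \overline{\mathrm{ms}}(G) \le \mathrm{Mass} \le n^* \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \overline{\mathrm{ms}}(G)$,
$n^* \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ce}_a^{\mathrm{co}}(G)}{n(G)} \le \mathrm{ce}^{\mathrm{co}}(a) \le n^* \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ce}_a^{\mathrm{co}}(G)}{n(G)}$, $a \in \Lambda$,
$n^* \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ce}_a^{\mathrm{nc}}(G)}{n(G)} \le \mathrm{ce}^{\mathrm{nc}}(a) \le n^* \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ce}_a^{\mathrm{nc}}(G)}{n(G)}$, $a \in \Lambda$,
$(n^* + 1) \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{b}_k^{\mathrm{co}}(G)}{n(G) + 1} \le \mathrm{b}^{\mathrm{co}}(k) \le (n^* + 1) \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{b}_k^{\mathrm{co}}(G)}{n(G) + 1}$, $k \in [2, 3]$,
$(n^* + 1) \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{b}_k^{\mathrm{nc}}(G)}{n(G) + 1} \le \mathrm{b}^{\mathrm{nc}}(k) \le (n^* + 1) \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{b}_k^{\mathrm{nc}}(G)}{n(G) + 1}$, $k \in [2, 3]$,
$(n^* + 1) \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ac}_\gamma^{\mathrm{co}}(G)}{n(G) + 1} \le \mathrm{ac}^{\mathrm{co}}(\gamma) \le (n^* + 1) \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ac}_\gamma^{\mathrm{co}}(G)}{n(G) + 1}$, $\gamma \in \Gamma$,
$(n^* + 1) \cdot \min_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ac}_\gamma^{\mathrm{nc}}(G)}{n(G) + 1} \le \mathrm{ac}^{\mathrm{nc}}(\gamma) \le (n^* + 1) \cdot \max_{G \in D_\pi \cup \mathrm{DB}_{\mathcal{G}}^{(n^*)}} \frac{\mathrm{ac}_\gamma^{\mathrm{nc}}(G)}{n(G) + 1}$, $\gamma \in \Gamma$.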
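The range-based bounds above all follow one pattern: normalize a descriptor by the graph size (or by the number of edges, $n(G) + 1$ for a rank-2 graph), take the minimum and maximum over the reference graphs, and rescale to the target size $n^*$. A minimal sketch, assuming the reference set is given as a list of dictionaries with hypothetical keys (`"n"`, `"dg1"`, `"b2co"`) and toy values:

```python
def ad_bounds(n_star, graphs, key, per_edge=False):
    """Lower/upper bound on a descriptor for a graph with n_star vertices.

    per_edge=True normalizes by n(G) + 1, the number of edges of a rank-2
    graph, as in the bond and adjacency-configuration constraints.
    """
    shift = 1 if per_edge else 0
    ratios = [g[key] / (g["n"] + shift) for g in graphs]
    return (n_star + shift) * min(ratios), (n_star + shift) * max(ratios)

# Toy reference set (hypothetical numbers, standing in for the union of the
# training data and the database graphs with n(G) = n*).
graphs = [
    {"n": 10, "dg1": 2, "b2co": 2},
    {"n": 21, "dg1": 6, "b2co": 11},
]
lo_dg, hi_dg = ad_bounds(15, graphs, "dg1")
lo_b, hi_b = ad_bounds(15, graphs, "b2co", per_edge=True)
```

The two returned numbers become the constant left-hand and right-hand sides of one pair of linear inequalities in the MILP.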
In the following, we derive an MILP $\mathcal{M}(x, g; \mathcal{C}_2)$ that satisfies the condition in Theorem 2. Let $d_{\max} \in \{3, 4\}$, $n^* \ge 3$, $\mathrm{cs}^* \ge 3$, $\mathrm{ch}^* \ge 0$ and $\theta^*$ be given integers. We describe the set $\mathcal{C}_2$ with several sets of constraints.

Appendix A.2. Construction of Scheme Graph and Tree-Extension

We infer a subgraph $H$ such that the maximum degree is $d_{\max} \in \{3, 4\}$, $n(H) = n^*$, $\mathrm{cs}(H) = \mathrm{cs}^*$ and $\mathrm{ch}(H) = \mathrm{ch}^*$. For this, we first construct the $(t^*, \mathrm{ch}^*, d_{\max})$-tree-extension of the scheme graph $(K = (V_K = \{u_1, \ldots, u_{s^*}\}, E_K = \{a_1, a_2, \ldots, a_m\}), \mathcal{E} = (E_1, E_2, E_3))$. We use the following notation: for $j \in [1, 3]$ and $s \in [1, s^*]$, let $E_j^+(s)$ (resp., $E_j^-(s)$) denote the set of indices $i$ of edges $a_i \in E_j$ such that the tail (resp., head) of $a_i$ is $u_{s,1}$. Let $E_{j,k}^+(s) \triangleq E_j^+(s) \cup E_k^+(s)$, $E_{j,k}^-(s) \triangleq E_j^-(s) \cup E_k^-(s)$, $E_j(s) \triangleq E_j^+(s) \cup E_j^-(s)$ and $E_{j,k}(s) \triangleq E_j(s) \cup E_k(s)$.
As described in Section 2.3.1, some edge $a_i \in E_1 \cup E_2$ may be replaced with a subpath $P_i$ of $(v_{1,1}, v_{2,1}, \ldots, v_{t^*,1})$, which consists of the roots of trees $T_1, T_2, \ldots, T_{t^*}$. We assign color $i$ to the vertices in such a subpath $P_i$ by setting a variable $\chi(t)$ of each vertex $v_{t,1} \in V(P_i)$ to be $i$. For each edge $u_{s,1}v_{t,1}$, we prepare a binary variable $e(s,t)$ to denote that edge $u_{s,1}v_{t,1}$ is used (resp., not used) in a selected graph $H$ when $e(s,t) = 1$ (resp., $e(s,t) = 0$). We also include constraints necessary for the variables to satisfy a degree condition at each of the vertices $u_{s,1}$, $s \in [1, s^*]$ and $v_{t,1}$, $t \in [1, t^*]$.
constants:
Integers $s^* = |V_K|$, $c^* = |E_1 \cup E_2|$, $\mathrm{cs}^* (\ge s^*)$, $n^* (\ge \mathrm{cs}^*)$ and $\mathrm{ch}^* \ge 0$;
$\underline{d}^+(s)$, $s \in [1, s^*]$: a lower bound on the out-degree of vertex $u_{s,1}$ in $H$;
$\underline{d}^-(s)$, $s \in [1, s^*]$: a lower bound on the in-degree of vertex $u_{s,1}$ in $H$;
$\overline{d}^+(s)$, $s \in [1, s^*]$: an upper bound on the out-degree of vertex $u_{s,1}$ in $H$;
$\overline{d}^-(s)$, $s \in [1, s^*]$: an upper bound on the in-degree of vertex $u_{s,1}$ in $H$;
variables:
$a(i) \in \{0, 1\}$, $i \in E_1 \cup E_3$: $a(i)$ represents edge $a_i \in E_1 \cup E_3$ ($a(i) = 1$, $i \in E_1$) ($a(i) = 1$ ⇔ edge $a_i$ is used in $H$);
$e(s,t), e(t,s) \in \{0, 1\}$, $s \in [1, s^*]$, $t \in [1, t^*]$: $e(s,t)$ (resp., $e(t,s)$) represents direction $(u_{s,1}, v_{t,1})$ (resp., $(v_{t,1}, u_{s,1})$), where $e(s,t) = 1$ (resp., $e(t,s) = 1$) ⇔ edge $u_{s,1}v_{t,1}$ is used in $H$ and direction $(u_{s,1}, v_{t,1})$ (resp., $(v_{t,1}, u_{s,1})$) is assigned to edge $u_{s,1}v_{t,1}$;
$\chi(t) \in [1, c^*]$, $t \in [1, t^*]$: $\chi(t)$ represents the color assigned to vertex $v_{t,1}$ ($\chi(t) = c$ ⇔ vertex $v_{t,1}$ is assigned color $c$);
$\mathrm{clr}(c) \in [0, n^* - s^*]$, $c \in [1, c^*]$: the number of vertices $v_{t,i}$ with color $c$;
$\deg_{\mathrm{co}}^+(s) \in [1, 4]$, $s \in [1, s^*]$: the out-degree of vertex $u_{s,1}$ in the core of $H$;
$\deg_{\mathrm{co}}^-(s) \in [1, 4]$, $s \in [1, s^*]$: the in-degree of vertex $u_{s,1}$ in the core of $H$;
$\delta_{\mathrm{clr}}(t, c) \in \{0, 1\}$, $t \in [1, t^*]$, $c \in [1, c^*]$ ($\delta_{\mathrm{clr}}(t, c) = 1 \Leftrightarrow \chi(t) = c$);
constraints:
$\sum_{c \in [1, c^*]} \delta_{\mathrm{clr}}(t, c) = 1$, $t \in [1, t^*]$,
$\sum_{c \in [1, c^*]} c \cdot \delta_{\mathrm{clr}}(t, c) = \chi(t)$, $t \in [1, t^*]$,
$\sum_{t \in [1, t^*]} \delta_{\mathrm{clr}}(t, c) = \mathrm{clr}(c)$, $c \in [1, c^*]$,
$e(s,t) + e(t,s) \le 1$, $s \in [1, s^*]$, $t \in [1, t^*]$,
$\sum_{s \in [1, s^*] \setminus \{\mathrm{head}(c)\}} e(t,s) \le 1 - \delta_{\mathrm{clr}}(t, c)$, $c \in [1, c^*]$, $t \in [1, t^*]$,
$\sum_{s \in [1, s^*] \setminus \{\mathrm{tail}(c)\}} e(s,t) \le 1 - \delta_{\mathrm{clr}}(t, c)$, $c \in [1, c^*]$, $t \in [1, t^*]$,
$\sum_{i \in E_{1,3}^-(s)} a(i) + \sum_{t \in [1, t^*]} e(t,s) = \deg_{\mathrm{co}}^-(s)$, $s \in [1, s^*]$,
$\sum_{i \in E_{1,3}^+(s)} a(i) + \sum_{t \in [1, t^*]} e(s,t) = \deg_{\mathrm{co}}^+(s)$, $s \in [1, s^*]$,
$\underline{d}^+(s) \le \deg_{\mathrm{co}}^+(s) \le \overline{d}^+(s)$, $s \in [1, s^*]$,
$\underline{d}^-(s) \le \deg_{\mathrm{co}}^-(s) \le \overline{d}^-(s)$, $s \in [1, s^*]$.

Appendix A.3. Specification for Chemical Graphs with Rank 2

To generate any of the three rank-2 polymer topologies in $\mathcal{PT}(2, 4)$, we use the scheme graph $(K = (V_K = \{u_1, u_2, u_3, u_4\}, E_K), \mathcal{E} = (E_1, E_2, E_3))$ in Figure 2d, where $s^* = |V(K)| = 4$, $c^* = |E_1 \cup E_2| = 5$, $E_1 = \{a_1 = (u_1, u_4), a_2 = (u_2, u_3), a_3 = (u_2, u_4)\}$, $E_2 = \{a_4 = (u_1, u_2), a_5 = (u_3, u_4)\}$ and $E_3 = \{a_6 = (u_1, u_2), a_7 = (u_3, u_4)\}$. Recall that each color $i \in [1, c^*]$ is assigned to edge $a_i \in E_1 \cup E_2$. We impose some more constraints on the degree of each of the vertices $u_{s,1}$, $s \in [1, s^*]$ and $v_{t,1}$, $t \in [1, t^*]$ so that the core of a selected graph $H$ satisfies one of the three least simple graphs in Figure 2a–c. We also let a variable $\theta$ mean the topological parameter $\theta(H)$ of a selected subgraph $H$.
constants:
$s^* = 4$, $c^* = 5$,
$E_1^-(1) = \emptyset$, $E_2^-(1) = \emptyset$, $E_3^-(1) = \emptyset$, $E_1^+(1) = \{1\}$, $E_2^+(1) = \{4\}$, $E_3^+(1) = \{6\}$,
$E_1^-(2) = \emptyset$, $E_2^-(2) = \{4\}$, $E_3^-(2) = \{6\}$, $E_1^+(2) = \{2, 3\}$, $E_2^+(2) = \emptyset$, $E_3^+(2) = \emptyset$,
$E_1^-(3) = \{2\}$, $E_2^-(3) = \emptyset$, $E_3^-(3) = \emptyset$, $E_1^+(3) = \emptyset$, $E_2^+(3) = \{5\}$, $E_3^+(3) = \{7\}$,
$E_1^-(4) = \{1, 3\}$, $E_2^-(4) = \{5\}$, $E_3^-(4) = \{7\}$, $E_1^+(4) = \emptyset$, $E_2^+(4) = \emptyset$, $E_3^+(4) = \emptyset$,
$\underline{d}^-(1) = 0$, $\overline{d}^-(1) = 0$, $\underline{d}^+(1) = 2$, $\overline{d}^+(1) = 2$,
$\underline{d}^-(2) = 1$, $\overline{d}^-(2) = 2$, $\underline{d}^+(2) = 1$, $\overline{d}^+(2) = 2$,
$\underline{d}^-(3) = 1$, $\overline{d}^-(3) = 1$, $\underline{d}^+(3) = 1$, $\overline{d}^+(3) = 2$,
$\underline{d}^-(4) = 2$, $\overline{d}^-(4) = 3$, $\underline{d}^+(4) = 0$, $\overline{d}^+(4) = 0$,
variables:
$\theta \in [-n^*, n^*]$: the topology-parameter $\theta(H)$ for rank 2;
constraints:
a ( 2 ) + clr ( 2 ) 1 ,
a ( 3 ) + clr ( 3 ) + clr ( 4 ) 1 ,
clr ( 4 ) clr ( 5 ) ,
clr ( 3 ) clr ( 2 ) + 1 ,
clr ( 3 ) clr ( 1 ) + 1 + n * ( 3 deg co ( 4 ) ) ,
θ 1 + clr ( 2 ) + n * ( 2 deg co + ( 3 ) ) ,
θ 1 + clr ( 2 ) n * ( 2 deg co + ( 3 ) ) ,
θ n * ( 4 deg co + ( 2 ) deg co ( 2 ) ) ,
θ n * ( 4 deg co + ( 2 ) deg co ( 2 ) ) ,
θ 1 + clr ( 3 ) + n * ( 3 deg co ( 4 ) ) ,
θ 1 + clr ( 3 ) n * ( 3 deg co ( 4 ) ) .

Appendix A.4. Selecting A Subgraph

We prepare a binary variable $u(s,i)$ (resp., $v(t,i)$) for each vertex $u_{s,i}$ in tree $S_s$ (resp., $v_{t,i}$ in tree $T_t$). We include constraints so that the path $(v_{1,1}, v_{2,1}, \ldots, v_{t^*,1})$ is partitioned into subpaths $P_c$, $c \in [1, c^*]$, where possibly some $P_c$ is empty, and the resulting subgraph $H$ becomes a connected rank-2 graph with $n(H) = n^*$, $\mathrm{cs}(H) = \mathrm{cs}^*$, $\mathrm{ch}(H) = \mathrm{ch}^*$ and $\theta(H) = \theta^*$.
constants:
Integers $d_{\max} \in \{3, 4\}$, $\mathrm{ch}^* \ge 0$;
Prepare the set $\mathrm{Cld}(i)$ of the indices of children of a vertex $v_i$,
the index $\mathrm{prt}(i)$ of the parent of a non-root vertex $v_i$, and
the set $\mathrm{Dst}(h)$ of indices $i$ such that the height of a vertex $v_i$ is $h$
in the rooted tree $T(2, d_{\max} - 1, \mathrm{ch}^*)$;
variables:
$u(s,i) \in \{0, 1\}$, $s \in [1, s^*]$, $i \in [1, n_{\mathrm{tree}}]$: $u(s,i)$ represents vertex $u_{s,i}$
($u(s,i) = 1$ ⇔ vertex $u_{s,i}$ is used in $H$ and edge $e'_{s,i}$ ($i \ge 2$) is used in $H$);
$v(t,i) \in \{0, 1\}$, $t \in [1, t^*]$, $i \in [1, n_{\mathrm{tree}}]$: $v(t,i)$ represents vertex $v_{t,i}$
($v(t,i) = 1$ ⇔ vertex $v_{t,i}$ is used in $H$ and edge $e_{t,i}$ ($i \ge 2$) is used in $H$);
$e(t) \in \{0, 1\}$, $t \in [1, t^* + 1]$: $e(t)$ represents edge $e_t = v_{t-1,1}v_{t,1}$,
where $e_{1,1}$ and $e_{t^*+1,1}$ are fictitious edges ($e(t) = 1$ ⇔ edge $e_t$ is used in $H$);
constraints:
$u(s, 1) = 1$, $s \in [1, s^*]$,
$d_{\max} \cdot u(t, i) \ge \sum_{j \in \mathrm{Cld}(i)} u(t, j)$, $t \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{in}}]$,
$v(t, 1) = 1$, $t \in [1, t^*]$,
$d_{\max} \cdot v(t, i) \ge \sum_{j \in \mathrm{Cld}(i)} v(t, j)$, $t \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{in}}]$,
$\sum_{s \in [1, s^*], i \in [1, n_{\mathrm{tree}}]} u(s, i) + \sum_{t \in [1, t^*], i \in [1, n_{\mathrm{tree}}]} v(t, i) = n^*$,
$\sum_{s \in [1, s^*], i \in \mathrm{Dst}(\mathrm{ch}^*)} u(s, i) + \sum_{t \in [1, t^*], i \in \mathrm{Dst}(\mathrm{ch}^*)} v(t, i) \ge 1$,
$e(1) = e(t^* + 1) = 0$,
$e(t + 1) + \sum_{s \in [1, s^*]} e(t, s) = 1$, $t \in [1, t^*]$,
$e(t) + \sum_{s \in [1, s^*]} e(s, t) = 1$, $t \in [1, t^*]$,
$c^* \ge \chi(1) \ge \chi(2) \ge \cdots \ge \chi(t^*) \ge 1$,
$e(t + 1) \ge 1 + \chi(t + 1) - \chi(t)$, $t \in [1, t^* - 1]$,
$c^* \cdot (1 - e(t + 1)) \ge \chi(t) - \chi(t + 1)$, $t \in [1, t^* - 1]$.
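The last three constraints tie the path edges to the colors: read together, they allow edge $e_{t+1}$ exactly when the consecutive roots $v_{t,1}$ and $v_{t+1,1}$ carry the same color. The following brute-force check is a sketch that assumes the inequality directions as written above; the function name and the list-based encoding of $\chi$ and $e$ are hypothetical.

```python
from itertools import product

def colors_consistent(chi, e_next, c_star):
    """chi[t-1] holds chi(t); e_next[t-1] holds e(t+1) for t in 1..t*-1."""
    if not all(1 <= c <= c_star for c in chi):
        return False
    for t in range(len(chi) - 1):
        if chi[t] < chi[t + 1]:                       # colors are non-increasing
            return False
        if e_next[t] < 1 + chi[t + 1] - chi[t]:       # equal colors force the edge
            return False
        if c_star * (1 - e_next[t]) < chi[t] - chi[t + 1]:  # edge forces equal colors
            return False
    return True

# Exhaustively confirm the equivalence for c* = 3 and t* = 4.
c_star, t_star = 3, 4
equivalence_holds = all(
    all(e[t] == int(chi[t] == chi[t + 1]) for t in range(t_star - 1))
    for chi in product(range(1, c_star + 1), repeat=t_star)
    for e in product((0, 1), repeat=t_star - 1)
    if colors_consistent(chi, e, c_star)
)
```

For instance, the colors (3, 3, 1) admit only the edge pattern (1, 0): the first two roots form one subpath and the third starts another.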

Appendix A.5. Assigning Multiplicity

We prepare an integer variable $\tilde{\beta}(e)$ or $\hat{\beta}(e)$ for each edge $e$ in the $(t^*, \mathrm{ch}^*, d_{\max})$-tree-extension of the scheme graph to denote the multiplicity of $e$ in a selected graph $H$ and include necessary constraints for the variables to satisfy in $H$.
variables:
$\tilde{\beta}(i) \in [0, 3]$, $i \in E_1 \cup E_3$: $\tilde{\beta}(i)$ represents the multiplicity of edge $a_i$,
where $\tilde{\beta}(i) = 0$ if edge $a_i$ is not in $H$;
$\tilde{\beta}(p, i) \in [0, 3]$, $p \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{tree}}]$: $\tilde{\beta}(p, i)$ with $p \le s^*$ (resp., $p > s^*$) represents
the multiplicity of edge $e'_{p,i}$ (resp., $e_{p-s^*,i}$);
$\tilde{\beta}(t, 1) \in [0, 3]$, $t \in [1, t^* + 1]$: $\tilde{\beta}(t, 1)$ represents the multiplicity of edge $e_t$;
$\hat{\beta}(s, t) \in [0, 3]$, $s \in [1, s^*]$, $t \in [1, t^*]$: $\hat{\beta}(s, t)$ represents the multiplicity of edge $u_{s,1}v_{t,1}$;
constraints:
$a(i) = 1$, $i \in E_3$,
$a(i) \le \tilde{\beta}(i) \le 3a(i)$, $i \in E_1 \cup E_3$,
$u(s, i) \le \tilde{\beta}(s, i) \le 3u(s, i)$, $s \in [1, s^*]$, $i \in [2, n_{\mathrm{tree}}]$,
$v(t, i) \le \tilde{\beta}(s^* + t, i) \le 3v(t, i)$, $t \in [1, t^*]$, $i \in [2, n_{\mathrm{tree}}]$,
$e(t) \le \tilde{\beta}(t, 1) \le 3e(t)$, $t \in [1, t^* + 1]$,
$e(s, t) + e(t, s) \le \hat{\beta}(s, t) \le 3e(s, t) + 3e(t, s)$, $s \in [1, s^*]$, $t \in [1, t^*]$.

Appendix A.6. Assigning Chemical Elements and Valence Condition

We include constraints so that each vertex $u$ in a selected graph $H$ satisfies the valence condition; i.e., $\sum_{uv \in E(H)} \beta(uv) \le \mathrm{val}(\alpha(u))$. With these constraints, a rank-2 chemical graph $G = (H, \alpha, \beta)$ on a selected subgraph $H$ will be constructed.
constants:
A set $\Lambda \cup \{\epsilon\}$ of chemical elements, where $\epsilon$ denotes null;
A coding $[a]$, $a \in \Lambda \cup \{\epsilon\}$ such that $[\epsilon] = 0$; $[a] \ge 1$, $a \in \Lambda$; and $[a] \ne [b]$ if $a \ne b$;
Let $[\Lambda]$ and $[\Lambda \cup \{\epsilon\}]$ denote $\{[a] \mid a \in \Lambda\}$ and $\{[a] \mid a \in \Lambda \cup \{\epsilon\}\}$, respectively;
A valence function: $\mathrm{val}: \Lambda \to [1, 4]$;
variables:
$\tilde{\alpha}(p, i) \in [\Lambda \cup \{\epsilon\}]$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$:
$\tilde{\alpha}(p, i)$ with $p \le s^*$ (resp., $p > s^*$) represents $\alpha(u_{p,i})$ (resp., $\alpha(v_{p-s^*,i})$);
$\delta_\alpha(p, i, a) \in \{0, 1\}$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$, $a \in \Lambda \cup \{\epsilon\}$:
$\delta_\alpha(p, i, a) = 1 \Leftrightarrow \alpha(u_{p,i}) = a$ for $p \le s^*$ and $\alpha(v_{p-s^*,i}) = a$ for $p > s^*$;
$\delta_{\tilde{\beta}}(i, k) \in \{0, 1\}$, $i \in E_1 \cup E_3$, $k \in [0, 3]$:
$\delta_{\tilde{\beta}}(i, k) = 1$ ⇔ the multiplicity of edge $a_i$ in $H$ is $k$;
$\delta_{\tilde{\beta}}(p, i, k) \in \{0, 1\}$, $p \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{tree}}]$, $k \in [0, 3]$:
$\delta_{\tilde{\beta}}(p, i, k) = 1$ ⇔ the multiplicity of edge $e'_{p,i}$, $p \le s^*$ (or $e_{p-s^*,i}$, $p > s^*$) in $H$ is $k$;
$\delta_{\tilde{\beta}}(t, 1, k) \in \{0, 1\}$, $t \in [1, t^* + 1]$, $k \in [0, 3]$:
$\delta_{\tilde{\beta}}(t, 1, k) = 1$ ⇔ the multiplicity of edge $e_t$ in $H$ is $k$;
$\delta_{\hat{\beta}}(s, t, k) \in \{0, 1\}$, $s \in [1, s^*]$, $t \in [1, t^*]$, $k \in [0, 3]$:
$\delta_{\hat{\beta}}(s, t, k) = 1$ ⇔ the multiplicity of edge $u_{s,1}v_{t,1}$ in $H$ is $k$;
constraints:
$\sum_{a \in \Lambda \cup \{\epsilon\}} \delta_\alpha(p, i, a) = 1$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$,
$\sum_{a \in \Lambda \cup \{\epsilon\}} [a] \cdot \delta_\alpha(p, i, a) = \tilde{\alpha}(p, i)$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$,
$\sum_{k \in [0, 3]} \delta_{\tilde{\beta}}(i, k) = 1$, $i \in E_1 \cup E_3$,
$\sum_{k \in [1, 3]} k \cdot \delta_{\tilde{\beta}}(i, k) = \tilde{\beta}(i)$, $i \in E_1 \cup E_3$,
$\sum_{k \in [0, 3]} \delta_{\tilde{\beta}}(p, i, k) = 1$, $p \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{tree}}]$,
$\sum_{k \in [1, 3]} k \cdot \delta_{\tilde{\beta}}(p, i, k) = \tilde{\beta}(p, i)$, $p \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{tree}}]$,
$\sum_{k \in [0, 3]} \delta_{\tilde{\beta}}(t, 1, k) = 1$, $t \in [1, t^* + 1]$,
$\sum_{k \in [1, 3]} k \cdot \delta_{\tilde{\beta}}(t, 1, k) = \tilde{\beta}(t, 1)$, $t \in [1, t^* + 1]$,
$\sum_{k \in [0, 3]} \delta_{\hat{\beta}}(s, t, k) = 1$, $s \in [1, s^*]$, $t \in [1, t^*]$,
$\sum_{k \in [0, 3]} k \cdot \delta_{\hat{\beta}}(s, t, k) = \hat{\beta}(s, t)$, $s \in [1, s^*]$, $t \in [1, t^*]$,
$\sum_{i \in E_{1,3}(s)} \tilde{\beta}(i) + \sum_{t \in [1, t^*]} \hat{\beta}(s, t) + \sum_{j \in \mathrm{Cld}(1)} \tilde{\beta}(s, j) \le \sum_{a \in \Lambda} \mathrm{val}(a) \cdot \delta_\alpha(s, 1, a)$, $s \in [1, s^*]$,
$\sum_{s \in [1, s^*]} \hat{\beta}(s, t) + \tilde{\beta}(t, 1) + \tilde{\beta}(t + 1, 1) + \sum_{j \in \mathrm{Cld}(1)} \tilde{\beta}(s^* + t, j) \le \sum_{a \in \Lambda} \mathrm{val}(a) \cdot \delta_\alpha(s^* + t, 1, a)$, $t \in [1, t^*]$,
$\tilde{\beta}(p, i) + \sum_{j \in \mathrm{Cld}(i)} \tilde{\beta}(p, j) \le \sum_{a \in \Lambda} \mathrm{val}(a) \cdot \delta_\alpha(p, i, a)$, $p \in [1, \mathrm{cs}^*]$, $i \in [2, n_{\mathrm{tree}}]$.
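Outside the MILP, the valence condition amounts to a per-vertex sum over incident bond multiplicities. The sketch below checks it on an explicit small graph; the dictionary-based graph encoding, the valence table, and the function name are hypothetical and serve only to illustrate the inequality $\sum_{uv \in E(H)} \beta(uv) \le \mathrm{val}(\alpha(u))$.

```python
# Assumed valences (illustrative only).
VAL = {"C": 4, "N": 3, "O": 2}

def satisfies_valence(elements, bonds, val=VAL):
    """elements: vertex -> element symbol; bonds: (u, v) -> multiplicity."""
    load = {u: 0 for u in elements}
    for (u, v), k in bonds.items():
        load[u] += k
        load[v] += k
    # Each vertex may use at most val(element) bond units; the slack is
    # filled with hydrogen atoms at the final stage.
    return all(load[u] <= val[elements[u]] for u in elements)

elements = {1: "C", 2: "O", 3: "C"}
ok = satisfies_valence(elements, {(1, 2): 2, (1, 3): 1})    # C(=O)-C
bad = satisfies_valence(elements, {(1, 2): 2, (2, 3): 1})   # oxygen overloaded
```

In the second call the oxygen atom would carry three bond units, which exceeds its valence of 2.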

Appendix A.7. Descriptors for Mass, the Numbers of Elements and Bonds

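We include constraints to compute descriptors $\overline{\mathrm{ms}}(G)$, $\mathrm{ce}_a^{\mathrm{co}}(G)$, $\mathrm{ce}_a^{\mathrm{nc}}(G)$ ($a \in \Lambda$), $\mathrm{b}_k(G)$ ($k \in [2, 3]$) and $n_{\mathrm{H}}(G)$ according to the definitions in Section 2.1.2.
constants:
A function $\mathrm{mass}^*: \Lambda \to \mathbb{Z}$; let $\mathrm{mass}(a)$ denote the observed mass of a chemical element $a \in \Lambda$, and
define $\mathrm{mass}^*(a) = \lfloor 10 \cdot \mathrm{mass}(a) \rfloor$;
variables:
$\mathrm{ce}^{\mathrm{co}}(a) \in [0, n^*]$, $a \in \Lambda$;
$\mathrm{ce}^{\mathrm{nc}}(a) \in [0, n^*]$, $a \in \Lambda$;
$\mathrm{Mass} \in \mathbb{Z}$;
$\mathrm{b}^{\mathrm{co}}(k) \in [0, 2n^*]$, $k \in [1, 3]$;
$\mathrm{b}^{\mathrm{nc}}(k) \in [0, 2n^*]$, $k \in [1, 3]$;
$n_{\mathrm{H}} \in [0, 4n^*]$: the number of hydrogen atoms to be included in $G$;
constraints:
$\sum_{p \in [1, \mathrm{cs}^*]} \delta_\alpha(p, 1, a) = \mathrm{ce}^{\mathrm{co}}(a)$, $a \in \Lambda$,
$\sum_{p \in [1, \mathrm{cs}^*], i \in [2, n_{\mathrm{tree}}]} \delta_\alpha(p, i, a) = \mathrm{ce}^{\mathrm{nc}}(a)$, $a \in \Lambda$,
$\sum_{a \in \Lambda} \mathrm{mass}^*(a) (\mathrm{ce}^{\mathrm{co}}(a) + \mathrm{ce}^{\mathrm{nc}}(a)) = \mathrm{Mass}$,
$\sum_{i \in E_1 \cup E_3} \delta_{\tilde{\beta}}(i, k) + \sum_{s \in [1, s^*], t \in [1, t^*]} \delta_{\hat{\beta}}(s, t, k) + \sum_{t \in [2, t^*]} \delta_{\tilde{\beta}}(t, 1, k) = \mathrm{b}^{\mathrm{co}}(k)$, $k \in [1, 3]$,
$\sum_{p \in [1, \mathrm{cs}^*], i \in [2, n_{\mathrm{tree}}]} \delta_{\tilde{\beta}}(p, i, k) = \mathrm{b}^{\mathrm{nc}}(k)$, $k \in [1, 3]$,
$\sum_{a \in \Lambda} \mathrm{val}(a) (\mathrm{ce}^{\mathrm{co}}(a) + \mathrm{ce}^{\mathrm{nc}}(a)) - 2(n^* + 1 + \mathrm{b}^{\mathrm{co}}(2) + \mathrm{b}^{\mathrm{nc}}(2) + 2\mathrm{b}^{\mathrm{co}}(3) + 2\mathrm{b}^{\mathrm{nc}}(3)) = n_{\mathrm{H}}$.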
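The last constraint is a bookkeeping identity: the total valence of the non-hydrogen atoms, minus twice the sum of bond orders over the $n^* + 1$ edges of a rank-2 graph, leaves the number of free valences that hydrogen atoms fill. A quick check on two familiar bicyclic hydrocarbons, as a sketch with an assumed valence table and a hypothetical function name:

```python
# Assumed valences (illustrative only).
VAL = {"C": 4, "N": 3, "O": 2}

def hydrogen_count(element_counts, n_double, n_triple, val=VAL):
    """Hydrogen atoms needed by a rank-2 chemical graph.

    element_counts: element -> number of non-hydrogen atoms (core + non-core)
    n_double, n_triple: numbers of double and triple bonds (core + non-core)
    """
    n_star = sum(element_counts.values())
    total_valence = sum(val[a] * c for a, c in element_counts.items())
    # A connected rank-2 graph on n* vertices has n* + 1 edges; each double
    # (triple) bond contributes one (two) extra bond units.
    bond_order_sum = (n_star + 1) + n_double + 2 * n_triple
    return total_valence - 2 * bond_order_sum

naphthalene = hydrogen_count({"C": 10}, n_double=5, n_triple=0)  # C10H8
decalin = hydrogen_count({"C": 10}, n_double=0, n_triple=0)      # C10H18
```

Both molecules have ten carbon atoms and eleven carbon-carbon bonds; only the five double bonds of naphthalene change the hydrogen count.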

Appendix A.8. Descriptor for the Number of Specified Degree

We include constraints to compute descriptors $\mathrm{dg}_i(G)$ ($i \in [1, 4]$) according to the definitions in Section 2.1.2. We also add constraints so that the maximum degree of a non-core vertex in $H$ is at most 3 (resp., equal to 4) when $d_{\max} = 3$ (resp., $d_{\max} = 4$).
variables:
$\deg(p, i) \in [0, 4]$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$:
$\deg(p, i)$ represents $\deg_H(u_{p,i})$ for $p \le s^*$ or $\deg_H(v_{p-s^*,i})$ for $p > s^*$;
$\delta_{\deg}(p, i, d) \in \{0, 1\}$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$, $d \in [0, 4]$:
$\delta_{\deg}(p, i, d) = 1 \Leftrightarrow \deg(p, i) = d$;
$\mathrm{dg}(d) \in [0, n^*]$, $d \in [1, 4]$;
constraints:
$\sum_{i \in E_{1,3}(s)} a(i) + \sum_{t \in [1, t^*]} (e(s, t) + e(t, s)) + \sum_{j \in \mathrm{Cld}(1)} u(s, j) = \deg(s, 1)$, $s \in [1, s^*]$,
$u(s, i) + \sum_{j \in \mathrm{Cld}(i)} u(s, j) = \deg(s, i)$, $s \in [1, s^*]$, $i \in [2, n_{\mathrm{tree}}]$,
$2 + \sum_{j \in \mathrm{Cld}(1)} v(t, j) = \deg(s^* + t, 1)$, $t \in [1, t^*]$,
$v(t, i) + \sum_{j \in \mathrm{Cld}(i)} v(t, j) = \deg(s^* + t, i)$, $t \in [1, t^*]$, $i \in [2, n_{\mathrm{tree}}]$,
$\sum_{d \in [0, 4]} \delta_{\deg}(p, i, d) = 1$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$,
$\sum_{d \in [1, 4]} d \cdot \delta_{\deg}(p, i, d) = \deg(p, i)$, $p \in [1, \mathrm{cs}^*]$, $i \in [1, n_{\mathrm{tree}}]$,
$\sum_{p \in [1, \mathrm{cs}^*], i \in [1, n_{\mathrm{tree}}]} \delta_{\deg}(p, i, d) = \mathrm{dg}(d)$, $d \in [1, 4]$,
$\sum_{p \in [1, \mathrm{cs}^*], i \in [2, n_{\mathrm{tree}}]} \delta_{\deg}(p, i, 4) \ge 1$ (resp., $= 0$) when $d_{\max} = 4$ (resp., $= 3$).
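For a concrete graph, the degree descriptor $\mathrm{dg}(d)$ is simply a histogram of vertex degrees. The sketch below computes it from an edge list; the function name and the example skeleton (a 10-cycle with one chord, i.e., the carbon skeleton of naphthalene) are illustrative choices, not taken from the paper's data.

```python
from collections import Counter

def degree_descriptor(edges):
    """Return {d: number of vertices of degree d} for d = 1..4."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    histogram = Counter(deg.values())
    return {d: histogram.get(d, 0) for d in range(1, 5)}

# A 10-cycle on vertices 1..10 plus the chord (1, 6): a rank-2 graph.
cycle = [(i, i % 10 + 1) for i in range(1, 11)]
skeleton = cycle + [(1, 6)]
dg = degree_descriptor(skeleton)
```

The skeleton has 10 vertices and 11 edges, consistent with rank 2, and its two fusion vertices are the only ones of degree 3.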

Appendix A.9. Descriptor for the Number of Adjacency-Configurations

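We include constraints to compute descriptors $\mathrm{ac}_{\gamma}^{\mathrm{co}}(G)$ and $\mathrm{ac}_{\gamma}^{\mathrm{nc}}(G)$ ($\gamma = (a,b,k) \in \Gamma$) according to the definitions in Section 2.1.2.
constants:
A set $\Gamma = \Gamma_{<} \cup \Gamma_{=} \cup \Gamma_{>}$ of proper tuples $(a,b,k) \in \Lambda \times \Lambda \times [1,3]$;
The set $\Gamma_0 = \{ (a,b,0) \mid a,b \in \Lambda \cup \{\epsilon\} \}$;
variables:
$\delta_{\tau}(i,\gamma) \in \{0,1\}$, $i \in E_1 \cup E_3$, $\gamma \in \Gamma \cup \Gamma_0$:
$\delta_{\tau}(i,\gamma) = 1 \Leftrightarrow$ edge $a_i$ is assigned tuple $\gamma$; i.e., $\gamma = (\tilde{\alpha}(\mathrm{tail}(i),1), \tilde{\alpha}(\mathrm{head}(i),1), \tilde{\beta}(i))$;
$\delta_{\tau}(t,1,\gamma) \in \{0,1\}$, $t \in [2,t^*]$, $\gamma \in \Gamma \cup \Gamma_0$:
$\delta_{\tau}(t,1,\gamma) = 1 \Leftrightarrow$ edge $e_t$ is assigned tuple $\gamma$; i.e., $\gamma = (\tilde{\alpha}(s^*+t-1,1), \tilde{\alpha}(s^*+t,1), \tilde{\beta}(t,1))$;
$\delta_{\tau}(p,i,\gamma) \in \{0,1\}$, $p \in [1,\mathrm{cs}^*]$, $i \in [2,n_{\mathrm{tree}}]$, $\gamma \in \Gamma \cup \Gamma_0$:
$\delta_{\tau}(p,i,\gamma) = 1 \Leftrightarrow$ edge $e_{p,i}$, $p \leq s^*$ (or $e_{p-s^*,i}$, $p > s^*$) is assigned tuple $\gamma$; i.e.,
$\gamma = (\tilde{\alpha}(p,\mathrm{prt}(i)), \tilde{\alpha}(p,i), \tilde{\beta}(p,i))$;
$\widehat{\delta_{\tau}}(s,t,\gamma) \in \{0,1\}$, $s \in [1,s^*]$, $t \in [1,t^*]$, $\gamma \in \Gamma \cup \Gamma_0$:
$\widehat{\delta_{\tau}}(s,t,\gamma) = 1 \Leftrightarrow$ edge $u_{s,1} v_{t,1}$ is assigned tuple $\gamma$; i.e., $\gamma = (\tilde{\alpha}(s,1), \tilde{\alpha}(s^*+t,1), \hat{\beta}(s,t))$;
$\mathrm{ac}^{\mathrm{co}}(\gamma) \in [0,n^*]$, $\gamma \in \Gamma_{<} \cup \Gamma_{=}$;
$\mathrm{ac}^{\mathrm{nc}}(\gamma) \in [0,n^*]$, $\gamma \in \Gamma_{<} \cup \Gamma_{=}$;
constraints:
\begin{align*}
& \sum_{\gamma \in \Gamma \cup \Gamma_0} \delta_{\tau}(i,\gamma) = 1, && i \in E_1 \cup E_3, \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [a]\, \delta_{\tau}(i,(a,b,k)) = \tilde{\alpha}(\mathrm{tail}(i),1), && i \in E_1 \cup E_3, \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [b]\, \delta_{\tau}(i,(a,b,k)) = \tilde{\alpha}(\mathrm{head}(i),1), && i \in E_1 \cup E_3, \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} k \cdot \delta_{\tau}(i,(a,b,k)) = \tilde{\beta}(i), && i \in E_1 \cup E_3, \\
& \sum_{\gamma \in \Gamma \cup \Gamma_0} \delta_{\tau}(t,1,\gamma) = 1, && t \in [2,t^*], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [a]\, \delta_{\tau}(t,1,(a,b,k)) = \tilde{\alpha}(s^*+t-1,1), && t \in [2,t^*], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [b]\, \delta_{\tau}(t,1,(a,b,k)) = \tilde{\alpha}(s^*+t,1), && t \in [2,t^*], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} k \cdot \delta_{\tau}(t,1,(a,b,k)) = \tilde{\beta}(t,1), && t \in [2,t^*], \\
& \sum_{\gamma \in \Gamma \cup \Gamma_0} \delta_{\tau}(p,i,\gamma) = 1, && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [a]\, \delta_{\tau}(p,i,(a,b,k)) = \tilde{\alpha}(p,\mathrm{prt}(i)), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [b]\, \delta_{\tau}(p,i,(a,b,k)) = \tilde{\alpha}(p,i), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} k \cdot \delta_{\tau}(p,i,(a,b,k)) = \tilde{\beta}(p,i), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{\gamma \in \Gamma \cup \Gamma_0} \widehat{\delta_{\tau}}(s,t,\gamma) = 1, && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [a]\, \widehat{\delta_{\tau}}(s,t,(a,b,k)) = \tilde{\alpha}(s,1), && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} [b]\, \widehat{\delta_{\tau}}(s,t,(a,b,k)) = \tilde{\alpha}(s^*+t,1), && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{(a,b,k) \in \Gamma \cup \Gamma_0} k \cdot \widehat{\delta_{\tau}}(s,t,(a,b,k)) = \hat{\beta}(s,t), && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{i \in E_1 \cup E_3} \bigl( \delta_{\tau}(i,\gamma) + \delta_{\tau}(i,\overline{\gamma}) \bigr) + \sum_{s \in [1,s^*],\, t \in [1,t^*]} \bigl( \widehat{\delta_{\tau}}(s,t,\gamma) + \widehat{\delta_{\tau}}(s,t,\overline{\gamma}) \bigr) \\
& \qquad + \sum_{t \in [2,t^*]} \bigl( \delta_{\tau}(t,1,\gamma) + \delta_{\tau}(t,1,\overline{\gamma}) \bigr) = \mathrm{ac}^{\mathrm{co}}(\gamma), && \gamma \in \Gamma_{<}, \\
& \sum_{i \in E_1 \cup E_3} \delta_{\tau}(i,\gamma) + \sum_{s \in [1,s^*],\, t \in [1,t^*]} \widehat{\delta_{\tau}}(s,t,\gamma) + \sum_{t \in [2,t^*]} \delta_{\tau}(t,1,\gamma) = \mathrm{ac}^{\mathrm{co}}(\gamma), && \gamma \in \Gamma_{=}, \\
& \sum_{p \in [1,\mathrm{cs}^*],\, i \in [2,n_{\mathrm{tree}}]} \bigl( \delta_{\tau}(p,i,\gamma) + \delta_{\tau}(p,i,\overline{\gamma}) \bigr) = \mathrm{ac}^{\mathrm{nc}}(\gamma), && \gamma \in \Gamma_{<},
\end{align*}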
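
The counting performed by the last constraints can be mimicked outside the MILP. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes that a tuple $(a,b,k)$ and its reverse $(b,a,k)$ are counted together, that the elements are compared by a fixed order (here the hypothetical list `order`), and it uses a made-up edge list. It does not distinguish core edges from non-core edges.

```python
from collections import Counter

def adjacency_configurations(edges, order):
    """Count tuples (a, b, k), folding (b, a, k) onto (a, b, k) with a before b."""
    counts = Counter()
    for a, b, k in edges:
        if order.index(a) > order.index(b):
            a, b = b, a  # the reversed tuple is counted as the same configuration
        counts[(a, b, k)] += 1
    return counts

# Hypothetical edges: (element at tail, element at head, bond multiplicity).
edges = [("C", "O", 2), ("O", "C", 2), ("C", "C", 1), ("N", "C", 1), ("C", "C", 1)]
ac = adjacency_configurations(edges, order=["C", "N", "O"])
```

Every edge contributes to exactly one count, which is what the "exactly one tuple per edge" constraints enforce for the indicator variables.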

Appendix A.10. Descriptor for 1-Path Connectivity

We include constraints to compute descriptor $\kappa_1(G)$ according to the definition.
variables:
A real variable $\kappa_1 \geq 0$;
$\delta_{\mathrm{dd}}(i,d,d',\mu) \in \{0,1\}$, $i \in E_1 \cup E_3$, $d,d' \in [0,4]$, $\mu \in \{0,1\}$:
$\delta_{\mathrm{dd}}(i,d,d',\mu) = 1 \Leftrightarrow \deg_H(u_{\mathrm{tail}(i)}) = d$ and $\deg_H(u_{\mathrm{head}(i)}) = d'$,
where $a_i$ is in $H$ if and only if $\mu = 1$;
$\delta_{\mathrm{dd}}(t,1,d,d',\mu) \in \{0,1\}$, $t \in [2,t^*]$, $d,d' \in [0,4]$, $\mu \in \{0,1\}$: $\delta_{\mathrm{dd}}(t,1,d,d',\mu) = 1 \Leftrightarrow$
$\deg_H(v_{t-1,1}) = d$ and $\deg_H(v_{t,1}) = d'$, where $e_t$ is in $H$ if and only if $\mu = 1$;
$\delta_{\mathrm{dd}}(p,i,d,d',\mu) \in \{0,1\}$, $p \in [1,\mathrm{cs}^*]$, $i \in [2,n_{\mathrm{tree}}]$, $d,d' \in [0,4]$, $\mu \in \{0,1\}$: $\delta_{\mathrm{dd}}(p,i,d,d',\mu) = 1 \Leftrightarrow$
$\deg_H(u_{p,\mathrm{prt}(i)}) = d$ and $\deg_H(u_{p,i}) = d'$ for $p \leq s^*$
(or $\deg_H(v_{p-s^*,\mathrm{prt}(i)}) = d$ and $\deg_H(v_{p-s^*,i}) = d'$ for $p > s^*$),
where edge $e_{p,i}$ or $e_{p-s^*,i}$ is in $H$ if and only if $\mu = 1$;
$\widehat{\delta_{\mathrm{dd}}}(s,t,d,d',\mu) \in \{0,1\}$, $s \in [1,s^*]$, $t \in [1,t^*]$, $d,d' \in [0,4]$, $\mu \in \{0,1\}$:
$\widehat{\delta_{\mathrm{dd}}}(s,t,d,d',\mu) = 1 \Leftrightarrow \deg_H(u_{s,1}) = d$ and $\deg_H(v_{t,1}) = d'$,
where $u_{s,1} v_{t,1}$ is in $H$ if and only if $\mu = 1$;
constraints:
\begin{align*}
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \delta_{\mathrm{dd}}(i,d,d',\mu) = 1, && i \in E_1 \cup E_3, \\
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \mu \cdot \delta_{\mathrm{dd}}(i,d,d',\mu) = a(i), && i \in E_1 \cup E_3, \\
& \sum_{d \in [1,4],\, d' \in [0,4],\, \mu \in \{0,1\}} d \cdot \delta_{\mathrm{dd}}(i,d,d',\mu) = \deg(\mathrm{tail}(i),1), && i \in E_1 \cup E_3, \\
& \sum_{d \in [0,4],\, d' \in [1,4],\, \mu \in \{0,1\}} d' \cdot \delta_{\mathrm{dd}}(i,d,d',\mu) = \deg(\mathrm{head}(i),1), && i \in E_1 \cup E_3, \\
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \delta_{\mathrm{dd}}(t,1,d,d',\mu) = 1, && t \in [2,t^*], \\
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \mu \cdot \delta_{\mathrm{dd}}(t,1,d,d',\mu) = e(t), && t \in [2,t^*], \\
& \sum_{d \in [1,4],\, d' \in [0,4],\, \mu \in \{0,1\}} d \cdot \delta_{\mathrm{dd}}(t,1,d,d',\mu) = \deg(s^*+t-1,1), && t \in [2,t^*], \\
& \sum_{d \in [0,4],\, d' \in [1,4],\, \mu \in \{0,1\}} d' \cdot \delta_{\mathrm{dd}}(t,1,d,d',\mu) = \deg(s^*+t,1), && t \in [2,t^*], \\
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \delta_{\mathrm{dd}}(p,i,d,d',\mu) = 1, && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \mu \cdot \delta_{\mathrm{dd}}(s,i,d,d',\mu) = u(s,i), && s \in [1,s^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{d,d' \in [0,4],\, \mu \in \{0,1\}} \mu \cdot \delta_{\mathrm{dd}}(s^*+t,i,d,d',\mu) = v(t,i), && t \in [1,t^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{d \in [1,4],\, d' \in [0,4],\, \mu \in \{0,1\}} d \cdot \delta_{\mathrm{dd}}(p,i,d,d',\mu) = \deg(p,\mathrm{prt}(i)), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{d \in [0,4],\, d' \in [1,4],\, \mu \in \{0,1\}} d' \cdot \delta_{\mathrm{dd}}(p,i,d,d',\mu) = \deg(p,i), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{tree}}], \\
& \sum_{d,d' \in [1,4],\, \mu \in \{0,1\}} \widehat{\delta_{\mathrm{dd}}}(s,t,d,d',\mu) = 1, && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{d,d' \in [1,4],\, \mu \in \{0,1\}} \mu \cdot \widehat{\delta_{\mathrm{dd}}}(s,t,d,d',\mu) = e(s,t) + e(t,s), && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{d \in [1,4],\, d' \in [0,4],\, \mu \in \{0,1\}} d \cdot \widehat{\delta_{\mathrm{dd}}}(s,t,d,d',\mu) = \deg(s,1), && s \in [1,s^*],\ t \in [1,t^*], \\
& \sum_{d \in [0,4],\, d' \in [1,4],\, \mu \in \{0,1\}} d' \cdot \widehat{\delta_{\mathrm{dd}}}(s,t,d,d',\mu) = \deg(s^*+t,1), && s \in [1,s^*],\ t \in [1,t^*], \\
& (1-\xi)\,\kappa_1 \leq \sum_{i \in E_1 \cup E_3,\, d,d' \in [1,4]} \frac{\delta_{\mathrm{dd}}(i,d,d',1)}{\sqrt{d d'}} + \sum_{t \in [2,t^*],\, d,d' \in [1,4]} \frac{\delta_{\mathrm{dd}}(t,1,d,d',1)}{\sqrt{d d'}} \\
& \qquad + \sum_{p \in [1,\mathrm{cs}^*],\, i \in [2,n_{\mathrm{tree}}],\, d,d' \in [1,4]} \frac{\delta_{\mathrm{dd}}(p,i,d,d',1)}{\sqrt{d d'}} + \sum_{s \in [1,s^*],\, t \in [1,t^*],\, d,d' \in [1,4]} \frac{\widehat{\delta_{\mathrm{dd}}}(s,t,d,d',1)}{\sqrt{d d'}} \leq (1+\xi)\,\kappa_1,
\end{align*}
where the tolerance $\xi$ is set to $0.001$.
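
As a sanity check on what these constraints encode, the descriptor can be evaluated directly on a small graph. The sketch below assumes that the 1-path connectivity is the sum of $1/\sqrt{\deg(u)\deg(v)}$ over the edges $uv$ of a simple graph, which is what the final constraint sums term by term; the function names and the toy path graph are hypothetical.

```python
import math

def one_path_connectivity(edges):
    """Sum of 1 / sqrt(deg(u) * deg(v)) over all edges uv of a simple graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(1.0 / math.sqrt(deg[u] * deg[v]) for u, v in edges)

def within_tolerance(kappa, edge_sum, xi=0.001):
    """Two-sided relative bound: (1 - xi) * kappa <= edge_sum <= (1 + xi) * kappa."""
    return (1 - xi) * kappa <= edge_sum <= (1 + xi) * kappa

# Path on four vertices; the degrees are 1, 2, 2, 1.
edges = [(0, 1), (1, 2), (2, 3)]
edge_sum = one_path_connectivity(edges)  # 1/sqrt(2) + 1/2 + 1/sqrt(2)
```

The terms $1/\sqrt{d d'}$ are generally irrational, so a two-sided bound with a small tolerance is a natural way to tie the real variable $\kappa_1$ to the sum inside a linear program; for example, `within_tolerance(1.914, edge_sum)` holds while `within_tolerance(2.0, edge_sum)` does not.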

Appendix A.11. Constraints for Left-Heavy Trees

To reduce the number of rank-2 chemical graphs $G$ that are isomorphic to each other, we include in $\mathcal{C}_2$ some additional constraints so that each subtree $T$ selected from tree $S_p$ or $T_t$ satisfies the following property:
for any two siblings $u(p,j_1)$ and $u(p,j_2)$, $j_1 < j_2$, in $T$, the number of descendants of $u(p,j_1)$ is not smaller than that of $u(p,j_2)$.
For this, we define $\mathrm{dsn}(p,i)$ to be the number of descendants of a vertex $u_{p,i}$ (or $v_{p-s^*,i}$) in a selected graph $H$ and $\eta(p,i) \triangleq 21|\Lambda| \cdot \mathrm{dsn}(p,i) + 20\,\tilde{\alpha}(p,i) + 4\deg(p,i) + \tilde{\beta}(p,i)$, $p \in [1,\mathrm{cs}^*]$, $i \in [2,n_{\mathrm{tree}}]$. We include constraints that compute the values of $\mathrm{dsn}$ recursively.
variables:
$\mathrm{dsn}(p,i) \in [1,n_{\mathrm{tree}}]$, $p \in [1,\mathrm{cs}^*]$, $i \in [1,n_{\mathrm{tree}}]$: the number of descendants of vertex $u_{p,i}$
in tree $S_p$ for $p \leq s^*$ and vertex $v_{p-s^*,i}$ in tree $T_{p-s^*}$ for $p > s^*$;
constraints:
\begin{align*}
& \mathrm{dsn}(s,i) \geq \sum_{j \in \mathrm{Cld}(i)} \mathrm{dsn}(s,j) + u(s,i), && s \in [1,s^*],\ i \in [1,n_{\mathrm{tree}}], \\
& \mathrm{dsn}(s^*+t,i) \geq \sum_{j \in \mathrm{Cld}(i)} \mathrm{dsn}(s^*+t,j) + v(t,i), && t \in [1,t^*],\ i \in [1,n_{\mathrm{tree}}], \\
& \sum_{p \in [1,\mathrm{cs}^*]} \mathrm{dsn}(p,1) \leq n^*, \\
& \eta(p,j_1) \geq \eta(p,j_2), && p \in [1,\mathrm{cs}^*],\ j_1,j_2 \in \mathrm{Cld}(1),\ j_1 < j_2, \\
& \eta(p,j_1) \geq \eta(p,j_2), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{in}}],\ j_1,j_2 \in \mathrm{Cld}(i),\ j_1 < j_2, \text{ for } d_{\max} = 3, \\
& \eta(p,j_1) \geq \eta(p,j_2) \geq \eta(p,j_3), && p \in [1,\mathrm{cs}^*],\ i \in [2,n_{\mathrm{in}}],\ j_1,j_2,j_3 \in \mathrm{Cld}(i),\ j_1 < j_2 < j_3, \text{ for } d_{\max} = 4.
\end{align*}
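
The left-heavy property is easy to test on an explicit rooted tree. The following sketch is illustrative only: the dictionary-of-children representation and the function names are assumptions, a vertex is counted together with its descendants (mirroring the $+u(s,i)$ and $+v(t,i)$ terms), and the sketch compares descendant counts alone, ignoring the tie-breaking terms of $\eta$ (element, degree, and bond multiplicity).

```python
def descendant_counts(children, root):
    """dsn[v] = 1 + sum of dsn over the children of v (v counts itself)."""
    dsn = {}

    def visit(v):
        dsn[v] = 1 + sum(visit(c) for c in children.get(v, []))
        return dsn[v]

    visit(root)
    return dsn

def is_left_heavy(children, root):
    """True iff every sibling has at least as many descendants as the next one."""
    dsn = descendant_counts(children, root)
    for v in dsn:
        kids = children.get(v, [])
        if any(dsn[a] < dsn[b] for a, b in zip(kids, kids[1:])):
            return False
    return True

# Hypothetical trees: children are listed in sibling order j1 < j2 < ...
left_heavy = {1: [2, 3], 2: [4, 5], 3: [6]}
not_left_heavy = {1: [3, 2], 2: [4, 5], 3: [6]}
```

The two example trees are isomorphic and differ only in the order of the root's children, so keeping only the left-heavy one removes a duplicate without losing any structure.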

References

  1. Miyao, T.; Kaneko, H.; Funatsu, K. Inverse QSPR/QSAR analysis for chemical structure generation (from y to x). J. Chem. Inf. Model. 2016, 56, 286–299. [Google Scholar] [CrossRef] [PubMed]
  2. Skvortsova, M.I.; Baskin, I.I.; Slovokhotova, O.L.; Palyulin, V.A.; Zefirov, N.S. Inverse problem in QSAR/QSPR studies for the case of topological indices characterizing molecular shape (Kier indices). J. Chem. Inf. Comput. Sci. 1993, 33, 630–634. [Google Scholar] [CrossRef]
  3. Ikebata, H.; Hongo, K.; Isomura, T.; Maezono, R.; Yoshida, R. Bayesian molecular design with a chemical language model. J. Comput. Aided Mol. Des. 2017, 31, 379–391. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Rupakheti, C.; Virshup, A.; Yang, W.; Beratan, D.N. Strategy to discover diverse optimal molecules in the small molecule universe. J. Chem. Inf. Model. 2015, 55, 529–537. [Google Scholar] [CrossRef] [PubMed]
  5. Fujiwara, H.; Wang, J.; Zhao, L.; Nagamochi, H.; Akutsu, T. Enumerating treelike chemical graphs with given path frequency. J. Chem. Inf. Model. 2008, 48, 1345–1357. [Google Scholar] [CrossRef] [PubMed]
  6. Kerber, A.; Laue, R.; Grüner, T.; Meringer, M. MOLGEN 4.0. Match Commun. Math. Comput. Chem. 1998, 37, 205–208. [Google Scholar]
  7. Li, J.; Nagamochi, H.; Akutsu, T. Enumerating substituted benzene isomers of tree-like chemical graphs. IEEE/ACM Trans. Comput. Biol. Bioinform. 2016, 15, 633–646. [Google Scholar] [CrossRef] [PubMed]
  8. Reymond, J.L. The chemical space project. Accounts Chem. Res. 2015, 48, 722–730. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Akutsu, T.; Fukagawa, D.; Jansson, J.; Sadakane, K. Inferring a Graph From Path Frequency. Discret. Appl. Math. 2012, 160, 1416–1428. [Google Scholar] [CrossRef] [Green Version]
  10. Nagamochi, H. A detachment algorithm for inferring a graph from path frequency. Algorithmica 2009, 53, 207–224. [Google Scholar] [CrossRef]
  11. Fazekas, S.Z.; Ito, H.; Okuno, Y.; Seki, S.; Taneishi, K. On computational complexity of graph inference from counting. Nat. Comput. 2013, 12, 589–603. [Google Scholar] [CrossRef]
  12. Bohacek, R.S.; McMartin, C.; Guida, W.C. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996, 16, 3–50. [Google Scholar] [CrossRef]
  13. Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276. [Google Scholar] [CrossRef] [PubMed]
  14. Segler, M.H.S.; Kogej, T.; Tyrchan, C.; Waller, M.P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2017, 4, 120–131. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Yang, X.; Zhang, J.; Yoshizoe, K.; Terayama, K.; Tsuda, K. ChemTS: An efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017, 18, 972–976. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Kusner, M.J.; Paige, B.; Hernández-Lobato, J.M. Grammar variational autoencoder. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1945–1954. [Google Scholar]
  17. Akutsu, T.; Nagamochi, H. A Mixed Integer Linear Programming Formulation to Artificial Neural Networks. In Proceedings of the 2nd International Conference on Information Science and Systems, Tokyo, Japan, 16–19 March 2019; pp. 215–220. [Google Scholar]
  18. Azam, N.A.; Chiewvanichakorn, R.; Zhang, F.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies, Valletta, Malta, 24–26 February 2020; Volume 3, pp. 101–108. [Google Scholar]
  19. Chiewvanichakorn, R.; Wang, C.; Zhang, Z.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A method for the inverse QSAR/QSPR based on artificial neural networks and mixed integer linear programming. In Proceedings of the ICBBB2020, Kyoto, Japan, 19–22 January 2020. [Google Scholar]
  20. Zhang, F.; Zhu, J.; Chiewvanichakorn, R.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A new integer linear programming formulation to the inverse QSAR/QSPR for acyclic chemical compounds using skeleton trees. In Proceedings of the 33rd International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Kitakyushu, Japan, 22–25 September 2020. [Google Scholar]
  21. Ito, R.; Azam, N.A.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A novel method for the inverse QSAR/QSPR to monocyclic chemical compounds based on artificial neural networks and integer programming, 2020. In Proceedings of the BIOCOMP 2020, Las Vegas, NV, USA, 27–30 July 2020. [Google Scholar]
  22. Suzuki, M.; Nagamochi, H.; Akutsu, T. Efficient enumeration of monocyclic chemical graphs with given path frequencies. J. Cheminform. 2014, 6, 31. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Tezuka, Y.; Oike, H. Topological polymer chemistry. Prog. Polym. Sci. 2002, 27, 1069–1122. [Google Scholar] [CrossRef]
  24. Netzeva, T.I.; Worth, A.P.; Aldenberg, T.; Benigni, R.; Cronin, M.T.; Gramatica, P.; Jaworska, J.S.; Kahn, S.; Klopman, G.; Marchant, C.A.; et al. Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships: The report and recommendations of ECVAM workshop 52. Altern. Lab. Anim. 2005, 33, 155–173. [Google Scholar] [CrossRef] [PubMed]
  25. Tamura, Y.; Nishiyama, Y.; Wang, C.; Sun, Y.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. Enumerating chemical graphs with mono-block 2-augmented tree structure from given upper and lower bounds on path frequencies. arXiv 2020, arXiv:2004.06367. [Google Scholar]
  26. Yamashita, K.; Masui, R.; Zhou, X.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. Enumerating chemical graphs with two disjoint cycles satisfying given path frequency specifications. arXiv 2020, arXiv:2004.08381. [Google Scholar]
Figure 1. An illustration of the three rank-2 polymer topologies $M_1, M_2, M_3 \in \mathcal{PT}(2,4)$.
Figure 2. An illustration of the least simple graphs of the rank-2 polymer topologies $M_1, M_2, M_3 \in \mathcal{PT}(2,4)$ in Figure 1 and a scheme graph $(K, \mathcal{E})$: (a) $S(M_1)$; (b) $S(M_2)$; (c) $S(M_3)$; (d) a scheme graph $(K = (\{u_1, u_2, u_3, u_4\}, E), \mathcal{E} = (E_1, E_2, E_3))$ where each edge $u_i u_j$ is directed from one end-vertex $u_i$ to the other end-vertex $u_j$ with $i < j$, and $E_1 = \{a_1 = (u_1,u_4), a_2 = (u_2,u_3), a_3 = (u_2,u_4)\}$, $E_2 = \{a_4 = (u_1,u_2), a_5 = (u_3,u_4)\}$ and $E_3 = \{a_6 = (u_1,u_2), a_7 = (u_3,u_4)\}$, and the edges in $E_1$ (resp., $E_2$ and $E_3$) are depicted with dashed (resp., dotted and solid) lines.
Figure 3. An illustration of Step 1: A data set $D_\pi$ of chemical graphs $G_i$, $i = 1, 2, \ldots, m$, in a class $\mathcal{G}$ of graphs whose values $a(G_i) \in [\underline{a}, \overline{a}]$ of a chemical property $\pi$ are available.
Figure 4. An illustration of Step 2: Each chemical graph $G \in \mathcal{G}$ is mapped to a vector $f(G)$ in a feature vector space $\mathbb{R}^k$ for some positive integer $k$.
Figure 5. An illustration of Step 3: A prediction function $\psi_{\mathcal{N}}$ from the feature vector space $\mathbb{R}^k$ to the range $[\underline{a}, \overline{a}]$ is constructed based on an ANN $\mathcal{N}$.
Figure 6. An illustration of Step 4: Given a target value $y^* \in [\underline{a}, \overline{a}]$, solving MILP $\mathcal{M}(x, y, g; \mathcal{C}_1, \mathcal{C}_2)$ either delivers a set $F^*$ of vectors $x^* \in AD$ such that $(1-\varepsilon) y^* \leq \psi_{\mathcal{N}}(x^*) \leq (1+\varepsilon) y^*$ or detects that no such vector $x$ exists.
Figure 7. An illustration of Step 5: For each vector $x^* \in F^*$, all chemical graphs $G^* \in \mathcal{G}$ such that $f(G^*) = x^*$ are generated.
Figure 8. An illustration of a tree-extension, where the vertices in $V(K)$ are depicted with gray circles: (a) the structure of the rooted tree $S_s$ rooted at a vertex $u_{s,1}$; (b) the structure of the rooted tree $T_t$ rooted at a vertex $v_{t,1}$; (c) the $(a,b,c)$-tree-extension of the scheme graph in Figure 2d for $a = t^* = 3$, $b = \mathrm{ch}^* = 2$ and $c = d_{\max} = 4$.
Figure 9. (a) An example of an extension of the scheme graph; (b) an example of a rank-2 graph $H$ with $n(H) = 21$, $\mathrm{cs}(H) = 9$, $\mathrm{ch}(H) = 2$ and $\theta(H) = 1$, where the labels of some vertices and edges indicate the corresponding vertices and edges in the $(t^*, \mathrm{ch}^*, d_{\max})$-tree-extension for $\mathrm{cs}^* = \mathrm{cs}(H)$, $\mathrm{ch}^* = \mathrm{ch}(H)$, $s^* = 4$, $t^* = \mathrm{cs}^* - s^*$ and $d_{\max} = 3$; (c) a subgraph $H'$ of the $(t^* = 5, \mathrm{ch}^* = 2, d_{\max} = 3)$-tree-extension isomorphic to the rank-2 graph $H$ in (b).
Figure 10. An illustration of inferred rank-2 chemical graphs $G^*$ with $\theta = 2$: (a) $y^*_{\mathrm{Kow}} = 5$, $\theta = 2$, $n = 30$, core size = 16, core height = 3, $d_{\max} = 4$; (b) $y^*_{\mathrm{Mp}} = 250$, $\theta = 2$, $n = 30$, core size = 16, core height = 2, $d_{\max} = 3$; (c) $y^*_{\mathrm{Bp}} = 150$, $\theta = 2$, $n = 25$, core size = 17, core height = 4, $d_{\max} = 3$; (d) $y^*_{\mathrm{Kow}} = 5$, $y^*_{\mathrm{Mp}} = 150$, $y^*_{\mathrm{Bp}} = 250$, $\theta = 2$, $n = 22$, core size = 14, core height = 3, $d_{\max} = 3$.
Figure 11. An illustration of inferred rank-2 chemical graphs $G^*$: (a) $y^*_{\mathrm{Kow}} = 5$, $\theta = 0$, $n = 30$, core size = 14, core height = 2, $d_{\max} = 3$; (b) $y^*_{\mathrm{Mp}} = 250$, $\theta = 0$, $n = 30$, core size = 16, core height = 2, $d_{\max} = 4$; (c) $y^*_{\mathrm{Bp}} = 150$, $\theta = 0$, $n = 25$, core size = 17, core height = 2, $d_{\max} = 3$.
Figure 12. An illustration of inferred rank-2 chemical graphs $G^*$: (a) $y^*_{\mathrm{Kow}} = 5$, $\theta = 2$, $n = 30$, core size = 15, core height = 5, $d_{\max} = 4$; (b) $y^*_{\mathrm{Mp}} = 250$, $\theta = 2$, $n = 30$, core size = 17, core height = 2, $d_{\max} = 3$; (c) $y^*_{\mathrm{Bp}} = 150$, $\theta = 2$, $n = 25$, core size = 17, core height = 3, $d_{\max} = 3$.
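Table 1. Results of Step 1 in Phase 1.
| $\pi$ | $\lvert D_\pi \rvert$ | $\Lambda$ | $\lvert \Gamma \rvert$ | $[\underline{n}, \overline{n}]$ | $[\underline{\mathrm{cs}}, \overline{\mathrm{cs}}]$ | $[\underline{\mathrm{ch}}, \overline{\mathrm{ch}}]$ | $[\underline{\theta}, \overline{\theta}]$ | $[\underline{a}, \overline{a}]$ |
|---|---|---|---|---|---|---|---|---|
| Kow | 93 | C, N, O | 9 | [9, 31] | [7, 16] | [0, 13] | [−5, 3] | [3.7, 12.2] |
| Mp | 63 | C, N, O | 7 | [9, 31] | [7, 17] | [0, 4] | [−6, 3] | [−80, 300] |
| Bp | 45 | C, N, O, S, P, Cl | 9 | [9, 25] | [7, 15] | [0, 7] | [−4, 3] | [155, 420] |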
Table 2. Results of Steps 2 and 3 in Phase 1.
| $\pi$ | $k$ | Activation | Architecture | L-Time | Test $R^2$ (ave.) | Test $R^2$ (best) |
|---|---|---|---|---|---|---|
| Kow | 37 | relu | (37, 10, 1) | 3.92 | 0.866 | 0.964 |
| Mp | 33 | relu | (33, 10, 1) | 21.68 | 0.805 | 0.916 |
| Bp | 43 | relu | (43, 10, 1) | 11.88 | 0.802 | 0.947 |
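Table 3. Results of Steps 4 and 5 with $d_{\max} = 3$ and $\theta = 2$.
| $\pi$ | $y^*$ | $n^*$ | $\lvert F^* \rvert / \#\mathrm{I}$ | IP-Time | $\#G^*$ | G-Time |
|---|---|---|---|---|---|---|
| Kow | 5 | 15 | 12/12 | 9.96 | 100 | 2236.0 |
| Kow | 5 | 20 | 12/12 | 30.38 | 12 | >1 h |
| Kow | 5 | 25 | 12/12 | 47.57 | 12 | >1 h |
| Kow | 5 | 30 | 12/12 | 69.38 | 12 | >1 h |
| Mp | 150 | 15 | 12/12 | 9.52 | 100 | 2069.0 |
| Mp | 150 | 20 | 12/12 | 22.79 | 12 | >1 h |
| Mp | 150 | 25 | 12/12 | 47.20 | 12 | >1 h |
| Mp | 150 | 30 | 12/12 | 66.90 | 12 | >1 h |
| Bp | 250 | 15 | 11/12 | 9.50 | 100 | 103.5 |
| Bp | 250 | 19 | 12/12 | 19.08 | 12 | >1 h |
| Bp | 250 | 22 | 12/12 | 25.78 | 12 | >1 h |
| Bp | 250 | 25 | 12/12 | 67.64 | 12 | >1 h |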
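Table 4. Results of Steps 4 and 5 with $d_{\max} = 4$ and $\theta = 2$.
| $\pi$ | $y^*$ | $n^*$ | $\lvert F^* \rvert / \#\mathrm{I}$ | IP-Time | $\#G^*$ | G-Time |
|---|---|---|---|---|---|---|
| Kow | 5 | 15 | 11/12 | 31.84 | 100 | 413.8 |
| Kow | 5 | 20 | 12/12 | 69.65 | 12 | >1 h |
| Kow | 5 | 25 | 12/12 | 144.20 | 11 | >1 h |
| Kow | 5 | 30 | 12/12 | 352.01 | 12 | >1 h |
| Mp | 150 | 15 | 9/12 | 20.68 | 100 | 947.4 |
| Mp | 150 | 20 | 11/12 | 73.73 | 11 | >1 h |
| Mp | 150 | 25 | 9/12 | 140.09 | 9 | >1 h |
| Mp | 150 | 30 | 12/12 | 304.04 | 12 | >1 h |
| Bp | 250 | 15 | 7/12 | 28.51 | 100 | 232.7 |
| Bp | 250 | 19 | 11/12 | 82.01 | 11 | >1 h |
| Bp | 250 | 22 | 12/12 | 150.55 | 12 | >1 h |
| Bp | 250 | 25 | 12/12 | 239.84 | 12 | >1 h |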
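Table 5. Results of Steps 4 and 5 with $d_{\max} = 3$ and $\theta = 0$.
| $\pi$ | $y^*$ | $n^*$ | $\lvert F^* \rvert / \#\mathrm{I}$ | IP-Time | $\#G^*$ | G-Time |
|---|---|---|---|---|---|---|
| Kow | 5 | 15 | 12/12 | 11.00 | 100 | 121.1 |
| Kow | 5 | 20 | 12/12 | 25.64 | 12 | >1 h |
| Kow | 5 | 25 | 12/12 | 38.79 | 12 | >1 h |
| Kow | 5 | 30 | 12/12 | 49.65 | 12 | >1 h |
| Mp | 150 | 15 | 12/12 | 8.45 | 100 | 373.4 |
| Mp | 150 | 20 | 12/12 | 18.94 | 12 | >1 h |
| Mp | 150 | 25 | 12/12 | 37.13 | 12 | >1 h |
| Mp | 150 | 30 | 12/12 | 44.74 | 54 | >1 h |
| Bp | 250 | 15 | 9/12 | 8.450 | 100 | 74.2 |
| Bp | 250 | 19 | 11/12 | 16.31 | 11 | >1 h |
| Bp | 250 | 22 | 12/12 | 21.71 | 12 | >1 h |
| Bp | 250 | 25 | 12/12 | 45.80 | 12 | >1 h |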
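Table 6. Results of Steps 4 and 5 with $d_{\max} = 4$ and $\theta = 0$.
| $\pi$ | $y^*$ | $n^*$ | $\lvert F^* \rvert / \#\mathrm{I}$ | IP-Time | $\#G^*$ | G-Time |
|---|---|---|---|---|---|---|
| Kow | 5 | 15 | 9/12 | 36.33 | 100 | 23.2 |
| Kow | 5 | 20 | 12/12 | 82.01 | 12 | >1 h |
| Kow | 5 | 25 | 12/12 | 138.96 | 12 | >1 h |
| Kow | 5 | 30 | 12/12 | 292.79 | 12 | >1 h |
| Mp | 150 | 15 | 9/12 | 19.89 | 100 | 557.6 |
| Mp | 150 | 20 | 11/12 | 63.62 | 11 | >1 h |
| Mp | 150 | 25 | 12/12 | 112.49 | 12 | >1 h |
| Mp | 150 | 30 | 12/12 | 171.11 | 12 | >1 h |
| Bp | 250 | 15 | 3/12 | 34.60 | 100 | 11.2 |
| Bp | 250 | 19 | 6/12 | 203.65 | 6 | >1 h |
| Bp | 250 | 22 | 9/12 | 218.07 | 9 | >1 h |
| Bp | 250 | 25 | 11/12 | 783.80 | 11 | >1 h |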
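Table 7. Results of Steps 4 and 5 with $d_{\max} = 3$ and $\theta = 2$.
| $\pi$ | $y^*$ | $n^*$ | $\lvert F^* \rvert / \#\mathrm{I}$ | IP-Time | $\#G^*$ | G-Time |
|---|---|---|---|---|---|---|
| Kow | 5 | 15 | 12/12 | 11.64 | 100 | 1386.7 |
| Kow | 5 | 20 | 12/12 | 23.84 | 12 | >1 h |
| Kow | 5 | 25 | 12/12 | 33.71 | 12 | >1 h |
| Kow | 5 | 30 | 12/12 | 61.85 | 12 | >1 h |
| Mp | 150 | 15 | 12/12 | 9.80 | 100 | 1614.3 |
| Mp | 150 | 20 | 12/12 | 20.15 | 12 | >1 h |
| Mp | 150 | 25 | 12/12 | 36.42 | 12 | >1 h |
| Mp | 150 | 30 | 12/12 | 40.58 | 12 | >1 h |
| Bp | 250 | 15 | 11/12 | 10.25 | 100 | 1756.1 |
| Bp | 250 | 19 | 12/12 | 16.02 | 12 | >1 h |
| Bp | 250 | 22 | 12/12 | 23.63 | 12 | >1 h |
| Bp | 250 | 25 | 12/12 | 63.84 | 12 | >1 h |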
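Table 8. Results of Steps 4 and 5 with $d_{\max} = 4$ and $\theta = 2$.
| $\pi$ | $y^*$ | $n^*$ | $\lvert F^* \rvert / \#\mathrm{I}$ | IP-Time | $\#G^*$ | G-Time |
|---|---|---|---|---|---|---|
| Kow | 5 | 15 | 11/12 | 28.15 | 100 | 20.3 |
| Kow | 5 | 20 | 12/12 | 71.90 | 12 | >1 h |
| Kow | 5 | 25 | 12/12 | 112.71 | 12 | >1 h |
| Kow | 5 | 30 | 12/12 | 267.21 | 12 | >1 h |
| Mp | 150 | 15 | 9/12 | 22.53 | 100 | 2748.1 |
| Mp | 150 | 20 | 11/12 | 53.44 | 11 | >1 h |
| Mp | 150 | 25 | 12/12 | 143.33 | 12 | >1 h |
| Mp | 150 | 30 | 12/12 | 220.63 | 12 | >1 h |
| Bp | 250 | 15 | 6/12 | 27.33 | 100 | 254.2 |
| Bp | 250 | 19 | 9/12 | 75.50 | 9 | >1 h |
| Bp | 250 | 22 | 11/12 | 133.01 | 11 | >1 h |
| Bp | 250 | 25 | 12/12 | 228.75 | 12 | >1 h |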

Share and Cite

MDPI and ACS Style

Zhu, J.; Wang, C.; Shurbevski, A.; Nagamochi, H.; Akutsu, T. A Novel Method for Inference of Chemical Compounds of Cycle Index Two with Desired Properties Based on Artificial Neural Networks and Integer Programming. Algorithms 2020, 13, 124. https://doi.org/10.3390/a13050124
