A Novel Method for Inference of Chemical Compounds of Cycle Index Two with Desired Properties Based on Artificial Neural Networks and Integer Programming

Zhu, Jianshen; Wang, Chenxi; Shurbevski, Aleksandar; Nagamochi, Hiroshi; Akutsu, Tatsuya

doi:10.3390/a13050124


by Jianshen Zhu ^{1,†}, Chenxi Wang ^{1,†}, Aleksandar Shurbevski ^{1}, Hiroshi Nagamochi ^{1,*} and Tatsuya Akutsu ^{2,*}

^{1} Department of Applied Mathematics and Physics, Kyoto University, Kyoto 606-8501, Japan

^{2} Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan

^{*} Authors to whom correspondence should be addressed.

^{†} These authors contributed equally to this work.

Algorithms 2020, 13(5), 124; https://doi.org/10.3390/a13050124

Submission received: 22 April 2020 / Revised: 13 May 2020 / Accepted: 13 May 2020 / Published: 18 May 2020

(This article belongs to the Special Issue 2020 Selected Papers from Algorithms Editorial Board Members)


Abstract

Inference of chemical compounds with desired properties is important for drug design, chemo-informatics, and bioinformatics, to which various algorithmic and machine learning techniques have been applied. Recently, a novel method has been proposed for this inference problem using both artificial neural networks (ANN) and mixed integer linear programming (MILP). This method consists of the training phase and the inverse prediction phase. In the training phase, an ANN is trained so that the output of the ANN takes a value nearly equal to a given chemical property for each sample. In the inverse prediction phase, a chemical structure is inferred using MILP and enumeration so that the structure can have a desired output value for the trained ANN. However, the framework has been applied only to the case of acyclic and monocyclic chemical compounds so far. In this paper, we significantly extend the framework and present a new method for the inference problem for rank-2 chemical compounds (chemical graphs with cycle index 2). The results of computational experiments using such chemical properties as octanol/water partition coefficient, melting point, and boiling point suggest that the proposed method is much more useful than the previous method.

Keywords:

mixed integer linear programming; QSAR/QSPR; molecular design

1. Introduction

Inference of chemical compounds with desired properties is important for computer-aided drug design. Since drug design is one of the major targets of chemo-informatics and bioinformatics, it is also important in these areas. Indeed, this problem has been extensively studied in chemo-informatics under the name of inverse QSAR/QSPR [1,2], where QSAR/QSPR denotes Quantitative Structure Activity/Property Relationships. Since chemical compounds are usually represented as undirected graphs, this problem is important also from graph theoretic and algorithmic viewpoints.

Inverse QSAR/QSPR is often formulated as an optimization problem to find a chemical graph maximizing (or minimizing) an objective function under various constraints, where objective functions reflect certain chemical activities or properties. In many cases, objective functions are derived from a set of training data consisting of known molecules and their activities/properties using statistical and machine learning methods.

In both forward and inverse QSAR/QSPR, chemical graphs are often represented as vectors of real or integer numbers because it is difficult to directly handle graphs using statistical and machine learning methods. Elements of these vectors are called descriptors in QSAR/QSPR studies, and these vectors correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding optimal or nearly optimal graph structures under given objective functions [1,3,4]. In many cases, inference or enumeration of graph structures from a given feature vector is a crucial subtask in these methods. Various methods have been developed for this enumeration problem [5,6,7,8] and the computational complexity of the inference problem has been analyzed [9,10,11]. On the other hand, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S, may exceed

10^{60}

[12].

As in many other fields, Artificial Neural Network (ANN) and deep learning technologies have recently been applied to inverse QSAR/QSPR. For example, variational autoencoders [13], recurrent neural networks [14,15], and grammar variational autoencoders [16] have been applied. In these approaches, new chemical graphs are generated by solving a kind of inverse problem on neural networks, where the neural networks are trained using known chemical compound/activity pairs. However, the optimality of the solution is not necessarily guaranteed in these approaches. To guarantee optimality, a novel approach has been proposed [17] for ANNs with ReLU activation functions and sigmoid activation functions, using mixed integer linear programming (MILP). In that approach, activation functions on neurons are efficiently encoded as piece-wise linear functions so as to represent ReLU functions exactly and sigmoid functions approximately.

Recently, a new framework has been proposed [18,19,20] by combining two previous approaches: efficient enumeration of tree-like graphs [5] and MILP-based formulation of the inverse problem on ANNs [17]. This combined framework for inverse QSAR/QSPR mainly consists of two phases, one for constructing a prediction function for a chemical property, and the other for constructing graphs based on the inverse of the prediction function. The first phase solves (I) Prediction Problem, where a prediction function

ψ_{N}

on a chemical property

π

is constructed with an ANN

N

using a data set of chemical compounds G and their values

a (G)

of

π

. The second phase solves (II) Inverse Problem, where (II-a) given a target value

y^{*}

of the chemical property

π

, a feature vector

x^{*}

is inferred from the trained ANN

N

so that

ψ_{N} (x^{*})

is close to

y^{*}

and (II-b) then a set of chemical structures

G^{*}

such that

f (G^{*}) = x^{*}

is enumerated. In (II-b) of the above-mentioned previous methods [18,19,20], an MILP is formulated for acyclic chemical compounds. Their methods were applicable only to acyclic chemical graphs (i.e., tree-structured chemical graphs), where the ratio of acyclic chemical graphs in a major chemical database (PubChem) is 2.91%. Afterward, Ito et al. [21] designed a method of inferring monocyclic chemical graphs (chemical graphs with cycle index or rank 1) by formulating a new MILP and using an efficient algorithm for enumerating monocyclic chemical graphs [22]. This still leaves a big limitation because the ratio of acyclic and monocyclic chemical graphs in the chemical database PubChem is only 16.26%.

To break this limitation, we significantly extend the MILP-based approach for inverse QSAR/QSPR so that “rank-2 chemical compounds” (chemical graphs with cycle index or rank 2) can be efficiently handled, where the ratio of chemical graphs with rank at most 2 in the database PubChem is 44.5%. Note that there are three different topological structures, called polymer topologies, over all rank-2 chemical compounds. In particular, we propose a novel MILP formulation for (II-a) along with a new set of descriptors. One big advantage of this new formulation is that an MILP instance has a solution if and only if there exists a rank-2 chemical graph satisfying the given constraints, which significantly reduces redundant search in (II-b). We conducted computational experiments to infer rank-2 chemical compounds for several chemical properties.

The paper is organized as follows. Section 2.1 introduces some notions on graphs, a modeling of chemical compounds, and a choice of descriptors. Section 2.2 reviews the framework for inferring chemical compounds based on ANNs and MILPs. Section 2.3 introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices, where our MILP requires only

O (n)

variables and constraints when the maximum height of subtrees in G is constant. Section 3 reports the results of computational experiments conducted for chemical properties such as octanol/water partition coefficient, melting point, and boiling point. Section 4 makes some concluding remarks. Appendix A provides the details of all variables and constraints in our MILP formulation.

2. Materials and Methods

2.1. Preliminary

This section introduces some notions and terminology on graphs, a modeling of chemical compounds, and our choice of descriptors.

2.1.1. Multigraphs and Graphs

Let

R

and

Z

denote the sets of reals and non-negative integers, respectively. For two integers a and b, let

[a, b]

denote the set of integers i with

a \leq i \leq b

.

Multigraphs

A multigraph is defined to be a pair

(V, E)

of a vertex set V and an edge set E such that each edge

e \in E

joins two vertices

u, v \in V

(possibly

u = v

) and the vertices u and v are called the end-vertices of the edge e, and let

V (e)

denote the set of the end-vertices of an edge

e \in E

, where an edge e with

| V (e) | = 1

is called a loop. We denote the vertex and edge sets of a multigraph M by

V (M)

and

E (M)

, respectively. A path with end-vertices u and v is called a

u, v

-path, and the length of a path is defined to be the number of edges in the path.

Let M be a multigraph. An edge

e \in E (M)

is called multiple (to an edge

e^{'} \in E (M)

) if there is another edge

e^{'} \in E (M)

with

V (e) = V (e^{'})

. For a vertex

v \in V (M)

, the set of neighbors of v in M is denoted by

N_{M} (v)

, and the degree

\deg_{M} (v)

of v is defined to be the number of times an edge in

E (M)

is incident to v; i.e.,

\deg_{M} (v) = | {e \in E (M) ∣ v \in V (e), | V (e) | = 2} | + 2 | {e \in E (M) ∣ v \in V (e), | V (e) | = 1} |

. A multigraph is called simple if it has no loop and there is at most one edge between any two vertices. We observe that the sum of the degrees over all vertices is twice the number of edges in any multigraph M; i.e.,

2 | E (M) | = \sum_{v \in V (M)} \deg_{M} (v) .
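As a small illustration of the degree convention above, the following sketch counts degrees in a multigraph, with each loop contributing 2 to its end-vertex, and checks the degree-sum identity on a toy instance. The function name `degrees` and the example multigraph are assumptions made for this sketch and do not come from the paper.

```python
from collections import Counter

def degrees(vertices, edges):
    """Degree of every vertex in a multigraph.

    `edges` is a list of (u, v) pairs; a pair with u == v is a loop and
    contributes 2 to the degree of u, as in the definition above.
    """
    deg = Counter({v: 0 for v in vertices})
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1  # for a loop (u == v) this adds the second unit to u
    return dict(deg)

# Hypothetical example: a double edge between "a" and "b", plus a loop at "b".
V = ["a", "b"]
E = [("a", "b"), ("a", "b"), ("b", "b")]
deg = degrees(V, E)

# Degree-sum identity: 2|E(M)| equals the sum of all degrees.
assert sum(deg.values()) == 2 * len(E)
```

Storing edges as a list rather than a set keeps multiple edges distinct, which is what the multigraph definition requires.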

For a subset X of vertices in M, let

M - X

denote the multigraph obtained from M by removing the vertices in X and any edge incident to a vertex in X. An operation of subdividing a non-loop edge (resp., loop)

e \in E (M)

with

V (e) = {v_{1}, v_{2}}

(resp.,

V (e) = {v_{1} = v_{2}}

) is to replace e with two new edges

e_{1}

and

e_{2}

incident to a new vertex

v_{e}

such that each

e_{i}

is incident to

v_{i}

. An operation of contracting a vertex u of degree 2 in M is to replace the two edges

u v

and

u v^{'}

incident to u with a single edge

v v^{'}

removing vertex u, where the resulting edge is a loop when

v = v^{'}

. The rank

r (M)

of a multigraph M is defined to be the minimum number of edges to be removed to make the multigraph acyclic. We call a multigraph M with

r (M) = k

a rank-k graph. Let

V_{\deg, i} (M)

denote the set of vertices of degree i in M. The core

Cr (M)

of M is defined to be an induced subgraph

M^{*}

that is obtained from

M^{'} : = M

by setting

M^{'} : = M^{'} - V_{\deg, 1} (M^{'})

repeatedly until

M^{*}

contains at most two vertices or consists of vertices of degree at least 2. The core

M^{*}

of a connected multigraph M consists of a single vertex (resp., two vertices) if and only if M is a tree with an even (resp., odd) diameter. A vertex (resp., an edge) in M is called a core vertex (resp., core edge) if it is contained in the core of M and is called a non-core vertex (resp., non-core edge) otherwise. The core size

cs (M)

is defined to be the number of core vertices of M, and the core height

ch (M)

is defined to be the maximum length of a path from a vertex

v \in V (M^{*})

to a leaf of M without passing through any core edge. The set of non-core edges induces a collection of subtrees, each of which we call a non-core component of M, where each non-core component C contains exactly one core vertex

v_{C}

and we regard C as a tree rooted at

v_{C}

. Let C be a non-core component of M. The height

height (v)

of a vertex v in C is defined to be the maximum length of a path from v to a leaf u in the descendants of v.
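The notions of rank, core, core size, and core height can be made concrete with a short sketch for connected simple graphs. It relies on the standard fact that the rank of a connected graph equals |E| - |V| + 1, and it obtains the core by repeatedly stripping all degree-1 vertices. The helper names (`make_adj`, `rank`, `core`, `core_height`) and the example graph are assumptions of this sketch, not part of the paper's method.

```python
from collections import deque

def make_adj(edges):
    """Adjacency sets of a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def rank(adj):
    """Rank of a connected simple graph: |E| - |V| + 1."""
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    return n_edges - len(adj) + 1

def core(adj):
    """Core vertex set: remove all degree-1 vertices repeatedly, stopping
    when at most two vertices remain or no degree-1 vertex is left."""
    live = {v: set(nb) for v, nb in adj.items()}
    while len(live) > 2:
        leaves = [v for v, nb in live.items() if len(nb) == 1]
        if not leaves:
            break
        for v in leaves:
            for w in live.pop(v):
                if w in live:
                    live[w].discard(v)
    return set(live)

def core_height(adj, core_vertices):
    """Longest path from a core vertex into the non-core trees."""
    depth = {v: 0 for v in core_vertices}
    queue = deque(core_vertices)
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in depth:  # non-core vertex reached for the first time
                depth[w] = depth[u] + 1
                queue.append(w)
    return max(depth.values())

# Hypothetical rank-2 graph: two triangles sharing vertex 0, and a pendant
# chain 2-5-6 forming one non-core component rooted at core vertex 2.
edges = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0), (2, 5), (5, 6)]
adj = make_adj(edges)
core_vertices = core(adj)
```

For this example the rank is 2, the core size is 5, and the core height is 2, since the chain 2-5-6 is the only non-core component.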

A multigraph is called a polymer topology if it is connected and the degree of every vertex is at least 3. Tezuka and Oike [23] pointed out that a classification of polymer topologies will lay a foundation for elucidation of structural relationships between different macro-chemical molecules and their synthetic pathways. For integers

r \geq 0

and

d \geq 3

, let

PT (r, d)

denote the set of all rank-r polymer topologies with maximum degree at most d. Figure 1 illustrates the three rank-2 polymer topologies in

PT (2, 4)

.

For a polymer topology M, the least simple graph

S (M)

of M is defined to be a simple graph obtained from M by subdividing each loop in M with two new vertices of degree 2 and subdividing all multiple edges (except for one) between every two adjacent vertices in M. Note that

| V (S (M)) | = | V (M) | + r + s

for the rank r of M and the number s of loops in M.

The polymer topology

Pt (M)

of a multigraph M with

r (M) \geq 2

is defined to be a multigraph

M^{'}

of degree at least 3 that is obtained from the core

Cr (M)

by contracting all vertices of degree 2. Note that

r (Pt (M)) = r (M)

. Figure 2a–c illustrate the least simple graph

S (M)

of each polymer topology

M \in PT (2, 4)

, where Figure 2d illustrates a graph that contains all least simple graphs.

Graphs

Let

H = (V, E)

be a graph with a set V of vertices and a set E of edges. Define the 1-path connectivity

κ_{1} (H)

of H to be

\sum_{u v \in E} 1 / \sqrt{\deg_{H} (u) \deg_{H} (v)}

.
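The 1-path connectivity can be computed directly from an edge list. The sketch below is a minimal illustration for simple graphs; the function name `kappa1` is an assumption of this sketch.

```python
import math

def kappa1(edges):
    """Sum over edges uv of 1 / sqrt(deg(u) * deg(v)) in a simple graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(1.0 / math.sqrt(deg[u] * deg[v]) for u, v in edges)

# Path on three vertices: degrees are 1, 2, 1, so each edge contributes 1/sqrt(2).
path3 = [(0, 1), (1, 2)]
# Cycle on four vertices: every degree is 2, so each edge contributes 1/2.
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
```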

Let H be a rank-2 connected graph such that the maximum degree is at most 4. We see that H contains two vertices

v_{a}

and

v_{b}

such that either there are three disjoint paths between

v_{a}

and

v_{b}

or H contains two edge disjoint cycles C and

C^{'}

, which are joined with a path between

v_{a}

and

v_{b}

(possibly

v_{a} = v_{b}

). We introduce the topological parameter

θ (H)

of rank-2 connected graph H as follows. When H has three disjoint paths between

v_{a}

and

v_{b}

, define

θ (H)

to be the minimum number of edges along a path between

v_{a}

and

v_{b}

. When H contains two edge disjoint cycles C and

C^{'}

, which are joined with a path P between

v_{a}

and

v_{b}

(possibly

v_{a} = v_{b}

), define

θ (H)

to be

- | E (P) |

.
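One way to evaluate the topological parameter is to work on the core of H, where every vertex has degree at least 2. In a rank-2 core there is either a single vertex of degree 4 (two cycles sharing that vertex, so the joining path has no edges) or exactly two vertices of degree 3. The following sketch follows the three core paths leaving one degree-3 vertex and distinguishes the two remaining cases. It assumes the input is the core of a rank-2 simple graph; the names `make_adj` and `theta` are hypothetical.

```python
def make_adj(edges):
    """Adjacency sets of a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def theta(core_adj):
    """Topological parameter, computed on the core of a rank-2 simple graph."""
    branch = [v for v, nb in core_adj.items() if len(nb) >= 3]
    if len(branch) == 1:
        # Two cycles share one degree-4 vertex: v_a = v_b and |E(P)| = 0.
        return 0
    v_a, v_b = branch
    ends = []
    for first in core_adj[v_a]:
        # Walk along degree-2 vertices until the next branching vertex.
        prev, cur, length = v_a, first, 1
        while len(core_adj[cur]) == 2:
            nxt = next(w for w in core_adj[cur] if w != prev)
            prev, cur, length = cur, nxt, length + 1
        ends.append((cur, length))
    to_b = [length for end, length in ends if end == v_b]
    if len(to_b) == 3:
        return min(to_b)   # three disjoint v_a, v_b-paths
    return -to_b[0]        # two cycles joined by a single path P

# Hypothetical examples of the three rank-2 shapes.
three_paths = make_adj([(0, 1), (0, 2), (2, 1), (0, 3), (3, 4), (4, 1)])
dumbbell = make_adj([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (0, 6), (6, 3)])
figure_eight = make_adj([(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)])
```

A positive value therefore signals the three-path structure, while a non-positive value signals two edge-disjoint cycles, which lets a single integer descriptor encode both the topology and one length.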

For positive integers

a, b

and c with

b \geq 2

, let

T (a, b, c)

denote the rooted tree such that the number of children of the root is a, the number of children of each non-root internal vertex is b and the distance from the root to each leaf is c. In the rooted tree

T (a, b, c)

, we denote the vertices by

v_{1}, v_{2}, \dots, v_{n}

(

n = a (b^{c} - 1) / (b - 1) + 1

) with a breadth-first-search order, and denote the edge between a vertex

v_{i}

with

i \in [2, n]

and its parent by

e_{i}

. For each vertex

v_{i}

in

T (a, b, c)

, let

Cld (i)

denote the set of indices j such that

v_{j}

is a child of

v_{i}

, and

prt (i)

denote the index j such that

v_{j}

is the parent of

v_{i}

when

i \in [2, n]

.
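The rooted tree T(a, b, c) and its breadth-first indexing can be generated level by level. The sketch below builds the maps corresponding to prt(i) and Cld(i) and can be used to check the closed form for n. The function name `build_tree` and the dictionary representation are assumptions of this sketch.

```python
def build_tree(a, b, c):
    """Return (n, prt, cld) for T(a, b, c), vertices numbered 1..n in BFS order."""
    prt = {}        # prt[i] = index of the parent of v_i, for i >= 2
    cld = {1: []}   # cld[i] = indices of the children of v_i
    level = [1]     # indices of the vertices at the current depth
    n = 1
    for depth in range(c):
        fanout = a if depth == 0 else b  # the root has a children, others have b
        nxt = []
        for i in level:
            for _ in range(fanout):
                n += 1
                prt[n] = i
                cld[i].append(n)
                cld[n] = []
                nxt.append(n)
        level = nxt
    return n, prt, cld

n, prt, cld = build_tree(2, 3, 2)
# Closed form from the text: n = a (b^c - 1) / (b - 1) + 1.
assert n == 2 * (3**2 - 1) // (3 - 1) + 1
```

For T(2, 3, 2) the root v_1 has children v_2 and v_3, and the six leaves are v_4 to v_9.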

2.1.2. Modeling of Chemical Compounds

Chemical Graphs

We represent the graph structure of a chemical compound as a graph with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. Nearly 68.5% (resp., 99%) of the rank-2 chemical graphs with at most 200 non-hydrogen atoms registered in chemical database PubChem have a maximum degree at most 3 (resp., 4) for all non-core vertices in the hydrogen-suppressed model.

Let

Λ

be a set of labels, each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that

Λ

does not contain H (hydrogen). Let

mass (a)

and

val (a)

denote the mass and valence of a chemical element

a \in Λ

, respectively. In our model, we use integers

{mass}^{*} (a) = ⌊ 10 \cdot mass (a) ⌋

,

a \in Λ

and assume that each chemical element

a \in Λ

has a unique valence

val (a) \in [1, 4]

.

We introduce a total order < over the elements in

Λ

according to their mass values; i.e., we write

a < b

for chemical elements

a, b \in Λ

with

mass (a) < mass (b)

. Choose a set

Γ_{<}

of tuples

γ = (a, b, k) \in Λ \times Λ \times [1, 3]

such that

a < b

. For a tuple

γ = (a, b, k) \in Λ \times Λ \times [1, 3]

, let

\bar{γ}

denote the tuple

(b, a, k)

. Set

Γ_{>} = {\bar{γ} ∣ γ \in Γ_{<}}

,

Γ_{=} = {(a, a, k) ∣ a \in Λ, k \in [1, 3]}

and

Γ = Γ_{<} \cup Γ_{=}

. A pair of atoms

a

and

b

joined with a bond of multiplicity k is denoted by a tuple

γ = (a, b, k) \in Γ

, called the adjacency-configuration of the atom pair.

We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage. A chemical graph over

Λ

and

Γ

is defined to be a tuple

G = (H, α, β)

of a graph

H = (V, E)

, a function

α : V \to Λ

and a function

β : E \to [1, 3]

such that

(i): H is connected;
(ii): $\sum_{u v \in E} β (u v) \leq val (α (u))$ for each vertex $u \in V$ ; and
(iii): $(α (u), α (v), β (u v)) \in Γ$ for each edge $u v \in E$ .

Let

G (Λ, Γ)

denote the set of chemical graphs over

Λ

and

Γ

.
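Conditions (i) to (iii) translate into a direct validity check. The sketch below is only an illustration under stated assumptions: the element set, the mass values (standard atomic weights), the valences, and the chosen set of adjacency-configurations are example inputs, and the names `config` and `is_chemical_graph` are hypothetical.

```python
MASS = {"C": 12.011, "N": 14.007, "O": 15.999}  # illustrative values
VAL = {"C": 4, "N": 3, "O": 2}                  # assumed unique valences

def config(a, b, k):
    """Orient an atom pair so that the lighter element comes first."""
    return (a, b, k) if MASS[a] <= MASS[b] else (b, a, k)

def is_chemical_graph(alpha, beta, gamma):
    """Check conditions (i)-(iii) for G = (H, alpha, beta).

    alpha: dict vertex -> element; beta: dict (u, v) -> multiplicity in 1..3;
    gamma: set of allowed tuples (a, b, k) with a not heavier than b.
    """
    adj = {v: [] for v in alpha}
    load = {v: 0 for v in alpha}
    for (u, v), k in beta.items():
        adj[u].append(v)
        adj[v].append(u)
        load[u] += k
        load[v] += k
        if config(alpha[u], alpha[v], k) not in gamma:  # condition (iii)
            return False
    if any(load[v] > VAL[alpha[v]] for v in alpha):      # condition (ii)
        return False
    # Condition (i): depth-first search from an arbitrary vertex.
    start = next(iter(alpha))
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(alpha)

# Hypothetical hydrogen-suppressed skeleton C-C(=O)-O.
gamma = {("C", "C", 1), ("C", "O", 1), ("C", "O", 2)}
alpha = {0: "C", 1: "C", 2: "O", 3: "O"}
beta_ok = {(0, 1): 1, (1, 2): 2, (1, 3): 1}
beta_bad = {(0, 1): 1, (1, 2): 2, (1, 3): 2}  # carbon 1 would carry 5 bonds
```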

Descriptors

In our method, we use only graph-theoretical descriptors for defining a feature vector, which facilitates our designing an algorithm for constructing graphs. Given a chemical graph

G = (H, α, β)

, we define a feature vector

f (G)

that consists of the following 14 kinds of descriptors:

-: $n (G)$ : the number of vertices in G;
-: $cs (G)$ : the core size of G;
-: $ch (G)$ : the core height of G;
-: $κ_{1} (G)$ : the 1-path connectivity of G;
-: ${dg}_{i} (G)$ ( $i \in [1, 4]$ ): the number of vertices of degree i in G;
-: ${ce}_{a}^{co} (G)$ ( $a \in Λ)$ : the number of core vertices with chemical element $a \in Λ$ ;
-: ${ce}_{a}^{nc} (G)$ ( $a \in Λ)$ : the number of non-core vertices with chemical element $a \in Λ$ ;
-: $\bar{ms} (G)$ : the average of ${mass}^{*}$ of atoms in G;
-: $b_{k}^{co} (G)$ ( $k \in [2, 3]$ ): the number of double and triple bonds in core edges;
-: $b_{k}^{nc} (G)$ ( $k \in [2, 3]$ ): the number of double and triple bonds in non-core edges;
-: ${ac}_{γ}^{co} (G)$ ( $γ = (a, b, k) \in Γ$ ): the number of adjacency-configurations $(a, b, k)$ of core edges;
-: ${ac}_{γ}^{nc} (G)$ ( $γ = (a, b, k) \in Γ$ ): the number of adjacency-configurations $(a, b, k)$ of non-core edges;
-: $θ (H)$ : the topological parameter of H; and
-: $n_{H} (G)$ : the number of hydrogen atoms to be included in G; i.e.,
$n_{H} (G) = \sum_{a \in Λ} val (a) ({ce}_{a}^{co} (G) + {ce}_{a}^{nc} (G)) - 2 (n (G) + 1 + b_{2}^{co} (G) + b_{2}^{nc} (G) + 2 b_{3}^{co} (G) + 2 b_{3}^{nc} (G))$ .

The number k of descriptors in our feature vector

x = f (G)

is

k = 2 | Λ | + 2 | Γ | + 15

.
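The descriptor count and the hydrogen formula can be checked numerically. In the formula for n_H, the term n(G) + 1 is the number of edges of a rank-2 graph, and each double (resp., triple) bond adds one (resp., two) further units of bond multiplicity. The sketch below evaluates both expressions. The function names and the naphthalene test case (written with alternating double bonds, all atoms in the core) are assumptions chosen for illustration and are not taken from the paper.

```python
def num_descriptors(n_elements, n_configs):
    """k = 2|Lambda| + 2|Gamma| + 15."""
    return 2 * n_elements + 2 * n_configs + 15

def hydrogen_count(val, ce_co, ce_nc, n, b2_co, b2_nc, b3_co, b3_nc):
    """n_H(G): total valence minus twice the total bond multiplicity."""
    total_valence = sum(
        val[a] * (ce_co.get(a, 0) + ce_nc.get(a, 0)) for a in val
    )
    multiplicity_sum = n + 1 + b2_co + b2_nc + 2 * (b3_co + b3_nc)
    return total_valence - 2 * multiplicity_sum

# Hypothetical example: naphthalene, 10 core carbons, 11 bonds, 5 of them double.
n_h = hydrogen_count(
    val={"C": 4}, ce_co={"C": 10}, ce_nc={},
    n=10, b2_co=5, b2_nc=0, b3_co=0, b3_nc=0,
)
```

The result agrees with the molecular formula C10H8 of naphthalene.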

2.2. A Method for Inferring Chemical Graphs

This section reviews the framework that solves the inverse QSAR/QSPR by using MILPs [18]. For a specified chemical property

π

such as boiling point, we denote by

a (G)

the observed value of the property

π

for a chemical compound G. As the Phase 1, we solve (I) Prediction Problem with the following three steps.

Phase 1.

Step 1: Let

DB

be a set of chemical graphs. For a specified chemical property

π

, choose a class

G

of graphs such as acyclic graphs or monocyclic graphs. Prepare a data set

D_{π} = {G_{i} ∣ i = 1, 2, \dots, m} \subseteq G \cap DB

such that the value

a (G_{i})

of each chemical graph

G_{i}

,

i = 1, 2, \dots, m

is available. Set reals

\underset{̲}{a}, \bar{a} \in R

so that

\underset{̲}{a} \leq a (G_{i}) \leq \bar{a}

,

i = 1, 2, \dots, m

. See Figure 3 for an illustration of Step 1.

Step 2: Introduce a feature function

f : G \to R^{k}

for a positive integer k. We call

f (G)

the feature vector of

G \in G

, and call each entry of a vector

f (G)

a descriptor of G. See Figure 4 for an illustration of Step 2.

Step 3: Construct a prediction function

ψ_{N}

with an ANN

N

that, given a vector in

R^{k}

, returns a real in the range

[\underset{̲}{a}, \bar{a}]

so that

ψ_{N} (f (G))

takes a value nearly equal to

a (G)

for many chemical graphs in

D_{π}

. See Figure 5 for an illustration of Step 3.

Next we explain how to solve the inverse problem to the prediction in Phase 1 using an MILP formulation. A vector

x \in R^{k}

is called admissible if there is a graph

G \in G

such that

f (G) = x

[18]. Let

A

denote the set of admissible vectors

x \in R^{k}

. In this paper, we use the range-based method to define an applicability domain (AD) [24] to our inverse QSAR/QSPR. Set

\underset{̲}{x_{j}}

and

\bar{x_{j}}

to be the minimum and maximum values of the j-th descriptor

x_{j}

in

f (G_{i})

over all graphs

G_{i}

,

i = 1, 2, \dots, m

(where we possibly normalize some descriptors such as

{ce}_{a}^{co} (G)

, which is normalized with

{ce}_{a}^{co} (G) / n (H)

). Define our AD

D

to be the set of vectors

x \in R^{k}

such that

\underset{̲}{x_{j}} \leq x_{j} \leq \bar{x_{j}}

for the variable

x_{j}

of each j-th descriptor,

j = 1, 2, \dots, k

. As the second phase, we solve (II) Inverse Problem for the inverse QSAR/QSPR by treating the following inference problems.

(II-a) Inference of Vectors

Input: A real

y^{*} \in [\underset{̲}{a}, \bar{a}]

.

Output: Vectors

x^{*} \in A \cap D

and

g^{*} \in R^{h}

such that

ψ_{N} (x^{*}) = y^{*}

and

g^{*}

forms a chemical graph

G^{*} \in G

with

f (G^{*}) = x^{*}

.

(II-b) Inference of Graphs

Input: A vector

x^{*} \in A \cap D

.

Output: All graphs

G^{*} \in G

such that

f (G^{*}) = x^{*}

.
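The range-based applicability domain used in both problems amounts to a per-descriptor bounding box around the training feature vectors. The following sketch fits such a box and tests membership. The function names and the toy feature vectors are assumptions of this sketch, and any normalization of descriptors is assumed to have been applied beforehand.

```python
def fit_domain(feature_vectors):
    """Per-descriptor minimum and maximum over the training feature vectors."""
    lo = [min(col) for col in zip(*feature_vectors)]
    hi = [max(col) for col in zip(*feature_vectors)]
    return lo, hi

def in_domain(x, lo, hi):
    """True if every descriptor x_j lies between its training bounds."""
    return all(l <= xj <= h for xj, l, h in zip(x, lo, hi))

# Hypothetical training vectors with three descriptors each.
train = [[10, 0.2, 3], [14, 0.5, 1], [12, 0.4, 2]]
lo, hi = fit_domain(train)
```

Since each bound is a pair of linear inequalities on one variable, membership in the domain can be imposed inside an MILP without introducing extra integer variables.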

To treat Problem (II-a), we use MILPs for inferring vectors in ANNs [17]. In MILPs, we can easily impose additional linear constraints or fix some variables to specified constants. We include into the MILP a linear constraint such that

x \in D

to obtain the next result.

Theorem 1.

Let

N

be an ANN with a piecewise-linear activation function for an input vector

x \in R^{k}

,

n_{A}

denote the number of nodes in the architecture and

n_{B}

denote the total number of break-points over all activation functions. Then there is an MILP

M (x, y; C_{1})

that consists of variable vectors

x \in D (\subseteq R^{k})

,

y \in R

, and an auxiliary variable vector

z \in R^{p}

for some integer

p = O (n_{A} + n_{B})

and a set

C_{1}

of

O (n_{A} + n_{B})

constraints on these variables such that:

ψ_{N} (x^{*}) = y^{*}

if and only if there is a vector

(x^{*}, y^{*})

feasible to

M (x, y; C_{1})

.

See Appendix A for the set of constraints to define our AD

D

in the MILP

M (x, y; C_{1})

in Theorem 1.
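To give some intuition for why a piecewise-linear activation can be captured by linear constraints with a binary variable, the sketch below checks a standard textbook big-M encoding of a single ReLU neuron, which has one break-point. This is not necessarily the encoding used in [17] or in Appendix A. The constraint set, the bound `big_m`, and the function names are assumptions made for illustration. For an input x with |x| at most M and a binary z, the constraints y >= 0, y >= x, y <= x + M(1 - z), and y <= Mz admit exactly one value of y.

```python
def relu_feasible(x, y, z, big_m, tol=1e-9):
    """Check the four big-M constraints of a ReLU neuron for given values."""
    return (
        y >= -tol
        and y >= x - tol
        and y <= x + big_m * (1 - z) + tol
        and y <= big_m * z + tol
    )

def feasible_outputs(x, big_m, grid):
    """All grid values y that are feasible for some binary z."""
    return sorted(
        {y for y in grid for z in (0, 1) if relu_feasible(x, y, z, big_m)}
    )

# Brute-force check on a coarse grid in [-4, 4] with M = 4.
grid = [i / 2 for i in range(-8, 9)]
checks = [feasible_outputs(x, 4.0, grid) == [max(0.0, x)] for x in grid]
```

Every entry of `checks` is true on this grid, so in this toy setting the feasible set coincides with the graph of the ReLU function.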

To attain the admissibility of inferred vector

x^{*}

, we also introduce a variable vector

g \in R^{q}

for some integer q and a set

C_{2}

of constraints on x and g such that

x^{*} \in A

holds in the following sense:

(x^{*}, g^{*})

is feasible to the MILP

M (x, g; C_{2})

if and only if

g^{*}

forms a chemical graph

G^{*} \in G

with

f (G^{*}) = x^{*}

. Phase 2 consists of the next two steps.

Phase 2.

Step 4: Formulate Problem (II-a) as the above MILP

M (x, y, g; C_{1}, C_{2})

based on

G

and

N

. Find a set

F^{*}

of vectors

x^{*} \in A \cap D

such that

(1 - ε) y^{*} \leq ψ_{N} (x^{*}) \leq (1 + ε) y^{*}

for a tolerance

ε

set to be a small positive real. See Figure 6 for an illustration of Step 4.

Step 5: To solve Problem (II-b), enumerate all graphs

G^{*} \in G

such that

f (G^{*}) = x^{*}

for each vector

x^{*} \in F^{*}

. See Figure 7 for an illustration of Step 5.

In this paper, we set a graph class

G

to be the set of rank-2 graphs. In Step 4, we solve an MILP

M (x, g; C_{2})

that is formulated on a novel idea of representing rank-2 chemical graphs, as will be discussed in Section 2.3.2. In Step 5, we use branch-and-bound algorithms for enumerating rank-2 chemical compounds [25,26].

2.3. Representing Rank-2 Chemical Graphs

This section introduces a method of modeling rank-2 chemical graphs with different cyclic structures in a unified way and proposes an MILP formulation that represents a rank-2 chemical graph G of n vertices.

2.3.1. Scheme Graphs and Tree-Extensions

Given positive integers

n^{*}

and p, a graph with

n^{*}

vertices and p edges can be represented as a subgraph of a complete graph

K_{n^{*}}

with

n^{*} (n^{*} - 1) / 2

edges. However, formulating this as an MILP may require

Ω ({(n^{*})}^{2})

variables and constraints. To reduce the number of variables and constraints in an MILP that represents a rank-2 graph, we decompose a rank-2 graph G into the core and non-core of G so that the core is represented by one of the three rank-2 polymer topologies and the non-core is a collection of trees in which the height is bounded by the core height of G. We do not specify how many subtrees will be attached to each edge in the polymer topology in advance, since otherwise we would need a different MILP for each distinct combination of such assignments of subtrees. Instead, we allow each edge in a polymer topology to collect a necessary number of subtrees in our MILP (see the next section for more detail). In this section, we introduce a “scheme graph” to represent three possible rank-2 polymer topologies, an “extension” of the scheme graph to represent the core of a rank-2 graph and a “tree-extension” to represent a combination of the core and non-core of a rank-2 graph, so that any of the three kinds of rank-2 polymer topologies can be selected in a single MILP formulation.

Scheme Graphs

Formally, we define the scheme graph for rank 2 to be a pair

(K, E)

of a multigraph K and an ordered partition

E = (E_{1}, E_{2}, E_{3})

of the edge set

E (K)

. Figure 2d illustrates the scheme graph

(K = ({u_{1}, u_{2}, u_{3}, u_{4}}, E), E = (E_{1}, E_{2}, E_{3}))

. An edge in

E_{1}

is called a semi-edge, an edge in

E_{2}

is called a virtual edge and an edge in

E_{3}

is called a real edge.

Extensions of Scheme Graphs

Based on the scheme graph

(K, E)

, we construct the core of a rank-2 graph H as an “extension,” which is defined as follows (see also Figure 8). An extension of the scheme graph

(K, E)

is defined to be a simple graph obtained from K by using each real edge

e = u v \in E_{3}

, by eliminating or replacing each virtual edge

e = u v \in E_{2}

(resp., semi-edge

e = u v \in E_{1}

) with a

u, v

-path of length at least two (resp., 1) in the core of H, where a

u, v

-path of length 1 means an edge

u v

. Figure 9a illustrates an extension

H_{core}

of the scheme graph

(K, E)

which is obtained by removing virtual edges

a_{4}, a_{5} \in E_{2}

and by replacing semi-edge

a_{1} \in E_{1}

with a path

(u_{1, 1}, v_{1, 1}, v_{2, 1}, u_{4, 1})

, semi-edge

a_{2} \in E_{1}

with a path

(u_{2, 1}, v_{3, 1}, v_{4, 1}, v_{5, 1}, u_{3, 1})

and by using semi-edge

a_{3} \in E_{1}

and real edges

a_{6}, a_{7} \in E_{3}

. The extension

H_{core}

in Figure 9a is isomorphic to the core of the rank-2 graph H in Figure 9b. Observe that each of the least simple graphs

S (M_{i})

,

i = 1, 2, 3

in Figure 2 is obtained as an extension of the scheme graph

(K, E)

in Figure 2d.

Tree-Extensions

Let

s^{*} = | V (K) | = 4

denote the number of vertices in the scheme graph. For non-negative integers a, b and c, we consider a rank-2 graph H such that

cs (H) = s^{*} + a = 4 + a

,

ch (H) = b

and the maximum degree of a core vertex is at most c. We define an “

(a, b, c)

-tree-extension” as a minimal supergraph of all such rank-2 graphs H. Formally, the

(a, b, c)

-tree-extension (or a tree-extension) is defined to be the graph obtained by augmenting the graph K as follows:

(i): For each vertex $u_{s} \in V (K)$ , $s \in [1, s^{*}]$ , create a copy $S_{s}$ of the rooted tree $T (c - 2, c - 1, b)$ . For each $s \in [1, s^{*}]$ , let the root of rooted tree $S_{s}$ be equal to the vertex $u_{s}$ and denote by $u_{s, i}$ the copy of the i-th vertex of $T (c - 2, c - 1, b)$ in $S_{s}$ (see Figure 8a).
(ii): Create a new path $(v_{1, 1}, v_{2, 1}, \dots, v_{a, 1})$ with a vertices, where the edge between $v_{t, 1}$ and $v_{t + 1, 1}$ is denoted by $e_{t + 1}$ (see Figure 8c). For each $t \in [1, a]$ , create a copy $T_{t}$ of the rooted tree $T (c - 2, c - 1, b)$ , let the root of rooted tree $T_{t}$ be equal to the vertex $v_{t, 1}$ and denote by $v_{t, i}$ the copy of the i-th vertex of $T (c - 2, c - 1, b)$ in $T_{t}$ (see Figure 8b).
(iii): For every pair $(s, t)$ with $s \in [1, s^{*}]$ and $t \in [1, a]$ , join vertices $u_{s, 1}$ and $v_{t, 1}$ with an edge $u_{s, 1} v_{t, 1}$ (see Figure 8c).

Figure 8 illustrates the

(3, 2, 4)

-tree-extension of the scheme graph. We show how a rank-2 graph can be constructed as a subgraph of a tree-extension with an example. Figure 9b illustrates a rank-2 graph H with

n (H) = 21

,

cs (H) = 9

,

ch (H) = 2

and

θ (H) = 1

, where the maximum degree of a non-core vertex is 3. To prepare a tree-extension so that the graph H can be a subgraph of the tree-extension, we set

{cs}^{*} : = cs (H)

,

a : = t^{*} : = {cs}^{*} - s^{*} = 5

,

b : = {ch}^{*} : = ch (H) = 2

and

c : = d_{max} : = 3

. Figure 9c illustrates a subgraph

H^{'}

of the

(t^{*} = 5, {ch}^{*} = 2, d_{max} = 3)

-tree-extension such that

H^{'}

is isomorphic to the rank-2 graph H.
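The size of an (a, b, c)-tree-extension follows from steps (i) to (iii): there are s* + a copies of the rooted tree T(c - 2, c - 1, b), a path on the a new roots, and s* · a joining edges. The sketch below counts the vertices and the edges added by these three steps. Edges of the scheme graph K itself are not counted, since K is specified only through a figure. The function names are assumptions of this sketch.

```python
def tree_size(a, b, c):
    """Number of vertices of T(a, b, c), counted level by level."""
    n, width = 1, 1
    for depth in range(c):
        width *= a if depth == 0 else b
        n += width
    return n

def tree_extension_size(a, b, c, s_star=4):
    """Vertices and added edges of the (a, b, c)-tree-extension.

    The edge count covers tree edges, path edges, and joining edges from
    steps (i)-(iii); edges of the scheme graph K are excluded.
    """
    n_tree = tree_size(c - 2, c - 1, b)
    n_vertices = (s_star + a) * n_tree
    tree_edges = (s_star + a) * (n_tree - 1)
    path_edges = max(a - 1, 0)
    join_edges = s_star * a
    return n_vertices, tree_edges + path_edges + join_edges

size_324 = tree_extension_size(3, 2, 4)  # the (3, 2, 4)-tree-extension
size_523 = tree_extension_size(5, 2, 3)  # the (5, 2, 3)-tree-extension
```

Under these counts the (5, 2, 3)-tree-extension has 36 vertices, which leaves room for the 21-vertex graph H of the example.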

2.3.2. MILPs for Rank-2 Chemical Graphs

We present an outline of our MILP

M (x, g; C_{2})

in Step 4 of the framework. For integers

d_{max}, n^{*}, {cs}^{*}

,

{ch}^{*}, θ^{*} \in Z

, let

H (d_{max}, n^{*}, {cs}^{*}, {ch}^{*}, θ^{*})

denote the set of rank-2 graphs H such that the degree of each core vertex is at most 4, the degree of each non-core vertex is at most

d_{max}

,

n (H) = n^{*}

,

cs (H) = {cs}^{*}

,

ch (H) = {ch}^{*}

and

θ (H) = θ^{*}

. In this paper, we obtain the following result.

Theorem 2.

Let Λ be a set of chemical elements, Γ be a set of adjacency-configurations, where

| Λ | \leq | Γ |

, and

k = 2 | Λ | + 2 | Γ | + 15

. Given integers

d_{max} \in {3, 4}

,

n^{*} \geq 3

,

{cs}^{*} \geq 3

,

{ch}^{*} \geq 0

and

θ^{*}

, there is an MILP

M (x, g; C_{2})

that consists of variable vectors

x \in R^{k}

and

g \in R^{q}

for some integer

q = O (| Γ | \cdot {cs}^{*} \cdot {(d_{max} - 1)}^{{ch}^{*}})

and a set

C_{2}

of

O (| Γ | + {cs}^{*} \cdot {(d_{max} - 1)}^{{ch}^{*}})

constraints on these variables such that:

(x^{*}, g^{*})

is feasible to

M (x, g; C_{2})

if and only if

g^{*}

forms a rank-2 chemical graph

G^{*} = (H, α, β) \in G (Λ, Γ)

such that

H \in H (d_{max}, n^{*}, {cs}^{*}, {ch}^{*}, θ^{*})

and

f (G^{*}) = x^{*}

.

Note that our MILP requires only

O (n^{*})

variables and constraints when the maximum core height of a subtree in the non-core of

G^{*}

and

| Γ |

are constant. We formulate an MILP in Theorem 2 so that such a graph H is selected as a subgraph of the scheme graph.

We explain the basic idea of our MILP. Define

t^{*} ≜ {cs}^{*} - s^{*},

c^{*} ≜ | E_{1} \cup E_{2} | \text{ for } (K, E = (E_{1}, E_{2}, E_{3})),

n_{tree} ≜ 1 + 2 ({(d_{max} - 1)}^{{ch}^{*}} - 1) / (d_{max} - 2) \text{ and } n_{in} ≜ 1 + 2 ({(d_{max} - 1)}^{{ch}^{*} - 1} - 1) / (d_{max} - 2),

where

n_{tree}

and

n_{in}

are the numbers of vertices and non-leaf vertices in the rooted tree

T (2, d_{max} - 1, {ch}^{*})

, respectively. The MILP mainly consists of the following four types of constraints.

1. Constraints for selecting a rank-2 graph H as a subgraph of the $(t^{*}, {ch}^{*}, d_{max})$ -tree-extension of the scheme graph $(K, E)$ ;
2. Constraints for assigning chemical elements to vertices and multiplicity to edges to determine a chemical graph $G = (H, α, β)$ ;
3. Constraints for computing descriptors from the selected rank-2 chemical graph G; and
4. Constraints for reducing the number of rank-2 chemical graphs that are isomorphic to each other but can be represented by the above constraints.
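As a quick sanity check of the sizes n_{tree} and n_{in} defined above, the following Python sketch counts the vertices of a rooted tree level by level and compares the counts with the closed forms. It assumes, as the factor 2 in both formulas suggests, that the root has two children and every other non-leaf vertex has d_{max} - 1 children; the function names are illustrative and are not taken from the authors' implementation.

```python
def tree_sizes(d_max, ch):
    """Count vertices and non-leaf vertices of a rooted tree of height ch
    whose root has 2 children and whose other non-leaf vertices have
    d_max - 1 children (assumed shape, see the text above)."""
    # level[h] is the number of vertices at distance h from the root.
    level = [1] + [2 * (d_max - 1) ** (h - 1) for h in range(1, ch + 1)]
    n_tree = sum(level)
    n_in = sum(level[:-1])  # every level except the deepest is non-leaf
    return n_tree, n_in


def closed_form(d_max, ch):
    """Closed forms for n_tree and n_in (valid for d_max >= 3, ch >= 1)."""
    b = d_max - 1
    n_tree = 1 + 2 * (b ** ch - 1) // (d_max - 2)
    n_in = 1 + 2 * (b ** (ch - 1) - 1) // (d_max - 2)
    return n_tree, n_in
```

For example, `tree_sizes(4, 3)` counts the levels 1, 2, 6 and 18, which agrees with the closed form for d_{max} = 4 and {ch}^{*} = 3.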

In the constraints of 1, we treat each edge in the tree-extension as a directed edge because describing some condition for H to belong to

H (d_{max}, n^{*}, {cs}^{*}, {ch}^{*}, θ^{*})

becomes slightly easier than the case of undirected graphs. More formally we prepare the following.

(i): In the scheme graph $(K, E)$ , denote the edges in $E_{1} \cup E_{2} \cup E_{3}$ by $E_{1} = {a_{1}, a_{2}, \dots, a_{| E_{1} |}}$ , $E_{2} = {a_{| E_{1} | + 1}, \dots, a_{c^{*}}}$ and $E_{3} = {a_{c^{*} + 1}, \dots, a_{m}}$ (where $c^{*} = | E_{1} \cup E_{2} |$ ), and regard each edge $a_{i} = u_{s, 1} u_{s^{'}, 1} \in E_{1} \cup E_{2} \cup E_{3}$ as a directed edge from one end-vertex $u_{s, 1}$ to the other end-vertex $u_{s^{'}, 1}$ with $s < s^{'}$ . Let $a (i)$ be a binary variable for each edge $a_{i}$ , $i \in [1, m]$ .
(ii): In each tree $S_{s}$ (resp., $T_{t}$ ) in the tree-extension, we regard each edge $e_{s, i}^{'}$ , $i \geq 2$ in the rooted tree $S_{s}$ , $s \in [1, s^{*}]$ (resp., $e_{t, i}$ , $i \geq 2$ in the rooted tree $T_{t}$ , $t \in [1, t^{*}]$ ) as a directed edge from vertex $u_{s, prt (i)}$ to vertex $u_{s, i}$ (resp., from vertex $v_{t, prt (i)}$ to vertex $v_{t, i}$ ). Let $u (s, i)$ (resp., $v (t, i)$ ) be a binary variable for vertex $u_{s, i}$ , $s \in [1, s^{*}]$ (resp., $t \in [1, t^{*}]$ ) and $i \in [1, n_{tree}]$ ;
(iii): In the path $P_{t^{*}}$ consisting of the roots of trees $T_{t}$ , $t \in [1, t^{*}]$ , we regard each edge $e_{t}$ , $t \in [2, t^{*}]$ as a directed edge from vertex $v_{t - 1, 1}$ to vertex $v_{t, 1}$ ; and
(iv): We regard each edge $u_{s, 1} v_{t, 1}$ for $s \in [1, s^{*}]$ and $t \in [1, t^{*}]$ as two directed edges, one directed from vertex $u_{s, 1}$ to vertex $v_{t, 1}$ and the other directed oppositely. Let $e (s, t)$ (resp., $e (t, s)$ ) be a binary variable of directed edge $(u_{s, 1}, v_{t, 1})$ (resp., $(v_{t, 1}, u_{s, 1})$ ).

Based on these, we include constraints with some additional variables so that a selected subgraph H is a connected rank-2 graph. See Equations (A10) to (A42) in Appendix A for the details.

In the constraints of 2, we prepare an integer variable

\tilde{α} (u)

for each vertex u in the tree-extension that represents the chemical element

α (u) \in Λ

if u is in a selected graph H (or

\tilde{α} (u) = 0

otherwise) and an integer variable

\tilde{β} (e) \in [0, 3]

(resp.,

\hat{β} (e) \in [0, 3]

) for each edge e (resp.,

e = e (s, t)

or

e (t, s)

,

s \in [1, s^{*}]

,

t \in [1, t^{*}]

) in the tree-extension that represents the multiplicity

β (e) \in [1, 3]

if e is in a selected graph H (or

\tilde{β} (e)

or

\hat{β} (e)

takes 0 otherwise). This determines a chemical graph

G = (H, α, β)

. Also we include constraints for a selected chemical graph G to satisfy the valence condition

(α (u), α (v), β (u v)) \in Γ

for each edge

u v \in E

. See Equations (A43) to (A61) in Appendix A for the details.

In the constraints of 3, we introduce a variable for each descriptor and constraints with some more variables to compute the value of each descriptor in

f (G)

for a selected chemical graph G. See Equations (A62) to (A113) in Appendix A for the details.

With constraints 1 to 3, our MILP formulation already represents a rank-2 chemical graph G and a feature vector

x \in R^{k}

so that

x = f (G)

holds. In the constraints of 4, we include some additional constraints so that the search space required for an MILP solver to solve an instance of our MILP problem is reduced. For this, we consider a graph-isomorphism of rooted subtrees of each tree

S_{s}

or

T_{s}

and define a canonical form among subtrees that are isomorphic to each other. We try to eliminate a chemical graph G that has a subtree in

S_{s}

or

T_{s}

that is not in canonical form. See Equations (A114) to (A119) in Appendix A for the details.

3. Results

We implemented our method of Steps 1 to 5 for inferring rank-2 chemical graphs and conducted experiments to evaluate the computational efficiency for three chemical properties

π

: octanol/water partition coefficient (Kow), melting point (Mp), and boiling point (Bp). We executed the experiments on a PC with an Intel Core i5 1.6 GHz CPU and 8 GB of RAM running macOS version 10.14.6. We show 2D drawings of some of the inferred chemical graphs, where ChemDoodle version 10.2.0 was used to construct the drawings.

Results on Phase 1.

Step 1. We set a graph class

G

to be the set of all rank-2 chemical graphs. For each property

π \in

{Kow, Mp, Bp}, we selected a set

Λ

of chemical elements and collected a data set

D_{π}

on rank-2 chemical graphs over

Λ

provided by HSDB from PubChem. To construct the data set, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as

N^{+}

or an element

a \in Λ

in which the valence is different from our setting of valence function

val

.

Table 1 shows the size and range of data sets that we prepared for each chemical property in Step 1, where we denote the following:

-: $π$ : one of the chemical properties Kow, Mp and Bp;
-: $| D_{π} |$ : the size of data set $D_{π}$ for property $π$ ;
-: $Λ$ : the set of chemical elements over data set $D_{π}$ (hydrogen atoms are added at the final stage);
-: $| Γ |$ : the number of tuples in $Γ$ ;
-: $[\underline{n}, \bar{n}]$ : the minimum and maximum number $n (G)$ of non-hydrogen atoms over data set $D_{π}$ ;
-: $[\underline{cs}, \bar{cs}]$ , $[\underline{ch}, \bar{ch}]$ : the minimum and maximum core size and core height over chemical compounds in $D_{π}$ , respectively;
-: $[\underline{θ}, \bar{θ}]$ : the minimum and maximum values of the topological parameter $θ (G)$ over data set $D_{π}$ ; and
-: $[\underline{a}, \bar{a}]$ : the minimum and maximum values of $a (G)$ in $π$ over data set $D_{π}$ .

Step 2. We used a feature function f that consists of the descriptors defined in Section 2.1.

Step 3. We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs

N

where the tool and activation function are set to be MLPRegressor and ReLU, respectively. We tested several different architectures of ANNs for each chemical property. To evaluate the performance of the resulting prediction function

ψ_{N}

with cross-validation, we partition a given data set

D_{π}

into five subsets

D_{π}^{(i)}

,

i \in [1, 5]

randomly, where

D_{π} ∖ D_{π}^{(i)}

is used for a training set and

D_{π}^{(i)}

is used for a test set in five trials

i \in [1, 5]

. For a set

{y_{1}, y_{2}, \dots, y_{N}}

of observed values and a set

{ψ_{1}, ψ_{2}, \dots, ψ_{N}}

of predicted values, we define the coefficient of determination to be

R^{2} ≜ 1 - \frac{\sum_{j \in [1, N]} {(y_{j} - ψ_{j})}^{2}}{\sum_{j \in [1, N]} {(y_{j} - \bar{y})}^{2}}

, where

\bar{y} = \frac{1}{N} \sum_{j \in [1, N]} y_{j}

. Table 2 shows the results on Steps 2 and 3, where

-: k: the number of descriptors for the chemical compounds in data set $D_{π}$ for property $π$ ;
-: Activation: the choice of activation function;
-: Architecture: $(a, b, 1)$ consists of an input layer with a nodes, a hidden layer with b nodes, and an output layer with a single node, where a is equal to the number of descriptors;
-: L-time: the average time (sec.) to construct ANNs for each trial;
-: test $R^{2}$ (ave.): the average of coefficient of determination over the five test sets; and
-: test $R^{2}$ (best): the largest value of coefficient of determination over the five test sets.
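The coefficient of determination defined above can be computed directly from the observed and predicted values. The following is a minimal pure-Python sketch of that formula; the function name `r_squared` is ours and is not part of the authors' code.

```python
def r_squared(observed, predicted):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    mean = sum(observed) / len(observed)
    # Residual sum of squares: observed versus predicted values.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: observed values versus their mean.
    ss_tot = sum((y - mean) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot
```

Note that R^{2} equals 1 for a perfect prediction, equals 0 when every prediction is the mean of the observed values, and can be negative for predictions worse than the mean.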

For each chemical property

π

, we selected the ANN

N

that attained the best test

R^{2}

score among the five ANNs to formulate an MILP

M (x, y, z; C_{1})

in the second phase.
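The selection procedure of Step 3 (five random subsets, one trial per subset, keep the trial with the best test score) can be sketched as follows. This is only an illustration under stated assumptions: `train_and_score` is a hypothetical caller-supplied function that trains a model on the training indices and returns its test R^{2}; in the experiments this role is played by the ANN built with scikit-learn, which is not reproduced here.

```python
import random


def five_fold_partition(n_samples, n_folds=5, seed=0):
    """Randomly split indices 0..n_samples-1 into n_folds disjoint subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def select_best_trial(n_samples, train_and_score, n_folds=5, seed=0):
    """Run one trial per fold and return (best_trial_index, scores).

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns the test score on test_idx.
    """
    folds = five_fold_partition(n_samples, n_folds, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training set: all samples outside the i-th subset.
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    best = max(range(n_folds), key=scores.__getitem__)
    return best, scores
```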

Results on Phase 2.

We implemented Steps 4 and 5 in Phase 2 as follows.

Step 4. In this step, we solve the MILP

M (x, y, g; C_{1}, C_{2})

formulated based on the ANN

N

obtained in Phase 1. To solve an MILP in Step 4, we use CPLEX version 12.10. In our experiment, we choose a target value

y^{*} \in [\underline{a}, \bar{a}]

and fix or bound some descriptors in our feature vector as follows:

-: Fix variable $θ$ that represents the topological parameter $θ (H)$ to be each integer in ${- 2, 0, 2}$ ;
-: Set $d_{\max}$ to be each of 3 and 4;
-: Fix $n^{*}$ to be some four integers in ${15, 19, 20, 25, 30}$ for $θ \in {- 2, 0}$ and ${15, 19, 20, 22, 25}$ for $θ = 2$ ;
-: Choose three integers from $[7, 16]$ and fix ${cs}^{*}$ to be each of the three integers;
-: Fix ${ch}^{*}$ to be each of the four integers in $[2, 5]$ .

Based on the above setting, we generated 12 instances for each

n^{*}

. We set

ε = 0.02

in Step 4.
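The count of 12 instances per value of n^{*} follows from combining three values of {cs}^{*} with four values of {ch}^{*}. The sketch below enumerates such a grid. The three core sizes in `CS_VALUES` are a hypothetical choice made only for illustration, since the text does not state which three integers from [7, 16] were used.

```python
from itertools import product

# Hypothetical choice of three core sizes from [7, 16]; the text does not
# state which three integers were used in the experiments.
CS_VALUES = (7, 11, 16)
CH_VALUES = (2, 3, 4, 5)  # the four integers in [2, 5]


def instances_for(n_star, theta, d_max):
    """List the parameter settings of the MILP instances for one n*."""
    return [
        {"n": n_star, "cs": cs, "ch": ch, "theta": theta, "d_max": d_max}
        for cs, ch in product(CS_VALUES, CH_VALUES)
    ]
```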

Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show the results of Step 4 for

d_{max} = 3

and 4, respectively, where we denote the following:

-: $y_{π}^{*}$ : a target value in $[\underline{a}, \bar{a}]$ for a property $π$ ;
-: $n^{*}$ : a specified number of vertices in $[\underline{n}, \bar{n}]$ ;
-: $| F^{*} | / \# I$ : $\# I$ is the number of MILP instances in Step 4 (where $\# I = 12$), and $| F^{*} |$ is the size of the set $F^{*}$ of vectors $x^{*}$ generated from all feasible instances among the $\# I$ MILP instances in Step 4;
-: IP-time: the average time (sec.) to solve one of the #I MILP instances to find a set $F^{*}$ of vectors $x^{*}$ .

Figure 10a–c illustrate some rank-2 chemical graphs

G^{*}

with

θ (G^{*}) = - 2

constructed from the vector

g^{*}

obtained by solving the MILP in Step 4.

Figure 11a–c illustrate some rank-2 chemical graphs

G^{*}

with

θ (G^{*}) = 0

constructed from the vector

g^{*}

obtained by solving the MILP in Step 4.

Figure 12a–c illustrate some rank-2 chemical graphs

G^{*}

with

θ (G^{*}) = 2

constructed from the vector

g^{*}

obtained by solving the MILP in Step 4.

Step 5. In this step, we modified the algorithms proposed by Tamura et al. [25] and Yamashita et al. [26] to enumerate all rank-2 graphs

G^{*} \in G

such that

f (G^{*}) = x^{*}

for each

x^{*} \in F^{*}

. We stop the execution when either the total number of graphs inferred over all vectors

x^{*} \in F^{*}

exceeds 100 or the execution time exceeds one hour.
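The stopping rule of Step 5 can be sketched as a loop with a count limit and a time limit. This is a minimal illustration, not the authors' enumeration code: `enumerate_graphs` is a hypothetical generator standing in for the modified enumeration algorithms, and we assume that the run stops as soon as the count reaches the limit, in line with the "up to 100" graphs reported in the tables.

```python
import time


def enumerate_with_limits(vectors, enumerate_graphs, max_graphs=100,
                          time_limit_s=3600.0, clock=time.monotonic):
    """Collect graphs over all vectors until a count or time limit is hit.

    enumerate_graphs(x) is a hypothetical generator yielding the chemical
    graphs G with f(G) = x. Returns (graphs, timed_out). The time limit is
    only checked each time a graph is produced.
    """
    start = clock()
    graphs = []
    for x in vectors:
        for g in enumerate_graphs(x):
            graphs.append(g)
            if len(graphs) >= max_graphs:
                return graphs, False
            if clock() - start > time_limit_s:
                return graphs, True
    return graphs, False
```

The `clock` parameter exists only so that the time limit can be exercised with a fake clock in tests.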

Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show the results on Step 5 for

d_{max} = 3

and 4, respectively, where we denote the following:

-: $# G^{*}$ : the number of all (or up to 100) rank-2 chemical graphs $G^{*}$ that are computed under 1 h time limit in Step 5, where $f (G^{*}) = x^{*}$ for some $x^{*} \in F^{*}$ . (Note that $| F^{*} |$ such graphs $G^{*}$ have been found in Step 4, and Figure 10, Figure 11 and Figure 12 illustrate some of such graphs $G^{*}$ .);
-: G-time: the running time (sec.) to execute Step 5, where “>1 h” means that the execution time exceeds the limit.

We also conducted some additional experiments to demonstrate that our MILP-based method is flexible to control conditions on the inference of chemical graphs. In Step 3, we constructed an ANN

N_{π}

for each of the three chemical properties

π \in {

Kow, Mp, Bp}, and formulated the inverse problem of each ANN

N_{π}

as an MILP

M_{π}

. Since the set of descriptors is common to all three properties Kow, Mp, and Bp, it is possible to infer a rank-2 chemical graph

G^{*}

that satisfies a target value

y_{π}^{*}

for each of the three properties at the same time (if one exists). We specify the size of graph so that

n : = 22

, core size := 14, core height := 3,

θ : = - 2

and

d_{max} : = 3

, and set target values with

y_{Kow}^{*} : = 5

,

y_{Mp}^{*} : = 150

and

y_{Bp}^{*} : = 250

in an MILP that consists of the three MILPs

M_{Kow}

,

M_{Mp}

and

M_{Bp}

. The MILP was solved in 268.11 s, and we obtained the rank-2 chemical graph

G^{*}

illustrated in Figure 10d.

4. Discussion

In this paper, we proposed a new method for the inverse QSAR/QSPR to rank-2 chemical graphs by significantly enhancing the framework due to Azam et al. [18], Zhang et al. [20], and Ito et al. [21], and implemented it for inferring rank-2 chemical graphs using the algorithms for enumerating rank-2 chemical graphs due to Tamura et al. [25] and Yamashita et al. [26]. From the results on some computational experiments, we observe that the proposed method runs efficiently for an instance with

n^{*} \leq 30

non-hydrogen atoms up to Step 4 and an instance with

n^{*} \leq 15

non-hydrogen atoms up to Step 5. Due to this development, the ratio of chemical compounds covered in the PubChem database increased from 16.26% to 44.5%. It is left as future work to apply our new method for the inverse QSAR/QSPR to a wider class of graphs. The ratio of the number of chemical graphs with rank at most 3 (resp., 4) to the number of all chemical graphs in database PubChem is 68.8% (resp., 84.7%). Among rank-4 chemical compounds, Remdesivir

C_{27} H_{35} N_{6} O_{8} P

, an antiviral medication, which is being studied as a possible post-infection treatment for COVID-19, has a chemical graph G with

r (G) = 4

,

n (G) = 42

,

cs (G) = 24

, and

ch (G) = 8

. The number of polymer topologies with rank 3 (resp., 4) such that the maximum degree is at most 4 is 12 (resp., 73). Our MILP formulation can be easily extended to the case of rank 3 or 4 by replacing the current set of constraints for the scheme graph with a set of those for a new scheme graph that is designed for rank-3 or -4 polymer topologies.

Author Contributions

Conceptualization, H.N. and T.A.; methodology, H.N.; software, J.Z., C.W., and A.S.; validation, J.Z., C.W., A.S., and H.N.; formal analysis, H.N.; data resources, H.N. and T.A.; writing—original draft preparation, H.N.; writing—review and editing, T.A.; project administration, H.N.; funding acquisition, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

H.N. and T.A. were partially supported by the Japan Society for the Promotion of Science, Japan, under Grant #18H04113.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. All Constraints in an MILP Formulation for Rank-2 Chemical Graphs

To formulate an MILP that represents a chemical graph

G = (H, α, β)

, we distinguish a tuple

(a, b, k)

from a tuple

(b, a, k)

. For a tuple

γ = (a, b, k) \in Λ \times Λ \times {1, 2, 3}

, let

\bar{γ}

denote the tuple

(b, a, k)

. Let

Γ_{<} ≜ {\bar{γ} ∣ γ \in Γ_{>}}

. We call a tuple

γ = (a, b, k) \in Λ \times Λ \times {1, 2, 3}

proper if

k \leq min {val (a), val (b)} and k \leq max {val (a), val (b)} - 1,

where the latter is assumed because otherwise G must consist of two atoms of

a = b

. Assume that each tuple

γ \in Γ

is proper. Let

ϵ

be a fictitious chemical element that represents null, call a tuple

(a, b, 0)

with

a, b \in Λ \cup {ϵ}

fictitious, and define

Γ_{0}

to be the set of all fictitious tuples; i.e.,

Γ_{0} = {(a, b, 0) ∣ a, b \in Λ \cup {ϵ}}

. To represent chemical elements

e \in Λ \cup {ϵ} \cup Γ

in an MILP, we encode these elements

e

into some integers denoted by

[e]

. Assume that, for each element

a \in Λ

,

[a]

is a positive integer and that

[ϵ] = 0

.
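The two conventions above, proper tuples and the integer coding of elements, can be illustrated with a short Python sketch. The valence table `VAL` and the label `"eps"` for the fictitious element are assumptions made for this example; the paper fixes its own valence function val and coding.

```python
def is_proper(a, b, k, val):
    """Check k <= min{val(a), val(b)} and k <= max{val(a), val(b)} - 1."""
    return k <= min(val[a], val[b]) and k <= max(val[a], val[b]) - 1


def encode_elements(elements):
    """Assign [eps] = 0 and distinct positive integers to the elements."""
    code = {"eps": 0}  # "eps" stands for the fictitious null element
    for i, a in enumerate(sorted(elements), start=1):
        code[a] = i
    return code


# Illustrative valences (an assumption for this example only).
VAL = {"C": 4, "N": 3, "O": 2}
```

With these valences, the tuple (O, O, 2) is not proper: a double bond between two oxygen atoms uses up both valences, so the graph could contain no other atom.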

Appendix A.1. Applicability Domain

We use the range-based method to define an applicability domain for our method. For this, we find the range (the minimum and maximum) of each descriptor over all relevant chemical compounds and represent each range as a set of linear constraints in the constraint set

C_{1}

of our MILP formulation. Recall that

D_{π}

stands for a set of chemical graphs used for constructing a prediction function. However, the number of examples in

D_{π}

may not be large enough to capture a general feature on the structure of chemical graphs. For this, we also use some data set from the whole set

DB

of chemical graphs in a database. Let

{DB}_{G}^{(i)}

denote the set of chemical graphs

G \in DB \cap G

such that

n (G) = i

for each integer

i \geq 1

. Formally the set of variables and constraints is given as follows.

AD constraints in $C_{1}$ :

constants:

Integers

{cs}^{*} \geq 3

and

{ch}^{*} \geq 1

; An integer

d_{max} \in {3, 4}

;

An integer

n^{*} \in [{cs}^{*} + 1, {cs}^{*} \cdot {(d_{max} - 1)}^{{ch}^{*}}]

;

variables for descriptors in x:

A real variable

κ_{1} \geq 0

:

κ_{1}

represents

κ_{1} (H)

;

dg (i) \in [0, n^{*}]

(i \in [1, 4])

:

dg (i)

represents the number of vertices of degree i in H;

Mass \in Z

:

Mass

represents

\sum_{v \in V} {mass}^{*} (α (v))

;

{ce}^{co} (a) \in [0, n^{*}]

,

a \in Λ

:

{ce}^{co} (a)

represents the number of vertices of chemical element

a in the core of H;

{ce}^{nc} (a) \in [0, n^{*}]

,

a \in Λ

:

{ce}^{nc} (a)

represents the number of vertices of chemical element

a in the non-core of H;

b^{co} (k) \in [0, 2 n^{*}]

,

k \in [1, 3]

:

b^{co} (k)

represents the number of k-bonds in the core of H;

b^{nc} (k) \in [0, 2 n^{*}]

,

k \in [1, 3]

:

b^{nc} (k)

represents the number of k-bonds in the non-core of H;

{ac}^{co} (γ) \in [0, n^{*}]

,

γ \in Γ_{<} \cup Γ_{=}

:

{ac}^{co} (γ)

represents the number of core edges

in H that are assigned tuple

γ \in Γ_{<}

;

{ac}^{nc} (γ) \in [0, n^{*}]

,

γ \in Γ_{<} \cup Γ_{=}

:

{ac}^{nc} (γ)

represents the number of non-core edges in

H that are assigned tuple

γ \in Γ_{<}

;

constraints:

\begin{matrix} n^{*} min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{κ_{1} (G)}{n (G)} \leq κ_{1} \leq n^{*} max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{κ_{1} (G)}{n (G)}, \end{matrix}

(A1)

\begin{matrix} n^{*} min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{dg}_{i} (G)}{n (G)} \leq dg (i) \leq n^{*} max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{dg}_{i} (G)}{n (G)}, & i \in [1, 4], \end{matrix}

(A2)

\begin{matrix} n^{*} min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \bar{ms} (G) \leq Mass \leq n^{*} max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \bar{ms} (G), \end{matrix}

(A3)

\begin{matrix} n^{*} min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ce}_{a}^{co} (G)}{n (G)} \leq {ce}^{co} (a) \leq n^{*} max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ce}_{a}^{co} (G)}{n (G)}, & a \in Λ, \end{matrix}

(A4)

\begin{matrix} n^{*} min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ce}_{a}^{nc} (G)}{n (G)} \leq {ce}^{nc} (a) \leq n^{*} max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ce}_{a}^{nc} (G)}{n (G)}, & a \in Λ, \end{matrix}

(A5)

\begin{matrix} (n^{*} + 1) min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{b_{k}^{co} (G)}{n (G) + 1} \leq b^{co} (k) \leq (n^{*} + 1) max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{b_{k}^{co} (G)}{n (G) + 1}, & k \in [2, 3], \end{matrix}

(A6)

\begin{matrix} (n^{*} + 1) min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{b_{k}^{nc} (G)}{n (G) + 1} \leq b^{nc} (k) \leq (n^{*} + 1) max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{b_{k}^{nc} (G)}{n (G) + 1}, & k \in [2, 3], \end{matrix}

(A7)

\begin{matrix} (n^{*} + 1) min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ac}_{γ}^{co} (G)}{n (G) + 1} \leq {ac}^{co} (γ) \leq (n^{*} + 1) max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ac}_{γ}^{co} (G)}{n (G) + 1}, & γ \in Γ, \end{matrix}

(A8)

\begin{matrix} (n^{*} + 1) min_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ac}_{γ}^{nc} (G)}{n (G) + 1} \leq {ac}^{nc} (γ) \leq (n^{*} + 1) max_{G \in D_{π} \cup {DB}_{G}^{(n^{*})}} \frac{{ac}_{γ}^{nc} (G)}{n (G) + 1}, & γ \in Γ . \end{matrix}

(A9)
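The range-based bounds in Equations (A1) to (A9) all follow one pattern: normalize a descriptor by the graph size over the reference graphs, take the minimum and maximum ratio, and rescale to the target size n^{*}. The sketch below shows this pattern under the assumption that the reference data are given as (value, n(G)) pairs; the function name and the `offset` parameter are ours.

```python
from fractions import Fraction


def ad_bounds(n_star, samples, offset=0):
    """Range-based bounds on a descriptor for a graph with n_star vertices.

    samples is a list of (value, n) pairs, one per reference graph G, where
    value is the descriptor of G and n = n(G). offset = 0 gives the n(G)
    scaling of (A1), (A2), (A4), (A5); offset = 1 gives the (n(G) + 1)
    scaling of (A6) to (A9).
    """
    # Exact rational arithmetic avoids rounding at the interval ends.
    ratios = [Fraction(value) / (n + offset) for value, n in samples]
    scale = n_star + offset
    return scale * min(ratios), scale * max(ratios)
```

For instance, with two reference graphs having 2 of 10 and 6 of 20 vertices of some degree, a target graph with n^{*} = 10 would be allowed between 2 and 3 such vertices.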

In the following, we derive an MILP

M (x, g; C_{2})

that satisfies the condition in Theorem 2. Let

d_{max} \in {3, 4}

,

n^{*} \geq 3

,

{cs}^{*} \geq 3

,

{ch}^{*} \geq 0

and

θ^{*}

be given integers. We describe the set

C_{2}

with several sets of constraints.

Appendix A.2. Construction of Scheme Graph and Tree-Extension

We infer a subgraph H such that the maximum degree is

d_{max} \in {3, 4}

,

n (H) = n^{*}

,

cs (H) = {cs}^{*}

and

ch (H) = {ch}^{*}

. For this, we first construct the

(t^{*}, {ch}^{*}, d_{max})

-tree-extension of the scheme graph

(K = (V_{K} = {u_{1}, \dots, u_{s^{*}}}, E_{K} = {a_{1}, a_{2}, \dots, a_{m}}), E = (E_{1}, E_{2}, E_{3}))

. We use the following notations: For

j \in [1, 3]

and

s \in [1, s^{*}]

, let

E_{j}^{+} (s)

(resp.,

E_{j}^{-} (s)

) denote the set of indices i of edges

a_{i} \in E_{j}

such that the tail (resp., head) of

a_{i}

is

u_{s, 1}

. Let

E_{j, k}^{+} (s) ≜ E_{j}^{+} (s) \cup E_{k}^{+} (s)

,

E_{j, k}^{-} (s) ≜ E_{j}^{-} (s) \cup E_{k}^{-} (s)

,

E_{j} (s) ≜ E_{j}^{+} (s) \cup E_{j}^{-} (s)

and

E_{j, k} (s) ≜ E_{j} (s) \cup E_{k} (s)

.

As described in Section 2.3.1, some edge

a_{i} \in E_{1} \cup E_{2}

may be replaced with a subpath

P_{i}

of

(v_{1, 1}, v_{2, 1}, \dots, v_{t^{*}, 1})

, which consists of the roots of trees

T_{1}, T_{2}, \dots, T_{t^{*}}

. We assign color i to the vertices in such a subpath

P_{i}

by setting a variable

χ (t)

of each vertex

v_{t, 1} \in V (P_{i})

to be i. For each edge

u_{s, 1} v_{t, 1}

, we prepare a binary variable

e (s, t)

to denote that edge

u_{s, 1} v_{t, 1}

is used (resp., not used) in a selected graph H when

e (s, t) = 1

(resp.,

e (s, t) = 0

). We also include constraints necessary for the variables to satisfy a degree condition at each of the vertices

u_{s, 1}

,

s \in [1, s^{*}]

and

v_{t, 1}

,

t \in [1, t^{*}]

.

constants:

Integers

s^{*} = | V_{K} |

,

c^{*} = | E_{1} \cup E_{2} |

,

{cs}^{*} (\geq s^{*})

,

n^{*} (\geq {cs}^{*})

and

{ch}^{*} \geq 0

;

{\underline{d}}^{+} (s)

,

s \in [1, s^{*}]

: a lower bound on the out-degree of vertex

u_{s, 1}

in H;

{\underline{d}}^{-} (s)

,

s \in [1, s^{*}]

: a lower bound on the in-degree of vertex

u_{s, 1}

in H;

{\bar{d}}^{+} (s)

,

s \in [1, s^{*}]

: an upper bound on the out-degree of vertex

u_{s, 1}

in H;

{\bar{d}}^{-} (s)

,

s \in [1, s^{*}]

: an upper bound on the in-degree of vertex

u_{s, 1}

in H;

variables:

a (i) \in {0, 1}

,

i \in E_{1} \cup E_{3}

:

a (i)

represents edge

a_{i} \in E_{1} \cup E_{3}

(

a (i) = 1

,

i \in E_{3}

)

(

a (i) = 1

⇔ edge

a_{i}

is used in H);

e (s, t), e (t, s) \in {0, 1}

,

s \in [1, s^{*}]

,

t \in [1, t^{*}]

:

e (s, t)

(resp.,

e (t, s)

) represents

direction

(u_{s, 1}, v_{t, 1})

(resp.,

(v_{t, 1}, u_{s, 1})

), where

e (s, t) = 1

(resp.,

e (t, s) = 1

) ⇔

edge

u_{s, 1}, v_{t, 1}

is used in H and direction

(u_{s, 1}, v_{t, 1})

(resp.,

(v_{t, 1}, u_{s, 1})

) is assigned

to edge

u_{s, 1}, v_{t, 1}

;

χ (t) \in [1, c^{*}]

,

t \in [1, t^{*}]

:

χ (t)

represents the color assigned to vertex

v_{t, 1}

(

χ (t) = c

⇔ vertex

v_{t, 1}

is assigned color c);

clr (c) \in [0, n^{*} - s^{*}]

,

c \in [1, c^{*}]

: the number of vertices

v_{t, i}

with color c;

\deg^{co +} (s) \in [1, 4]

,

s \in [1, s^{*}]

: the out-degree of vertex

u_{s, 1}

in the core of H;

\deg^{co -} (s) \in [1, 4]

,

s \in [1, s^{*}]

: the in-degree of vertex

u_{s, 1}

in the core of H;

δ_{clr} (t, c) \in {0, 1}

,

t \in [1, t^{*}]

,

c \in [1, c^{*}]

(

δ_{clr} (t, c) = 1

⇔

χ (t) = c

);

constraints:

\begin{matrix} \sum_{c \in [1, c^{*}]} δ_{clr} (t, c) = 1, & t \in [1, t^{*}], \end{matrix}

(A10)

\begin{matrix} \sum_{c \in [1, c^{*}]} c \cdot δ_{clr} (t, c) = χ (t), & t \in [1, t^{*}], \end{matrix}

(A11)

\begin{matrix} \sum_{t \in [1, t^{*}]} δ_{clr} (t, c) = clr (c), & c \in [1, c^{*}], \end{matrix}

(A12)

\begin{matrix} e (s, t) + e (t, s) \leq 1, & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A13)

\begin{matrix} \sum_{s \in [1, s^{*}] ∖ {head (c)}} e (t, s) \leq 1 - δ_{clr} (t, c), & c \in [1, c^{*}], t \in [1, t^{*}], \end{matrix}

(A14)

\begin{matrix} \sum_{s \in [1, s^{*}] ∖ {tail (c)}} e (s, t) \leq 1 - δ_{clr} (t, c), & c \in [1, c^{*}], t \in [1, t^{*}], \end{matrix}

(A15)

\begin{matrix} \sum_{i \in E_{1, 3}^{-} (s)} a (i) + \sum_{t \in [1, t^{*}]} e (t, s) = \deg^{co -} (s), & s \in [1, s^{*}], \end{matrix}

(A16)

\begin{matrix} \sum_{i \in E_{1, 3}^{+} (s)} a (i) + \sum_{t \in [1, t^{*}]} e (s, t) = \deg^{co +} (s), & s \in [1, s^{*}], \end{matrix}

(A17)

\begin{matrix} {\underline{d}}^{+} (s) \leq \deg^{co +} (s) \leq {\bar{d}}^{+} (s), & s \in [1, s^{*}], \end{matrix}

(A18)

\begin{matrix} {\underline{d}}^{-} (s) \leq \deg^{co -} (s) \leq {\bar{d}}^{-} (s), & s \in [1, s^{*}] . \end{matrix}

(A19)
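Constraints (A10) to (A12) link the integer color χ (t) to binary indicators δ_{clr} (t, c) and to the counts clr (c). The following sketch builds the indicators from a given coloring and checks the three constraints for a candidate assignment. It is a plain checker written for illustration, not an MILP model, and the function names are ours.

```python
def color_indicators(chi, c_star):
    """Build delta_clr(t, c) and clr(c) from a coloring chi(1..t*)."""
    delta = [[1 if chi_t == c else 0 for c in range(1, c_star + 1)]
             for chi_t in chi]
    clr = [sum(row[c] for row in delta) for c in range(c_star)]
    return delta, clr


def satisfies_a10_a12(chi, delta, clr, c_star):
    """Check constraints (A10), (A11) and (A12) for a candidate assignment."""
    colors = range(1, c_star + 1)
    # (A10): exactly one color indicator is set for each vertex v_{t,1}.
    a10 = all(sum(row) == 1 for row in delta)
    # (A11): the indicator encodes the integer color chi(t).
    a11 = all(sum(c * row[c - 1] for c in colors) == chi_t
              for row, chi_t in zip(delta, chi))
    # (A12): clr(c) counts the vertices with color c.
    a12 = all(sum(row[c - 1] for row in delta) == clr[c - 1] for c in colors)
    return a10 and a11 and a12
```

The one-hot condition (A10) is needed in addition to (A11): for χ (t) = 3, setting the indicators of colors 1 and 2 also gives the weighted sum 3, and only (A10) excludes that assignment.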

Appendix A.3. Specification for Chemical Graphs with Rank 2

To generate any of the three rank-2 polymer topologies in

PT (2, 4)

, we use the scheme graph

(K = (V_{K} = {u_{1}, u_{2}, u_{3}, u_{4}}, E_{K})

,

E = (E_{1}, E_{2}, E_{3}))

in Figure 2d, where

s^{*} = | V (K) | = 4

,

c^{*} = | E_{1} \cup E_{2} | = 5

,

E_{1} = {a_{1} = (u_{1}, u_{4}), a_{2} = (u_{2}, u_{3}), a_{3} = (u_{2}, u_{4})}

,

E_{2} = {a_{4} = (u_{1}, u_{2}), a_{5} = (u_{3}, u_{4})}

and

E_{3} = {a_{6} = (u_{1}, u_{2}), a_{7} = (u_{3}, u_{4})}

. Recall that each color

i \in [1, c^{*}]

is assigned to edge

a_{i} \in E_{1} \cup E_{2}

. We impose some more constraints on the degree of each of the vertices

u_{s, 1}

,

s \in [1, s^{*}]

and

v_{t, 1}

,

t \in [1, t^{*}]

so that the core of a selected graph H satisfies one of the three least simple graphs in Figure 2a–c. We also let a variable

θ

mean the topological parameter

θ (H)

of a selected subgraph H.

constants:

s^{*} = 4

,

c^{*} = 5

,

E_{1}^{-} (1) = \emptyset

,

E_{2}^{-} (1) = \emptyset

,

E_{3}^{-} (1) = \emptyset

,

E_{1}^{+} (1) = {1}

,

E_{2}^{+} (1) = {4}

,

E_{3}^{+} (1) = {6}

,

E_{1}^{-} (2) = \emptyset

,

E_{2}^{-} (2) = {4}

,

E_{3}^{-} (2) = {6}

,

E_{1}^{+} (2) = {2, 3}

,

E_{2}^{+} (2) = \emptyset

,

E_{3}^{+} (2) = \emptyset

,

E_{1}^{-} (3) = {2}

,

E_{2}^{-} (3) = \emptyset

,

E_{3}^{-} (3) = \emptyset

,

E_{1}^{+} (3) = \emptyset

,

E_{2}^{+} (3) = {5}

,

E_{3}^{+} (3) = {7}

,

E_{1}^{-} (4) = {1, 3}

,

E_{2}^{-} (4) = {5}

,

E_{3}^{-} (4) = {7}

,

E_{1}^{+} (4) = \emptyset

,

E_{2}^{+} (4) = \emptyset

,

E_{3}^{+} (4) = \emptyset

,

{\underline{d}}^{-} (1) = 0, {\bar{d}}^{-} (1) = 0, {\underline{d}}^{+} (1) = 2, {\bar{d}}^{+} (1) = 2,

{\underline{d}}^{-} (2) = 1, {\bar{d}}^{-} (2) = 2, {\underline{d}}^{+} (2) = 1, {\bar{d}}^{+} (2) = 2,

{\underline{d}}^{-} (3) = 1, {\bar{d}}^{-} (3) = 1, {\underline{d}}^{+} (3) = 1, {\bar{d}}^{+} (3) = 2,

{\underline{d}}^{-} (4) = 2, {\bar{d}}^{-} (4) = 3, {\underline{d}}^{+} (4) = 0, {\bar{d}}^{+} (4) = 0

,

variables:

θ \in [- n^{*}, n^{*}]

: The topology-parameter

θ (H)

for rank 2;

constraints:

\begin{matrix} a (2) + clr (2) & \geq 1, \end{matrix}

(A20)

\begin{matrix} a (3) + clr (3) + clr (4) & \geq 1, \end{matrix}

(A21)

\begin{matrix} clr (4) & \geq clr (5), \end{matrix}

(A22)

\begin{matrix} clr (3) & \leq clr (2) + 1, \end{matrix}

(A23)

\begin{matrix} clr (3) & \leq clr (1) + 1 + n^{*} (3 - \deg^{co -} (4)), \end{matrix}

(A24)

\begin{matrix} - θ & \leq 1 + clr (2) + n^{*} (2 - \deg^{co +} (3)), \end{matrix}

(A25)

\begin{matrix} - θ & \geq 1 + clr (2) - n^{*} (2 - \deg^{co +} (3)), \end{matrix}

(A26)

\begin{matrix} θ & \leq n^{*} (4 - \deg^{co +} (2) - \deg^{co -} (2)), \end{matrix}

(A27)

\begin{matrix} θ & \geq - n^{*} (4 - \deg^{co +} (2) - \deg^{co -} (2)), \end{matrix}

(A28)

\begin{matrix} θ & \leq 1 + clr (3) + n^{*} (3 - \deg^{co -} (4)), \end{matrix}

(A29)

\begin{matrix} θ & \geq 1 + clr (3) - n^{*} (3 - \deg^{co -} (4)) . \end{matrix}

(A30)
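Constraints (A25) to (A30) follow the usual big-M pattern: each pair pins θ to a specific value when a degree variable reaches its upper bound, and is relaxed by a multiple of n^{*} otherwise. The sketch below computes the interval of θ that the six constraints leave open for given degrees and color counts. It is an illustration of the pattern only; the function name and the dictionary-based inputs are our own choices.

```python
def theta_interval(n_star, clr, deg_out, deg_in):
    """Interval of theta allowed by (A25)-(A30) and theta in [-n*, n*].

    clr maps a color c to clr(c); deg_out and deg_in map a core vertex
    index s to its core out-degree and in-degree. Returns (lo, hi); the
    interval is empty if lo > hi.
    """
    m3 = n_star * (2 - deg_out[3])               # slack of (A25), (A26)
    m2 = n_star * (4 - deg_out[2] - deg_in[2])   # slack of (A27), (A28)
    m4 = n_star * (3 - deg_in[4])                # slack of (A29), (A30)
    lo = max(-n_star, -(1 + clr[2]) - m3, -m2, 1 + clr[3] - m4)
    hi = min(n_star, -(1 + clr[2]) + m3, m2, 1 + clr[3] + m4)
    return lo, hi
```

When a slack is zero, the corresponding pair of inequalities collapses to an equality; for example, an out-degree of 2 at vertex 3 forces θ = - (1 + clr (2)).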

Appendix A.4. Selecting A Subgraph

We prepare a binary variable

u (s, i)

(resp.,

v (t, i)

) for each vertex

u_{s, i}

in tree

S_{s}

(resp.,

v_{t, i}

in tree

T_{t}

). We include constraints so that the path

(v_{1, 1}, v_{2, 1}, \dots, v_{t^{*}, 1})

is partitioned into subpaths

P_{c}

,

c \in [1, c^{*}]

, where possibly some

P_{c}

is empty, and the resulting subgraph H becomes a connected rank-2 graph with

n (H) = n^{*}

,

cs (H) = {cs}^{*}

,

ch (H) = {ch}^{*}

and

θ (H) = θ^{*}

.

constants:

Integers

d_{max} \in {3, 4}

,

{ch}^{*} \geq 0

;

Prepare the set

Cld (i)

of the indices of children of a vertex

v_{i}

, the index

prt (i)

of the parent of a non-root vertex

v_{i}

, and

the set

Dst (h)

of indices i such that the height of a vertex

v_{i}

is h

in the rooted tree

T (2, d_{max} - 1, {ch}^{*})

;

variables:

u (s, i) \in {0, 1}

,

s \in [1, s^{*}]

,

i \in [1, n_{tree}]

:

u (s, i)

represents vertex

u_{s, i}

(

u (s, i) = 1

⇔ vertex

u_{s, i}

is used in H and edge

e_{s, i}^{'}

(i \geq 2)

is used in H);

v (t, i) \in {0, 1}

,

t \in [1, t^{*}]

,

i \in [1, n_{tree}]

:

v (t, i)

represents vertex

v_{t, i}

(

v (t, i) = 1

⇔ vertex

v_{t, i}

is used in H and edge

e_{t, i}

(i \geq 2)

is used in H);

e (t) \in {0, 1}

,

t \in [1, t^{*} + 1]

:

e (t)

represents edge

e_{t} = v_{t - 1, 1} v_{t, 1}

,

where

e_{1, 1}

and

e_{t^{*} + 1, 1}

are fictitious edges (

e (t) = 1

⇔ edge

e_{t}

is used in H);

constraints:

\begin{matrix} u (s, 1) = 1, & s \in [1, s^{*}], \end{matrix}

(A31)

\begin{matrix} d_{max} \cdot u (s, i) \geq \sum_{j \in Cld (i)} u (s, j), & s \in [1, s^{*}], i \in [2, n_{in}], \end{matrix}

(A32)

\begin{matrix} v (t, 1) = 1, & t \in [1, t^{*}], \end{matrix}

(A33)

\begin{matrix} d_{max} \cdot v (t, i) \geq \sum_{j \in Cld (i)} v (t, j), & t \in [1, t^{*}], i \in [2, n_{in}], \end{matrix}

(A34)

\begin{matrix} \sum_{s \in [1, s^{*}], i \in [1, n_{tree}]} u (s, i) + \sum_{t \in [1, t^{*}], i \in [1, n_{tree}]} v (t, i) = n^{*}, \end{matrix}

(A35)

\begin{matrix} \sum_{s \in [1, s^{*}], i \in Dst ({ch}^{*})} u (s, i) + \sum_{t \in [1, t^{*}], i \in Dst ({ch}^{*})} v (t, i) \geq 1, \end{matrix}

(A36)

\begin{matrix} e (1) = e (t^{*} + 1) = 0, \end{matrix}

(A37)

\begin{matrix} e (t + 1) + \sum_{s \in [1, s^{*}]} e (t, s) = 1, & t \in [1, t^{*}], \end{matrix}

(A38)

\begin{matrix} e (t) + \sum_{s \in [1, s^{*}]} e (s, t) = 1, & t \in [1, t^{*}], \end{matrix}

(A39)

\begin{matrix} c^{*} \geq χ (1) \geq χ (2) \geq \dots \geq χ (t^{*}) \geq 1, \end{matrix}

(A40)

\begin{matrix} e (t + 1) \geq 1 + χ (t + 1) - χ (t), & t \in [1, t^{*} - 1], \end{matrix}

(A41)

\begin{matrix} c^{*} \cdot (1 - e (t + 1)) \geq χ (t) - χ (t + 1), & t \in [1, t^{*} - 1] . \end{matrix}

(A42)
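Constraints (A40) to (A42) say that colors are non-increasing along the path of roots and that the edge between two consecutive roots is used exactly when both roots have the same color. The sketch below derives the edge variables from a coloring and checks (A41) and (A42) for a candidate assignment; it is an illustrative checker with names of our own choosing, not part of the MILP itself.

```python
def path_edges_from_colors(chi, c_star):
    """Return e(1..t*+1) forced by (A37), (A41), (A42) for colors chi(1..t*).

    Edge e(t+1) joins v_{t,1} and v_{t+1,1}; it is used exactly when both
    end-vertices carry the same color. Raises if (A40) is violated.
    """
    if any(not 1 <= c <= c_star for c in chi) or \
            any(chi[t] < chi[t + 1] for t in range(len(chi) - 1)):
        raise ValueError("(A40) requires c* >= chi(1) >= ... >= chi(t*) >= 1")
    e = [0]  # e(1) = 0 by (A37)
    e += [1 if chi[t] == chi[t + 1] else 0 for t in range(len(chi) - 1)]
    e.append(0)  # e(t* + 1) = 0 by (A37)
    return e


def satisfies_a41_a42(chi, e, c_star):
    """Check (A41) and (A42); list entry e[k] stores e(k + 1)."""
    return all(
        e[t + 1] >= 1 + chi[t + 1] - chi[t]
        and c_star * (1 - e[t + 1]) >= chi[t] - chi[t + 1]
        for t in range(len(chi) - 1)
    )
```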

Appendix A.5. Assigning Multiplicity

We prepare an integer variable

\tilde{β} (e)

or

\hat{β} (e)

for each edge e in the

(t^{*}, {ch}^{*}, d_{max})

-tree-extension of the scheme graph to denote the multiplicity of e in a selected graph H and include necessary constraints for the variables to satisfy in H.

variables:

\tilde{β} (i) \in [0, 3]

,

i \in E_{1} \cup E_{3}

:

\tilde{β} (i)

represents the multiplicity of edge

a_{i}

,

where

\tilde{β} (i) = 0

if edge

a_{i}

is not in H;

\tilde{β} (p, i) \in [0, 3]

,

p \in [1, {cs}^{*}]

,

i \in [2, n_{tree}]

:

\tilde{β} (p, i)

with

p \leq s^{*}

(resp.,

p > s^{*}

) represents

the multiplicity of edge

e_{p, i}^{'}

(resp.,

e_{p - s^{*}, i}

);

\tilde{β} (t, 1) \in [0, 3]

,

t \in [1, t^{*} + 1]

:

\tilde{β} (t, 1)

represents the multiplicity of edge

e_{t}

;

\hat{β} (s, t) \in [0, 3]

,

s \in [1, s^{*}]

,

t \in [1, t^{*}]

:

\hat{β} (s, t)

represents the multiplicity of edge

u_{s, 1} v_{t, 1}

;

constraints:

\begin{matrix} a (i) = 1, & i \in E_{3}, \end{matrix}

(A43)

\begin{matrix} a (i) \leq \tilde{β} (i) \leq 3 a (i), & i \in E_{1} \cup E_{3}, \end{matrix}

(A44)

\begin{matrix} u (s, i) \leq \tilde{β} (s, i) \leq 3 u (s, i), & s \in [1, s^{*}], i \in [2, n_{tree}], \end{matrix}

(A45)

\begin{matrix} v (t, i) \leq \tilde{β} (s^{*} + t, i) \leq 3 v (t, i), & t \in [1, t^{*}], i \in [2, n_{tree}], \end{matrix}

(A46)

\begin{matrix} e (t) \leq \tilde{β} (t, 1) \leq 3 e (t), & t \in [1, t^{*} + 1], \end{matrix}

(A47)

\begin{matrix} e (s, t) + e (t, s) \leq \hat{β} (s, t) \leq 3 e (s, t) + 3 e (t, s), & s \in [1, s^{*}], t \in [1, t^{*}] . \end{matrix}

(A48)
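Constraints (A44) to (A48) share one pattern: a binary selection variable forces the multiplicity into [1, 3] when the edge is selected and to 0 otherwise. The following is a minimal Python sketch of that coupling; the function name `multiplicity_consistent` is a hypothetical helper introduced only for illustration and is not part of the formulation.

```python
def multiplicity_consistent(selected: int, beta: int, max_mult: int = 3) -> bool:
    """Check the coupling  selected <= beta <= max_mult * selected.

    `selected` plays the role of a binary variable such as a(i), u(s, i),
    v(t, i) or e(t); `beta` plays the role of the matching multiplicity.
    """
    if selected not in (0, 1):
        raise ValueError("selected must be 0 or 1")
    return selected <= beta <= max_mult * selected


# Enumerate the multiplicities in [0, 3] that the coupling allows
# for an unselected edge (0) and for a selected edge (1).
feasible = {
    sel: [b for b in range(0, 4) if multiplicity_consistent(sel, b)]
    for sel in (0, 1)
}
```

Under this reading, an unselected edge admits only multiplicity 0, and a selected edge admits multiplicities 1, 2 and 3. For (A48) the role of `selected` is played by the sum $e (s, t) + e (t, s)$, assuming that sum is at most 1.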

Appendix A.6. Assigning Chemical Elements and Valence Condition

We include constraints so that each vertex v in a selected graph H satisfies the valence condition; i.e., $\sum_{u v \in E (H)} β (u v) \leq val (α (v))$. With these constraints, a rank-2 chemical graph $G = (H, α, β)$ on a selected subgraph H will be constructed.

constants:

A set $Λ \cup {ϵ}$ of chemical elements, where $ϵ$ denotes null;

A coding $[a]$, $a \in Λ \cup {ϵ}$ such that $[ϵ] = 0$; $[a] \geq 1$, $a \in Λ$; and $[a] \neq [b]$ if $a \neq b$;

Let $[Λ]$ and $[Λ \cup {ϵ}]$ denote ${[a] ∣ a \in Λ}$ and ${[a] ∣ a \in Λ \cup {ϵ}}$, respectively;

A valence function: $val : Λ \to [1, 4]$;

variables:

$\tilde{α} (p, i) \in [Λ \cup {ϵ}]$, $p \in [1, {cs}^{*}]$, $i \in [1, n_{tree}]$: $\tilde{α} (p, i)$ with $p \leq s^{*}$ (resp., $p > s^{*}$) represents $α (u_{p, i})$ (resp., $α (v_{p - s^{*}, i})$);

$δ_{α} (p, i, a) \in {0, 1}$, $p \in [1, {cs}^{*}]$, $i \in [1, n_{tree}]$, $a \in Λ \cup {ϵ}$: $δ_{α} (p, i, a) = 1$ ⇔ $α (u_{p, i}) = a$ for $p \leq s^{*}$ and $α (v_{p - s^{*}, i}) = a$ for $p > s^{*}$;

$δ_{\tilde{β}} (i, k) \in {0, 1}$, $i \in E_{1} \cup E_{3}$, $k \in [0, 3]$: $δ_{\tilde{β}} (i, k) = 1$ ⇔ the multiplicity of edge $a_{i}$ in H is k;

$δ_{\tilde{β}} (p, i, k) \in {0, 1}$, $p \in [1, {cs}^{*}]$, $i \in [2, n_{tree}]$, $k \in [0, 3]$: $δ_{\tilde{β}} (p, i, k) = 1$ ⇔ the multiplicity of edge $e_{p, i}^{'}$, $p \leq s^{*}$ (or $e_{p - s^{*}, i}$, $p > s^{*}$) in H is k;

$δ_{\tilde{β}} (t, 1, k) \in {0, 1}$, $t \in [1, t^{*} + 1]$, $k \in [0, 3]$: $δ_{\tilde{β}} (t, 1, k) = 1$ ⇔ the multiplicity of edge $e_{t}$ in H is k;

$δ_{\hat{β}} (s, t, k) \in {0, 1}$, $s \in [1, s^{*}]$, $t \in [1, t^{*}]$, $k \in [0, 3]$: $δ_{\hat{β}} (s, t, k) = 1$ ⇔ the multiplicity of edge $u_{s, 1} v_{t, 1}$ in H is k;

constraints:

\begin{matrix} \sum_{a \in Λ \cup {ϵ}} δ_{α} (p, i, a) = 1, & p \in [1, {cs}^{*}], i \in [1, n_{tree}], \end{matrix}

(A49)

\begin{matrix} \sum_{a \in Λ \cup {ϵ}} [a] \cdot δ_{α} (p, i, a) = \tilde{α} (p, i), & p \in [1, {cs}^{*}], i \in [1, n_{tree}], \end{matrix}

(A50)

\begin{matrix} \sum_{k \in [0, 3]} δ_{\tilde{β}} (i, k) = 1, & i \in E_{1} \cup E_{3}, \end{matrix}

(A51)

\begin{matrix} \sum_{k \in [1, 3]} k \cdot δ_{\tilde{β}} (i, k) = \tilde{β} (i), & i \in E_{1} \cup E_{3}, \end{matrix}

(A52)

\begin{matrix} \sum_{k \in [0, 3]} δ_{\tilde{β}} (p, i, k) = 1, & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A53)

\begin{matrix} \sum_{k \in [1, 3]} k \cdot δ_{\tilde{β}} (p, i, k) = \tilde{β} (p, i), & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A54)

\begin{matrix} \sum_{k \in [0, 3]} δ_{\tilde{β}} (t, 1, k) = 1, & t \in [1, t^{*} + 1], \end{matrix}

(A55)

\begin{matrix} \sum_{k \in [1, 3]} k \cdot δ_{\tilde{β}} (t, 1, k) = \tilde{β} (t, 1), & t \in [1, t^{*} + 1], \end{matrix}

(A56)

\begin{matrix} \sum_{k \in [0, 3]} δ_{\hat{β}} (s, t, k) = 1, & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A57)

\begin{matrix} \sum_{k \in [0, 3]} k δ_{\hat{β}} (s, t, k) = \hat{β} (s, t), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A58)

\begin{matrix} \sum_{i \in E_{1, 3} (s)} \tilde{β} (i) + \sum_{t \in [1, t^{*}]} \hat{β} (s, t) + \sum_{j \in Cld (1)} \tilde{β} (s, j) \leq \sum_{a \in Λ} val (a) \cdot δ_{α} (s, 1, a), & s \in [1, s^{*}], \end{matrix}

(A59)

\begin{matrix} \sum_{s \in [1, s^{*}]} \hat{β} (s, t) + \tilde{β} (t, 1) + \tilde{β} (t + 1, 1) + \sum_{j \in Cld (1)} \tilde{β} (s^{*} + t, j) \leq \sum_{a \in Λ} val (a) \cdot δ_{α} (s^{*} + t, 1, a), & t \in [1, t^{*}], \end{matrix}

(A60)

\begin{matrix} \tilde{β} (p, i) + \sum_{j \in Cld (i)} \tilde{β} (p, j) \leq \sum_{a \in Λ} val (a) \cdot δ_{α} (p, i, a), & p \in [1, {cs}^{*}], i \in [2, n_{tree}] . \end{matrix}

(A61)
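The one-hot encoding in (A49) and (A50) and the valence bounds in (A59) to (A61) can be illustrated on a small explicit graph. The Python sketch below is not the MILP itself. The coding `CODE` and the valence table `VAL` are assumptions chosen for illustration (the text only requires $[ϵ] = 0$, distinct codes $[a] \geq 1$, and $val (a) \in [1, 4]$), and the helper names are hypothetical.

```python
# Assumed coding [a] and valences, for illustration only.
CODE = {"eps": 0, "C": 1, "N": 2, "O": 3}
VAL = {"C": 4, "N": 3, "O": 2}


def one_hot(element):
    """Indicators delta_alpha(., ., a) for one vertex; exactly one is 1, as in (A49)."""
    return {a: int(a == element) for a in CODE}


def decode(indicators):
    """Recover the code as sum_a [a] * delta_alpha(., ., a), as in (A50)."""
    return sum(CODE[a] * d for a, d in indicators.items())


def valence_ok(alpha, beta):
    """Check that the multiplicities at every vertex v sum to at most val(alpha(v)).

    alpha: dict vertex -> element in VAL (null vertices are assumed absent)
    beta:  dict (u, v) -> bond multiplicity of edge uv
    """
    load = {v: 0 for v in alpha}
    for (u, v), mult in beta.items():
        load[u] += mult
        load[v] += mult
    return all(load[v] <= VAL[alpha[v]] for v in alpha)
```

For example, a C=O double bond passes the check, whereas a triple bond between C and O exceeds the assumed valence 2 of oxygen and fails.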

Appendix A.7. Descriptors for Mass, the Numbers of Elements and Bonds

We include constraints to compute descriptors $\bar{ms} (G)$, ${ce}_{a}^{co} (G)$, ${ce}_{a}^{nc} (G)$ ($a \in Λ$), $b_{k} (G)$ ($k \in [2, 3]$) and $n_{H} (G)$ according to the definitions in Section 2.1.2.

constants:

A function ${mass}^{*} : Λ \to Z$; let $mass (a)$ denote the observed mass of a chemical element $a \in Λ$, and define ${mass}^{*} (a) = ⌊ 10 \cdot mass (a) ⌋$;

variables:

${ce}^{co} (a) \in [0, n^{*}]$, $a \in Λ$;

${ce}^{nc} (a) \in [0, n^{*}]$, $a \in Λ$;

$Mass \in Z$;

$b^{co} (k) \in [0, 2 n^{*}]$, $k \in [1, 3]$;

$b^{nc} (k) \in [0, 2 n^{*}]$, $k \in [1, 3]$;

$n_{H} \in [0, 4 n^{*}]$: the number of hydrogen atoms to be included in G;

constraints:

\begin{matrix} \sum_{p \in [1, {cs}^{*}]} δ_{α} (p, 1, a) = {ce}^{co} (a), & a \in Λ, \end{matrix}

(A62)

\begin{matrix} \sum_{p \in [1, {cs}^{*}], i \in [2, n_{tree}]} δ_{α} (p, i, a) = {ce}^{nc} (a), & a \in Λ, \end{matrix}

(A63)

\begin{matrix} \sum_{a \in Λ} {mass}^{*} (a) ({ce}^{co} (a) + {ce}^{nc} (a)) = Mass, \end{matrix}

(A64)

\begin{matrix} \sum_{i \in E_{1} \cup E_{3}} δ_{\tilde{β}} (i, k) + \sum_{s \in [1, s^{*}], t \in [1, t^{*}]} δ_{\hat{β}} (s, t, k) + \sum_{t \in [2, t^{*}]} δ_{\tilde{β}} (t, 1, k) = b^{co} (k), & k \in [1, 3], \end{matrix}

(A65)

\begin{matrix} \sum_{p \in [1, {cs}^{*}], i \in [2, n_{tree}]} δ_{\tilde{β}} (p, i, k) = b^{nc} (k), & k \in [1, 3], \end{matrix}

(A66)

\begin{matrix} \sum_{a \in Λ} val (a) ({ce}^{co} (a) + {ce}^{nc} (a)) - 2 (n^{*} + 1 + b^{co} (2) + b^{nc} (2) + 2 b^{co} (3) + 2 b^{nc} (3)) = n_{H} . \end{matrix}

(A67)
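The hydrogen count in (A67) can be checked by hand: a connected rank-2 graph on $n^{*}$ vertices has $n^{*} + 1$ edges, so the multiplicities sum to $n^{*} + 1$ plus one extra unit per double bond and two per triple bond, and each unit uses two valence slots. The Python sketch below evaluates (A64) and (A67) for given counts. The function names, the valence table and the atomic mass in the usage note are assumptions for illustration, not values fixed by the text.

```python
import math


def mass_star(mass: float) -> int:
    """Integer mass  floor(10 * mass(a))  as defined for mass^*."""
    return math.floor(10 * mass)


def hydrogen_count(n: int, element_counts: dict, b2: int, b3: int, val: dict) -> int:
    """Evaluate (A67) for a rank-2 graph with n non-hydrogen atoms.

    element_counts: element -> number of atoms (core plus non-core)
    b2, b3:         total numbers of double and triple bonds (core plus non-core)
    val:            element -> valence
    """
    total_valence = sum(val[a] * c for a, c in element_counts.items())
    return total_valence - 2 * (n + 1 + b2 + 2 * b3)


def total_mass(element_counts: dict, masses: dict) -> int:
    """Evaluate (A64): sum of mass^*(a) times the number of atoms of element a."""
    return sum(mass_star(masses[a]) * c for a, c in element_counts.items())
```

As a sanity check, naphthalene (10 carbon atoms, 11 bonds of which 5 are double in a Kekulé structure) gives `hydrogen_count(10, {"C": 10}, 5, 0, {"C": 4})`, which evaluates to 40 - 2 * 16 = 8, matching the formula C10H8.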

Appendix A.8. Descriptor for the Number of Specified Degree

We include constraints to compute descriptors ${dg}_{i} (G)$ ($i \in [1, 4]$) according to the definitions in Section 2.1.2. We also add constraints so that the maximum degree of a non-core vertex in H is at most 3 (resp., equal to 4) when $d_{max} = 3$ (resp., $d_{max} = 4$).

variables:

$\deg (p, i) \in [0, 4]$, $p \in [1, {cs}^{*}]$, $i \in [1, n_{tree}]$: $\deg (p, i)$ represents $\deg_{H} (u_{p, i})$ for $p \leq s^{*}$ or $\deg_{H} (v_{p - s^{*}, i})$ for $p > s^{*}$;

$δ_{\deg} (p, i, d) \in {0, 1}$, $p \in [1, {cs}^{*}]$, $i \in [1, n_{tree}]$, $d \in [0, 4]$: $δ_{\deg} (p, i, d) = 1$ ⇔ $\deg (p, i) = d$;

$dg (d) \in [0, n^{*}]$, $d \in [1, 4]$;

constraints:

\begin{matrix} \sum_{i \in E_{1, 3} (s)} a (i) + \sum_{t \in [1, t^{*}]} (e (s, t) + e (t, s)) + \sum_{j \in Cld (1)} u (s, j) = \deg (s, 1), & s \in [1, s^{*}], \end{matrix}

(A68)

\begin{matrix} u (s, i) + \sum_{j \in Cld (i)} u (s, j) = \deg (s, i), & s \in [1, s^{*}], i \in [2, n_{tree}], \end{matrix}

(A69)

\begin{matrix} 2 + \sum_{j \in Cld (1)} v (t, j) = \deg (s^{*} + t, 1), & t \in [1, t^{*}], \end{matrix}

(A70)

\begin{matrix} v (t, i) + \sum_{j \in Cld (i)} v (t, j) = \deg (s^{*} + t, i), & t \in [1, t^{*}], i \in [2, n_{tree}], \end{matrix}

(A71)

\begin{matrix} \sum_{d \in [0, 4]} δ_{\deg} (p, i, d) = 1, & p \in [1, {cs}^{*}], i \in [1, n_{tree}], \end{matrix}

(A72)

\begin{matrix} \sum_{d \in [1, 4]} d \cdot δ_{\deg} (p, i, d) = \deg (p, i), & p \in [1, {cs}^{*}], i \in [1, n_{tree}], \end{matrix}

(A73)

\begin{matrix} \sum_{p \in [1, {cs}^{*}], i \in [1, n_{tree}]} δ_{\deg} (p, i, d) = dg (d), & d \in [1, 4], \end{matrix}

(A74)

\begin{matrix} \sum_{p \in [1, {cs}^{*}], i \in [2, n_{tree}]} δ_{\deg} (p, i, 4) \geq 1 \ (\text{resp., } = 0) & \text{when } d_{max} = 4 \ (\text{resp., } = 3) . \end{matrix}

(A75)
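Constraints (A72) to (A74) turn each vertex degree into a one-hot indicator and then count vertices per degree. On an explicit graph the same descriptor is a degree histogram. The following Python sketch assumes the selected graph H is given as an edge list; `degree_descriptor` is a hypothetical name used only here.

```python
from collections import Counter


def degree_descriptor(edges, max_degree=4):
    """Return dg(d) for d = 1..max_degree: the number of vertices of degree d.

    edges: iterable of (u, v) pairs of the selected graph H, each edge counted
    once regardless of its bond multiplicity.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    histogram = Counter(deg.values())
    return {d: histogram.get(d, 0) for d in range(1, max_degree + 1)}


# A star with centre 0 and three leaves: three vertices of degree 1, one of degree 3.
star = [(0, 1), (0, 2), (0, 3)]
dg = degree_descriptor(star)
```

The condition in (A75) then corresponds to checking whether any non-core vertex has degree 4.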

Appendix A.9. Descriptor for the Number of Adjacency-Configurations

We include constraints to compute descriptors ${ac}_{γ}^{co} (G)$ and ${ac}_{γ}^{nc} (G)$ ($γ = (a, b, k) \in Γ$) according to the definitions in Section 2.1.2.

constants:

A set $Γ = Γ_{<} \cup Γ_{=} \cup Γ_{>}$ of proper tuples $(a, b, k) \in Λ \times Λ \times [1, 3]$;

The set $Γ_{0} = {(a, b, 0) ∣ a, b \in Λ \cup {ϵ}}$;

variables:

$δ_{τ} (i, γ) \in {0, 1}$, $i \in E_{1} \cup E_{3}$, $γ \in Γ \cup Γ_{0}$: $δ_{τ} (i, γ) = 1$ ⇔ edge $a_{i}$ is assigned tuple $γ$; i.e., $γ = (\tilde{α} (tail (i), 1), \tilde{α} (head (i), 1), \tilde{β} (i))$;

$δ_{τ} (t, 1, γ) \in {0, 1}$, $t \in [2, t^{*}]$, $γ \in Γ \cup Γ_{0}$: $δ_{τ} (t, 1, γ) = 1$ ⇔ edge $e_{t}$ is assigned tuple $γ$; i.e., $γ = (\tilde{α} (s^{*} + t - 1, 1), \tilde{α} (s^{*} + t, 1), \tilde{β} (t, 1))$;

$δ_{τ} (p, i, γ) \in {0, 1}$, $p \in [1, {cs}^{*}]$, $i \in [2, n_{tree}]$, $γ \in Γ \cup Γ_{0}$: $δ_{τ} (p, i, γ) = 1$ ⇔ edge $e_{p, i}^{'}$, $p \leq s^{*}$ (or $e_{p - s^{*}, i}$, $p > s^{*}$) is assigned tuple $γ$; i.e., $γ = (\tilde{α} (p, prt (i)), \tilde{α} (p, i), \tilde{β} (p, i))$;

$δ_{\hat{τ}} (s, t, γ) \in {0, 1}$, $s \in [1, s^{*}]$, $t \in [1, t^{*}]$, $γ \in Γ \cup Γ_{0}$: $δ_{\hat{τ}} (s, t, γ) = 1$ ⇔ edge $u_{s, 1} v_{t, 1}$ is assigned tuple $γ$; i.e., $γ = (\tilde{α} (s, 1), \tilde{α} (s^{*} + t, 1), \hat{β} (s, t))$;

${ac}^{co} (γ) \in [0, n^{*}]$, $γ \in Γ_{<} \cup Γ_{=}$;

${ac}^{nc} (γ) \in [0, n^{*}]$, $γ \in Γ_{<} \cup Γ_{=}$;

constraints:

\begin{matrix} \sum_{γ \in Γ \cup Γ_{0}} δ_{τ} (i, γ) = 1, & i \in E_{1} \cup E_{3}, \end{matrix}

(A76)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [a] δ_{τ} (i, (a, b, k)) = \tilde{α} (tail (i), 1), & i \in E_{1} \cup E_{3}, \end{matrix}

(A77)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [b] δ_{τ} (i, (a, b, k)) = \tilde{α} (head (i), 1), & i \in E_{1} \cup E_{3}, \end{matrix}

(A78)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} k \cdot δ_{τ} (i, (a, b, k)) = \tilde{β} (i), & i \in E_{1} \cup E_{3}, \end{matrix}

(A79)

\begin{matrix} \sum_{γ \in Γ \cup Γ_{0}} δ_{τ} (t, 1, γ) = 1, & t \in [2, t^{*}], \end{matrix}

(A80)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [a] δ_{τ} (t, 1, (a, b, k)) = \tilde{α} (s^{*} + t - 1, 1), & t \in [2, t^{*}], \end{matrix}

(A81)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [b] δ_{τ} (t, 1, (a, b, k)) = \tilde{α} (s^{*} + t, 1), & t \in [2, t^{*}], \end{matrix}

(A82)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} k \cdot δ_{τ} (t, 1, (a, b, k)) = \tilde{β} (t, 1), & t \in [2, t^{*}], \end{matrix}

(A83)

\begin{matrix} \sum_{γ \in Γ \cup Γ_{0}} δ_{τ} (p, i, γ) = 1, & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A84)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [a] δ_{τ} (p, i, (a, b, k)) = \tilde{α} (p, prt (i)), & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A85)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [b] δ_{τ} (p, i, (a, b, k)) = \tilde{α} (p, i), & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A86)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} k \cdot δ_{τ} (p, i, (a, b, k)) = \tilde{β} (p, i), & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A87)

\begin{matrix} \sum_{γ \in Γ \cup Γ_{0}} δ_{\hat{τ}} (s, t, γ) = 1, & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A88)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [a] δ_{\hat{τ}} (s, t, (a, b, k)) = \tilde{α} (s, 1), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A89)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} [b] δ_{\hat{τ}} (s, t, (a, b, k)) = \tilde{α} (s^{*} + t, 1), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A90)

\begin{matrix} \sum_{(a, b, k) \in Γ \cup Γ_{0}} k \cdot δ_{\hat{τ}} (s, t, (a, b, k)) = \hat{β} (s, t), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A91)

\begin{matrix} \sum_{i \in E_{1} \cup E_{3}} (δ_{τ} (i, γ) + δ_{τ} (i, \bar{γ})) + \sum_{s \in [1, s^{*}], t \in [1, t^{*}]} (δ_{\hat{τ}} (s, t, γ) + δ_{\hat{τ}} (s, t, \bar{γ})) \\ + \sum_{t \in [2, t^{*}]} (δ_{τ} (t, 1, γ) + δ_{τ} (t, 1, \bar{γ})) = {ac}^{co} (γ), & γ \in Γ_{<}, \end{matrix}

(A92), (A93)

\begin{matrix} \sum_{i \in E_{1} \cup E_{3}} δ_{τ} (i, γ) + \sum_{s \in [1, s^{*}], t \in [1, t^{*}]} δ_{\hat{τ}} (s, t, γ) + \sum_{t \in [2, t^{*}]} δ_{τ} (t, 1, γ) = {ac}^{co} (γ), & γ \in Γ_{=}, \end{matrix}

(A94)

\begin{matrix} \sum_{p \in [1, {cs}^{*}], i \in [2, n_{tree}]} (δ_{τ} (p, i, γ) + δ_{τ} (p, i, \bar{γ})) = {ac}^{nc} (γ), & γ \in Γ_{<}, \end{matrix}

(A95)
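The counting constraints for ${ac}^{co}$ and ${ac}^{nc}$ add the indicator of a tuple $γ$ and that of its reverse $\bar{γ}$, so an edge contributes to the same count whichever way it is oriented. The Python sketch below reproduces this on an explicit bond list. It assumes that $Γ_{<}$ consists of tuples $(a, b, k)$ whose first element precedes the second in a fixed order on the elements, and that $\bar{γ} = (b, a, k)$; both are assumptions here, since the definitions live in Section 2.1.2. The order `CODE` and the function name are hypothetical.

```python
from collections import Counter

# Assumed total order on the elements, used to pick one orientation per bond.
CODE = {"C": 1, "N": 2, "O": 3}


def adjacency_configurations(bonds):
    """Count bonds per tuple (a, b, k), identifying (a, b, k) with (b, a, k).

    bonds: iterable of (element_u, element_v, multiplicity) with multiplicity
    in 1..3. Each bond is stored under the orientation whose first element
    comes first in CODE (or under (a, a, k) when both ends are equal).
    """
    counts = Counter()
    for a, b, k in bonds:
        if CODE[a] > CODE[b]:
            a, b = b, a
        counts[(a, b, k)] += 1
    return counts


bonds = [("C", "O", 2), ("O", "C", 2), ("C", "C", 1)]
ac = adjacency_configurations(bonds)
```

Here the two C=O bonds are listed with opposite orientations but land in the same count, mirroring the sum of $δ_{τ} (\cdot, γ)$ and $δ_{τ} (\cdot, \bar{γ})$.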

Appendix A.10. Descriptor for 1-Path Connectivity

We include constraints to compute descriptor $κ_{1} (G)$ according to the definition.

variables:

A real variable $κ_{1} \geq 0$;

$δ_{dd} (i, d, d^{'}, μ) \in {0, 1}$, $i \in E_{1} \cup E_{3}$, $d, d^{'} \in [0, 4]$, $μ \in {0, 1}$: $δ_{dd} (i, d, d^{'}, μ) = 1$ ⇔ $\deg_{H} (u_{tail (i)}) = d$ and $\deg_{H} (u_{head (i)}) = d^{'}$, where $a_{i}$ is in H if and only if $μ = 1$;

$δ_{dd} (t, 1, d, d^{'}, μ) \in {0, 1}$, $t \in [2, t^{*}]$, $d, d^{'} \in [0, 4]$, $μ \in {0, 1}$: $δ_{dd} (t, 1, d, d^{'}, μ) = 1$ ⇔ $\deg_{H} (v_{t - 1, 1}) = d$ and $\deg_{H} (v_{t, 1}) = d^{'}$, where $e_{t}$ is in H if and only if $μ = 1$;

$δ_{dd} (p, i, d, d^{'}, μ) \in {0, 1}$, $p \in [1, {cs}^{*}]$, $i \in [2, n_{tree}]$, $d, d^{'} \in [0, 4]$, $μ \in {0, 1}$: $δ_{dd} (p, i, d, d^{'}, μ) = 1$ ⇔ $\deg_{H} (u_{p, prt (i)}) = d$ and $\deg_{H} (u_{p, i}) = d^{'}$ for $p \leq s^{*}$ (or $\deg_{H} (v_{p - s^{*}, prt (i)}) = d$ and $\deg_{H} (v_{p - s^{*}, i}) = d^{'}$ for $p > s^{*}$), where edge $e_{p, i}^{'}$ or $e_{p - s^{*}, i}$ is in H if and only if $μ = 1$;

$δ_{\hat{dd}} (s, t, d, d^{'}, μ) \in {0, 1}$, $s \in [1, s^{*}]$, $t \in [1, t^{*}]$, $d, d^{'} \in [0, 4]$, $μ \in {0, 1}$: $δ_{\hat{dd}} (s, t, d, d^{'}, μ) = 1$ ⇔ $\deg_{H} (u_{s, 1}) = d$ and $\deg_{H} (v_{t, 1}) = d^{'}$, where $u_{s, 1} v_{t, 1}$ is in H if and only if $μ = 1$;

constraints:

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} δ_{dd} (i, d, d^{'}, μ) = 1, & i \in E_{1} \cup E_{3}, \end{matrix}

(A96)

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} μ \cdot δ_{dd} (i, d, d^{'}, μ) = a (i), & i \in E_{1} \cup E_{3}, \end{matrix}

(A97)

\begin{matrix} \sum_{d \in [1, 4], d^{'} \in [0, 4], μ \in {0, 1}} d \cdot δ_{dd} (i, d, d^{'}, μ) = \deg (tail (i), 1), & i \in E_{1} \cup E_{3}, \end{matrix}

(A98)

\begin{matrix} \sum_{d \in [0, 4], d^{'} \in [1, 4], μ \in {0, 1}} d^{'} \cdot δ_{dd} (i, d, d^{'}, μ) = \deg (head (i), 1), & i \in E_{1} \cup E_{3}, \end{matrix}

(A99)

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} δ_{dd} (t, 1, d, d^{'}, μ) = 1, & t \in [2, t^{*}], \end{matrix}

(A100)

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} μ \cdot δ_{dd} (t, 1, d, d^{'}, μ) = e (t), & t \in [2, t^{*}], \end{matrix}

(A101)

\begin{matrix} \sum_{d \in [1, 4], d^{'} \in [0, 4], μ \in {0, 1}} d \cdot δ_{dd} (t, 1, d, d^{'}, μ) = \deg (s^{*} + t - 1, 1), & t \in [2, t^{*}], \end{matrix}

(A102)

\begin{matrix} \sum_{d \in [0, 4], d^{'} \in [1, 4], μ \in {0, 1}} d^{'} \cdot δ_{dd} (t, 1, d, d^{'}, μ) = \deg (s^{*} + t, 1), & t \in [2, t^{*}], \end{matrix}

(A103)

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} δ_{dd} (p, i, d, d^{'}, μ) = 1, & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A104)

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} μ \cdot δ_{dd} (s, i, d, d^{'}, μ) = u (s, i), & s \in [1, s^{*}], i \in [2, n_{tree}], \end{matrix}

(A105)

\begin{matrix} \sum_{d, d^{'} \in [0, 4], μ \in {0, 1}} μ \cdot δ_{dd} (s^{*} + t, i, d, d^{'}, μ) = v (t, i), & t \in [1, t^{*}], i \in [2, n_{tree}], \end{matrix}

(A106)

\begin{matrix} \sum_{d \in [1, 4], d^{'} \in [0, 4], μ \in {0, 1}} d \cdot δ_{dd} (p, i, d, d^{'}, μ) = \deg (p, prt (i)), & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A107)

\begin{matrix} \sum_{d \in [0, 4], d^{'} \in [1, 4], μ \in {0, 1}} d^{'} \cdot δ_{dd} (p, i, d, d^{'}, μ) = \deg (p, i), & p \in [1, {cs}^{*}], i \in [2, n_{tree}], \end{matrix}

(A108)

\begin{matrix} \sum_{d, d^{'} \in [1, 4], μ \in {0, 1}} δ_{\hat{dd}} (s, t, d, d^{'}, μ) = 1, & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A109)

\begin{matrix} \sum_{d, d^{'} \in [1, 4], μ \in {0, 1}} μ \cdot δ_{\hat{dd}} (s, t, d, d^{'}, μ) = e (s, t) + e (t, s), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A110)

\begin{matrix} \sum_{d \in [1, 4], d^{'} \in [0, 4], μ \in {0, 1}} d \cdot δ_{\hat{dd}} (s, t, d, d^{'}, μ) = \deg (s, 1), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A111)

\begin{matrix} \sum_{d \in [0, 4], d^{'} \in [1, 4], μ \in {0, 1}} d^{'} \cdot δ_{\hat{dd}} (s, t, d, d^{'}, μ) = \deg (s^{*} + t, 1), & s \in [1, s^{*}], t \in [1, t^{*}], \end{matrix}

(A112)

\begin{matrix} (1 - ξ) κ_{1} \leq \sum_{i \in E_{1} \cup E_{3}, d, d^{'} \in [1, 4]} δ_{dd} (i, d, d^{'}, 1) / \sqrt{d d^{'}} + \sum_{t \in [2, t^{*}], d, d^{'} \in [1, 4]} δ_{dd} (t, 1, d, d^{'}, 1) / \sqrt{d d^{'}} \\ + \sum_{p \in [1, {cs}^{*}], i \in [2, n_{tree}], d, d^{'} \in [1, 4]} δ_{dd} (p, i, d, d^{'}, 1) / \sqrt{d d^{'}} + \sum_{s \in [1, s^{*}], t \in [1, t^{*}], d, d^{'} \in [1, 4]} δ_{\hat{dd}} (s, t, d, d^{'}, 1) / \sqrt{d d^{'}} \leq (1 + ξ) κ_{1}, \end{matrix}

(A113)

where a tolerance $ξ$ is set to be $0.001$.
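The sum bounded in (A113) adds $1 / \sqrt{d d^{'}}$ over the edges of H, where d and $d^{'}$ are the degrees of the two end-vertices, and the MILP only requires the real variable $κ_{1}$ to agree with it up to the relative tolerance $ξ$. The Python sketch below computes that sum on an explicit edge list and tests the two-sided bound; the function names are hypothetical, and H is assumed to be given as a list of vertex pairs.

```python
import math
from collections import Counter


def path_connectivity_1(edges):
    """Sum of 1 / sqrt(deg(u) * deg(v)) over all edges uv of the graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(1.0 / math.sqrt(deg[u] * deg[v]) for u, v in edges)


def within_tolerance(kappa, exact, xi=0.001):
    """Two-sided bound of (A113): (1 - xi) * kappa <= exact <= (1 + xi) * kappa."""
    return (1 - xi) * kappa <= exact <= (1 + xi) * kappa


# Path on three vertices: degrees 1, 2, 1, so the sum is 2 / sqrt(2) = sqrt(2).
path = [(0, 1), (1, 2)]
exact = path_connectivity_1(path)
```

With this value, a candidate such as 1.4142 for $κ_{1}$ satisfies the bound, while 1.5 does not.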

Appendix A.11. Constraints for Left-Heavy Trees

To reduce the number of rank-2 chemical graphs G that are isomorphic to each other, we include in $C_{2}$ some additional constraints so that each subtree $T^{'}$ selected from tree $S_{p}$ or $T_{t}$ satisfies the following property:

for any two siblings $u (p, j_{1})$ and $u (p, j_{2})$, $j_{1} < j_{2}$ in $T^{'}$, the number of descendants of $u (p, j_{1})$ is not smaller than that of $u (p, j_{2})$.

For this, we define $dsn (p, i)$ to be the number of descendants of a vertex $u_{p, i}$ (or $v_{p - s^{*}, i}$) in a selected graph H and $η (p, i) ≜ 21 | Λ | dsn (p, i) + 20 \tilde{α} (p, i) + 4 \deg (p, i) + \tilde{β} (p, i)$, $p \in [1, {cs}^{*}]$, $i \in [2, n_{tree}]$. We include constraints that compute the values of $dsn$ recursively.

variables:

$dsn (p, i) \in [1, n_{tree}]$, $p \in [1, {cs}^{*}]$, $i \in [1, n_{tree}]$: the number of descendants of vertex $u_{p, i}$ in tree $S_{p}$ for $p \leq s^{*}$ and vertex $v_{p - s^{*}, i}$ in tree $T_{p - s^{*}}$ for $p > s^{*}$;

constraints:

\begin{matrix} dsn (s, i) \geq \sum_{j \in Cld (i)} dsn (s, j) + u (s, i), & s \in [1, s^{*}], i \in [1, n_{tree}], \end{matrix}

(A114)

\begin{matrix} dsn (s^{*} + t, i) \geq \sum_{j \in Cld (i)} dsn (s^{*} + t, j) + v (t, i), & t \in [1, t^{*}], i \in [1, n_{tree}], \end{matrix}

(A115)

\begin{matrix} \sum_{p \in [1, {cs}^{*}]} dsn (p, 1) \leq n^{*}, \end{matrix}

(A116)

\begin{matrix} η (p, j_{1}) \geq η (p, j_{2}), & p \in [1, {cs}^{*}], j_{1}, j_{2} \in Cld (1), j_{1} < j_{2}, \end{matrix}

(A117)

\begin{matrix} η (p, j_{1}) \geq η (p, j_{2}), & p \in [1, {cs}^{*}], i \in [2, n_{in}], j_{1}, j_{2} \in Cld (i), j_{1} < j_{2}, \text{ for } d_{max} = 3, \end{matrix}

(A118)

\begin{matrix} η (p, j_{1}) \geq η (p, j_{2}) \geq η (p, j_{3}), & p \in [1, {cs}^{*}], i \in [2, n_{in}], j_{1}, j_{2}, j_{3} \in Cld (i), j_{1} < j_{2} < j_{3}, \text{ for } d_{max} = 4 . \end{matrix}

(A119)
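The key $η (p, i)$ folds the descendant count, element code, degree and bond multiplicity of a vertex into one integer, and (A117) to (A119) ask that siblings appear in non-increasing order of this key. The Python sketch below evaluates the key and the sibling check directly; the function names and the sample values are hypothetical and serve only as an illustration.

```python
def eta(num_elements: int, dsn: int, alpha_code: int, deg: int, beta: int) -> int:
    """Evaluate eta = 21 * |Lambda| * dsn + 20 * alpha_code + 4 * deg + beta."""
    return 21 * num_elements * dsn + 20 * alpha_code + 4 * deg + beta


def siblings_ordered(keys) -> bool:
    """True if the keys of the siblings, listed by increasing index j, never increase."""
    return all(keys[i] >= keys[i + 1] for i in range(len(keys) - 1))


# Two siblings with |Lambda| = 3: the first has 2 descendants, the second has 1.
first = eta(3, dsn=2, alpha_code=1, deg=2, beta=1)   # 126 + 20 + 8 + 1
second = eta(3, dsn=1, alpha_code=3, deg=1, beta=3)  # 63 + 60 + 4 + 3
```

In this sample the first sibling has the larger key, so the pair satisfies the ordering; swapping the two siblings would violate it.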


Figure 1. An illustration of the three rank-2 polymer topologies $M_{1}, M_{2}, M_{3} \in PT (2, 4)$.

Figure 2. An illustration of the least simple graphs of the rank-2 polymer topologies $M_{1}, M_{2}, M_{3} \in PT (2, 4)$ in Figure 1 and a scheme graph $(K, E)$: (a) $S (M_{1})$; (b) $S (M_{2})$; (c) $S (M_{3})$; (d) a scheme graph $(K = ({u_{1}, u_{2}, u_{3}, u_{4}}, E), E = (E_{1}, E_{2}, E_{3}))$ where each edge $u_{i} u_{j}$ is directed from one end-vertex $u_{i}$ to the other end-vertex $u_{j}$ with $i < j$, and $E_{1} = {a_{1} = (u_{1}, u_{4}), a_{2} = (u_{2}, u_{3}), a_{3} = (u_{2}, u_{4})}$, $E_{2} = {a_{4} = (u_{1}, u_{2}), a_{5} = (u_{3}, u_{4})}$ and $E_{3} = {a_{6} = (u_{1}, u_{2}), a_{7} = (u_{3}, u_{4})}$, and the edges in $E_{1}$ (resp., $E_{2}$ and $E_{3}$) are depicted with dashed (resp., dotted and solid) lines.

Figure 3. An illustration of Step 1: A data set $D_{π}$ of chemical graphs $G_{i}$, $i = 1, 2, \dots, m$ in a class $\mathcal{G}$ of graphs whose values $a (G_{i}) \in [\underline{a}, \bar{a}]$ of a chemical property $π$ are available.

Figure 4. An illustration of Step 2: Each chemical graph $G \in \mathcal{G}$ is mapped to a vector $f (G)$ in a feature vector space $R^{k}$ for some positive integer k.

Figure 5. An illustration of Step 3: A prediction function $ψ_{N}$ from the feature vector space $R^{k}$ to the range $[\underline{a}, \bar{a}]$ is constructed based on an ANN $N$.

Figure 6. An illustration of Step 4: Given a target value $y^{*} \in [\underline{a}, \bar{a}]$, solving MILP $M (x, y, g; C_{1}, C_{2})$ either delivers a set $F^{*}$ of vectors $x^{*} \in A \cap D$ such that $(1 - ε) y^{*} \leq ψ_{N} (x^{*}) \leq (1 + ε) y^{*}$ or detects that no such vector $x$ exists.

Figure 7. An illustration of Step 5: For each vector $x^{*} \in F^{*}$, all chemical graphs $G^{*} \in \mathcal{G}$ such that $f (G^{*}) = x^{*}$ are generated.

Figure 8. An illustration of a tree-extension, where the vertices in $V (K)$ are depicted with gray circles: (a) The structure of the rooted tree $S_{s}$ rooted at a vertex $u_{s, 1}$; (b) the structure of the rooted tree $T_{t}$ rooted at a vertex $v_{t, 1}$; (c) the $(a, b, c)$-tree-extension of the scheme graph in Figure 2d for $a = t^{*} = 3$, $b = {ch}^{*} = 2$ and $c = d_{max} = 4$.

Figure 9. (a) An example of an extension of the scheme graph; (b) an example of a rank-2 graph H with $n (H) = 21$, $cs (H) = 9$, $ch (H) = 2$ and $θ (H) = 1$, where the labels of some vertices and edges indicate the corresponding vertices and edges in the $(t^{*}, {ch}^{*}, d_{max})$-tree-extension for ${cs}^{*} = cs (H)$, ${ch}^{*} = ch (H)$, $s^{*} = 4$, $t^{*} = {cs}^{*} - s^{*}$ and $d_{max} = 3$; (c) a subgraph $H^{'}$ of the $(t^{*} = 5, {ch}^{*} = 2, d_{max} = 3)$-tree-extension isomorphic to the rank-2 graph H in (b).

Figure 10. An illustration of inferred rank-2 chemical graphs $G^{*}$ with $θ = - 2$: (a) $y_{Kow}^{*} = 5$, $θ = - 2$, $n = 30$, core size = 16, core height = 3, $d_{max} = 4$; (b) $y_{Mp}^{*} = 250$, $θ = - 2$, $n = 30$, core size = 16, core height = 2, $d_{max} = 3$; (c) $y_{Bp}^{*} = 150$, $θ = - 2$, $n = 25$, core size = 17, core height = 4, $d_{max} = 3$; (d) $y_{Kow}^{*} = 5$, $y_{Mp}^{*} = 150$, $y_{Bp}^{*} = 250$, $θ = - 2$, $n = 22$, core size = 14, core height = 3, $d_{max} = 3$.

Figure 11. An illustration of inferred rank-2 chemical graphs $G^{*}$: (a) $y_{Kow}^{*} = 5$, $θ = 0$, $n = 30$, core size = 14, core height = 2, $d_{max} = 3$; (b) $y_{Mp}^{*} = 250$, $θ = 0$, $n = 30$, core size = 16, core height = 2, $d_{max} = 4$; (c) $y_{Bp}^{*} = 150$, $θ = 0$, $n = 25$, core size = 17, core height = 2, $d_{max} = 3$.


Figure 12. An illustration of inferred rank-2 chemical graphs $G^{*}$: (a) $y_{Kow}^{*} = 5$, $θ = 2$, $n = 30$, core size = 15, core height = 5, $d_{\max} = 4$; (b) $y_{Mp}^{*} = 250$, $θ = 2$, $n = 30$, core size = 17, core height = 2, $d_{\max} = 3$; (c) $y_{Bp}^{*} = 150$, $θ = 2$, $n = 25$, core size = 17, core height = 3, $d_{\max} = 3$.


Table 1. Results of Step 1 in Phase 1.

$π$	$|D_{π}|$	$Λ$	$|Γ|$	$[\underline{n}, \overline{n}]$	$[\underline{cs}, \overline{cs}]$	$[\underline{ch}, \overline{ch}]$	$[\underline{θ}, \overline{θ}]$	$[\underline{a}, \overline{a}]$
Kow	93	`C,N,O`	9	[9, 31]	[7, 16]	[0, 13]	[−5, 3]	[−3.7, 12.2]
Mp	63	`C,N,O`	7	[9, 31]	[7, 17]	[0, 4]	[−6, 3]	[−80, 300]
Bp	45	`C,N,O,S,P,Cl`	9	[9, 25]	[7, 15]	[0, 7]	[−4, 3]	[155, 420]

Table 2. Results of Steps 2 and 3 in Phase 1.

$π$	$k$	Activation	Architecture	L-Time	Test $R^{2}$ (ave.)	Test $R^{2}$ (best)
Kow	37	relu	(37,10,1)	3.92	0.866	0.964
Mp	33	relu	(33,10,1)	21.68	0.805	0.916
Bp	43	relu	(43,10,1)	11.88	0.802	0.947

Table 3. Results of Steps 4 and 5 with $d_{\max} = 3$ and $θ = -2$.


$π$	$y^{*}$	$n^{*}$	$|F^{*}|$/#I	IP-Time	#$G^{*}$	G-Time
Kow	5	15	12/12	9.96	100	2236.0
Kow	5	20	12/12	30.38	12	>1 h
Kow	5	25	12/12	47.57	12	>1 h
Kow	5	30	12/12	69.38	12	>1 h
Mp	150	15	12/12	9.52	100	2069.0
Mp	150	20	12/12	22.79	12	>1 h
Mp	150	25	12/12	47.20	12	>1 h
Mp	150	30	12/12	66.90	12	>1 h
Bp	250	15	11/12	9.50	100	103.5
Bp	250	19	12/12	19.08	12	>1 h
Bp	250	22	12/12	25.78	12	>1 h
Bp	250	25	12/12	67.64	12	>1 h

Table 4. Results of Steps 4 and 5 with $d_{\max} = 4$ and $θ = -2$.


$π$	$y^{*}$	$n^{*}$	$|F^{*}|$/#I	IP-Time	#$G^{*}$	G-Time
Kow	5	15	11/12	31.84	100	413.8
Kow	5	20	12/12	69.65	12	>1 h
Kow	5	25	12/12	144.20	11	>1 h
Kow	5	30	12/12	352.01	12	>1 h
Mp	150	15	9/12	20.68	100	947.4
Mp	150	20	11/12	73.73	11	>1 h
Mp	150	25	9/12	140.09	9	>1 h
Mp	150	30	12/12	304.04	12	>1 h
Bp	250	15	7/12	28.51	100	232.7
Bp	250	19	11/12	82.01	11	>1 h
Bp	250	22	12/12	150.55	12	>1 h
Bp	250	25	12/12	239.84	12	>1 h

Table 5. Results of Steps 4 and 5 with $d_{\max} = 3$ and $θ = 0$.


$π$	$y^{*}$	$n^{*}$	$|F^{*}|$/#I	IP-Time	#$G^{*}$	G-Time
Kow	5	15	12/12	11.00	100	121.1
Kow	5	20	12/12	25.64	12	>1 h
Kow	5	25	12/12	38.79	12	>1 h
Kow	5	30	12/12	49.65	12	>1 h
Mp	150	15	12/12	8.45	100	373.4
Mp	150	20	12/12	18.94	12	>1 h
Mp	150	25	12/12	37.13	12	>1 h
Mp	150	30	12/12	44.745	4	>1 h
Bp	250	15	9/12	8.450	100	74.2
Bp	250	19	11/12	16.31	11	>1 h
Bp	250	22	12/12	21.71	12	>1 h
Bp	250	25	12/12	45.80	12	>1 h

Table 6. Results of Steps 4 and 5 with $d_{\max} = 4$ and $θ = 0$.


$π$	$y^{*}$	$n^{*}$	$|F^{*}|$/#I	IP-Time	#$G^{*}$	G-Time
Kow	5	15	9/12	36.33	100	23.2
Kow	5	20	12/12	82.01	12	>1 h
Kow	5	25	12/12	138.96	12	>1 h
Kow	5	30	12/12	292.79	12	>1 h
Mp	150	15	9/12	19.89	100	557.6
Mp	150	20	11/12	63.62	11	>1 h
Mp	150	25	12/12	112.49	12	>1 h
Mp	150	30	12/12	171.11	12	>1 h
Bp	250	15	3/12	34.60	100	11.2
Bp	250	19	6/12	203.65	6	>1 h
Bp	250	22	9/12	218.07	9	>1 h
Bp	250	25	11/12	783.80	11	>1 h

Table 7. Results of Steps 4 and 5 with $d_{\max} = 3$ and $θ = 2$.


$π$	$y^{*}$	$n^{*}$	$|F^{*}|$/#I	IP-Time	#$G^{*}$	G-Time
Kow	5	15	12/12	11.64	100	1386.7
Kow	5	20	12/12	23.84	12	>1 h
Kow	5	25	12/12	33.71	12	>1 h
Kow	5	30	12/12	61.85	12	>1 h
Mp	150	15	12/12	9.80	100	1614.3
Mp	150	20	12/12	20.15	12	>1 h
Mp	150	25	12/12	36.42	12	>1 h
Mp	150	30	12/12	40.58	12	>1 h
Bp	250	15	11/12	10.25	100	1756.1
Bp	250	19	12/12	16.02	12	>1 h
Bp	250	22	12/12	23.63	12	>1 h
Bp	250	25	12/12	63.84	12	>1 h

Table 8. Results of Steps 4 and 5 with $d_{\max} = 4$ and $θ = 2$.


$π$	$y^{*}$	$n^{*}$	$|F^{*}|$/#I	IP-Time	#$G^{*}$	G-Time
Kow	5	15	11/12	28.15	100	20.3
Kow	5	20	12/12	71.90	12	>1 h
Kow	5	25	12/12	112.71	12	>1 h
Kow	5	30	12/12	267.21	12	>1 h
Mp	150	15	9/12	22.53	100	2748.1
Mp	150	20	11/12	53.44	11	>1 h
Mp	150	25	12/12	143.33	12	>1 h
Mp	150	30	12/12	220.63	12	>1 h
Bp	250	15	6/12	27.33	100	254.2
Bp	250	19	9/12	75.50	9	>1 h
Bp	250	22	11/12	133.01	11	>1 h
Bp	250	25	12/12	228.75	12	>1 h

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

