An Inverse QSAR Method Based on a Two-Layered Model and Integer Programming

A novel framework for inverse quantitative structure–activity relationships (inverse QSAR) has recently been proposed and developed using both artificial neural networks and mixed integer linear programming. However, classes of chemical graphs treated by the framework are limited. In order to deal with an arbitrary graph in the framework, we introduce a new model, called a two-layered model, and develop a corresponding method. In this model, each chemical graph is regarded as two parts: the exterior and the interior. The exterior consists of maximal acyclic induced subgraphs with bounded height, the interior is the connected subgraph obtained by ignoring the exterior, and the feature vector consists of the frequency of adjacent atom pairs in the interior and the frequency of chemical acyclic graphs in the exterior. Our method is more flexible than the existing method in the sense that any type of graphs can be inferred. We compared the proposed method with an existing method using several data sets obtained from PubChem database. The new method could infer more general chemical graphs with up to 50 non-hydrogen atoms. The proposed inverse QSAR method can be applied to the inference of more general chemical graphs than before.


An MILP Formulation for Inferring a Target Chemical Graph in
Stage 4

Constructing Target Chemical Graphs
This section describes how to construct a target chemical graph in Stages 4 and 5.

Formulating an MILP for a prediction function in Stage 4
In Stage 3, we construct a prediction function η N : R K → R. It is known that the computation process of η N (x) from a vector x * ∈ R K can be formulated as an MILP with the following property.
Theorem 1. ( [1,2]) Let N be an ANN with a piecewise-linear activation function for an input vector x ∈ R K , n A denote the number of nodes in the architecture and n B denote the total number of breakpoints over all activation functions. Then there is an MILP M(x, y; C 1 ) that consists of variable vectors x ∈ R K , y ∈ R, and an auxiliary variable vector z ∈ R p for some integer p = O(n A + n B ) and a set C 1 of O(n A + n B ) constraints on these variables such that: η N (x * ) = y * if and only if there is a vector (x * , y * ) feasible to M(x, y; C 1 ).
Solving this MILP delivers a vector x * ∈ R K such that η N (x * ) = y * for a target value y y y * . However, the resulting vector x * may not admit a chemical graph G * such that f (G * ) = x * . To ensure that such chemical graph always exists in Stage 4, we further introduce some more constraints for a set of new variables in the next section.

Formulating an MILP for a feature vector and a target specification in Stage 4
In this section, we show an outline of formulation of an MILP that represents the computation process of a feature function f (G) from a chemical graph G and a construction of a target chemical graph G ∈ G(G C , σ int , σ ce ). Recall that the number of vertices in a target chemical graph is bounded by an upper bound n * in a specification (G C , σ int , σ ce ). However, if we introduce a set of (n * ) 2 variables for all pairs of n * vertices to present all possible graphs for a target chemical graph, then the resulting MILP formulation is hard to solve for n * > 20 due to a larger number of variables and constraints. To overcome this, a sparse representation of chemical graphs has been proposed in the previous applications of the framework for acyclic graphs [3] and ρ-lean graphs [4]. We also define a similar sparse representation to formulate an MILP for our two-layered model.

Scheme Graphs
We first regard a given seed graph G C as a digraph and then add some more vertices and edges to construct a digraph, called a scheme graph SG = (V, E) so that any (σ int , σ ce )-extension H of G C can be chosen as a subgraph of SG.
For a given target specification (G C , σ int , σ ce ), define integers that determine the size of a scheme graph SG as follows. m C := |E C |, t C := |V C |, t T := n int UB − |V C |, and t F : : Formally the scheme graph SG = (V, E) is defined with a vertex set V = V C ∪ V T ∪ V F and an edge set E = E C ∪ E T ∪ E F ∪ E CT ∪ E TC ∪ E CF ∪ E TF that consist of the following sets. See Figure 1 for an illustration of these sets.
Construction of a σ int -extension H * of G C : Denote the vertex set V C and the edge set E C in the seed graph G C by V C = {v C i | i ∈ [1, t C ]} and E C = {a i | i ∈ [1, m C ]}, respectively, where V C is always included in H * . For including additional interior-vertices in H * , introduce a path P T = (V T = {v T 1 , v T 2 , . . . , v T t T }, E T = {e T 2 , e T 3 , . . . , e T t T }) of length t T − 1 and a set E CT (resp., E TC ) of directed edges e CT i, is allowed to be replaced with a pure path P k from vertex v C i to vertex v C i ′ that visits a set of consecutive vertices v T j , v T j+1 , . . . , v T j+p ∈ V T and edge e TC i,j = (v C i , v T j ) ∈ E CT , then edges e T j+1 , e T j+2 , . . . , e T j+p ∈ E T and finally edge The vertices in V T selected in the path will be vertices in H * .

Appending leaf paths with additional interior-edges in a
and edges e F j+1 , e F j+2 , . . . , e F j+p ∈ E F . In H, the edges and the vertices selected in the path Q are regarded as interior-edges and interior-vertices, respectively.
Construction of ρ-fringe-trees in a (σ int , σ ce )-extension G of G C : In H, the root of a ρ-fringetree can be any vertex in we choose a chemical rooted tree T from the specified set F(v) (resp., F E ).

Theorem 2.
Let (G C , σ int , σ ce ) be a target specification and φ * = |Λ int (D π )|+|Λ ex (D π )|+|Γ int (D π )|+ |F * | for sets of chemical elements, edge-configurations and fringe-configurations in σ ce . Then there is an MILP M(x, g; C 2 ) that consists of variable vectors x ∈ R K * and g ∈ R q for an integer q = O(n int Note that our MILP requires only O(n * ) variables and constraints when the branch-parameter ρ, integers |E C |, n int UB and φ * are constant. We explain the basic idea of our MILP that satisfies Theorem 2. The MILP mainly consists of the following three types of constraints. C1. Constraints for selecting an underlying graph H of a chemical graph G ∈ G(G C , σ int , σ ce ) as a subgraph of the scheme graph SG; C2. Constraints for assigning chemical elements to interior-vertices and multiplicity to interior-edges to determine a chemical graph G = (H, α, β); and C3. Constraints for computing descriptors in the feature vector f (G) of the selected chemical graph G.
In the constraints of C1, more formally we prepare the following. Variables: Constraints: -linear constraints so that each ρ-fringe-tree rooted at a vertex v X i in a graph H from SG is selected from the given set F(v C i ) for X=C (or F E for X ∈ {T, F}); -linear constraints such that each edge e C i = a i ∈ E (=1) is always used as an edge in H and each edge e C i = a i ∈ E (0/1) is used as an edge in H if necessary; -linear constraints such that for each edge -linear constraints such that for each edge H by a pure path P k as in the case of edges in E (≥2) ; -linear constraints for selecting a leaf path . . , e F j+p for some integers j and p. In the constraints of C2, we prepare an integer variable α X (i) for each vertex v X i ∈ V, X ∈ {C, T, F} in the scheme graph that represents the chemical element This determines a chemical graph G = (H, α, β). Also we include constraints for a selected chemical graph G to satisfy the valence condition at each interior-vertex v with the edge-configurations ec(e) of the edges e incident to v and the chemical specification σ ce .
In the constraints of C3, we introduce a variable for each descriptor and constraints with some more variables to compute the value of each descriptor in f (G) for a selected chemical graph G.
The details of the MILP can be found in Section 3.  This section briefly reviews the method [4] for Stage 5. Let G † be a chemical graph that is a (σ int , σ ce )-extension of a seed graph G C = (V C , E C ), where we denote by E (=0) the set of the edges in E (0/1) that are not used in G † . We define a base-graph G B = (V B , E B ) to be the seed graph (V C , E C \ E (=0) ) after removing the edges in E (=0) . We call a chemical graph G * a chemical isomer

A Dynamic Programming Algorithm for Generating Isomers in
GB=GC-E (=0) Figure 3: An illustration of generating a chemical isomer G * of a chemical graph G † with a base-graph The method generates chemical isomers G * of G † in the following way, where Figure 3 illustrates the whole process in the case of 1. We first decompose a given chemical graph G † into a collection of chemical rooted or bi-rooted trees.
-For each vertex v ∈ V B , let T † v denote the chemical rooted tree rooted at v in G that is constructed with a leaf path Q v and fringe-trees attached to Q v . Possibly T † v consists of a single vertex v and we call such a tree trivial.
-For each edge a = uv ∈ E (≥2) ∪ E (≥1) , let T † a denote the chemical bi-rooted tree rooted at vertices u and v in G that consists of a pure u, v-path P a , leaf paths rooted at internal vertices in P a and fringe-trees attached to theses leaf paths. Possibly T † a consists of a single edge a and we call such a tree trivial. Figure 4 illustrates the non-trivial chemical trees and then generate a set T t of all (or a limited number of) chemical acyclic graphs T * t such that f (T * t ) = x * t and the structure of T * t satisfies the lower and upper bounds in the interiorspecification σ int by using the dynamic programming algorithm for chemical acyclic graphs [3].

For each combination of chemical trees
where we ignore a possible automorphism of the resulting graphs G * .
The above method [4] can be used to generate chemical isomers in Stage 5 in our two-layered model by making a minor modification to the definition of a feature vector f (G).  Figure 2, where the gray squares indicate the roots of these rooted and bi-rooted trees.

All Constraints in an MILP Formulation for Chemical Graphs
We define a standard encoding of a finite set A of elements to be a bijection σ :

Selecting a Cyclical-base
Recall that -For each edge a i ∈ E (≥1) , either edge a i is directly used in G or the above path P i of length at least 2 is constructed in G.
For each directed edge a i ∈ E C , let head(i) and tail(i) denote the head and tail of e C (i); i.e., To control the construction of such a path P i for each edge as the "color" of the edge. To introduce necessary linear constraints that can construct such a path P k properly in our MILP, we assign the color k to the constants: : lower and upper bounds on the length of path P k ; variables: constraints:

Constraints for Including Leaf Paths
Let t C denote the number of vertices u ∈ V C such that bl UB (u) = 1 and assume that Define the set of colors for the vertex set : a lower bound on the number of leaf ρ-branches in the leaf path rooted at a vertex v C i ; -n int G ∈ [n int LB , n int UB ]: the number of interior-vertices in G; as an internal vertex and the ρ-fringe-tree rooted at v T i contains a leaf ρ-branch; constraints:

Constraints for Including Fringe-trees
To express the condition that the ρ-fringe-tree is chosen from a rooted tree C i , T i or F i , we introduce the following set of variables and constraints.  -n G ∈ [n LB , n * ]: n(G); , X ∈ {C, T, F}: the number of children of the root of the ρ-fringe-tree with color k has the largest height among such trees;

Descriptor for the Number of Specified Degree
We include constraints to compute descriptors dg int d (G), d ∈ [1,4].

Assigning Multiplicity
We prepare an integer variable β(e) for each edge e in the scheme graph SG to denote the bondmultiplicity of e in a selected graph G and include necessary constraints for the variables to satisfy in G.

constants:
-β r ([ψ]): the sum of bond-multiplicities of edges incident to the root of a tree ψ ∈ F * ; variables:  constraints:

Assigning Chemical Elements and Valence Condition
We include constraints so that each vertex u in a selected graph H satisfies the valence condition; i.e., ∑ uv∈E(H) β(uv) ≤ val(α(u)). With these constraints, a chemical graph G = (H, α, β) on a selected subgraph H will be constructed. variables: : the bond-multiplicity of edge e CT j,i (resp., e TC j,i ) if one exists; constraints:

Constraints for Bounds on the Number of Bonds
We include constraints for specification of lower and upper bounds bd LB and bd UB .

Descriptor for the Number of Adjacency-configurations
We call a tuple (a, b, m) ∈ Λ × Λ × [1,3] an adjacency-configuration. The adjacency-configuration of an edge-configuration (µ = ad, µ ′ = bd ′ , m) is defined to be (a, b, m). We include constraints to compute the frequency of each adjacency-configuration in an inferred chemical graph G. constants: -Let γ of an edge-configuration γ = (µ, ξ, m) denote the edge-configuration (ξ, µ, m); -Let Γ int ac,< , Γ int ac,= and Γ int ac,> denote the sets of the adjacency-configurations of edge-configurations in the sets Γ int < , Γ int = and Γ int > , respectively; , ν ∈ Γ int ac : the number of interior-edges with adjacency-configuration ν; , ν ∈ Γ F ac : the number of edges e C ∈ E C (resp., edges e T ∈ E T and edges e F ∈ E F ) with adjacency-configuration ν; , ν ∈ Γ TF ac : the number of edges e CT ∈ E CT (resp., edges e TC ∈ E TC and edges e CF ∈ E CF and e TF ∈ E TF ) with adjacency-configuration ν;  constraints: -Choose subsets Λ C dg , Λ T dg , Λ F dg ⊆ Λ int dg : To compute the frequency of chemical symbols exactly, set Λ C dg := Λ T dg := Λ F dg := Λ int dg ; variables:

Descriptor for the Number of of Fringe-configurations
We include constraints to compute the frequency of each fringe-configuration in an inferred chemical graph G.

Constraints for Normalization of Feature Vectors
By introducing a tolerance ε > 0 in the conversion between integers and reals, we include the following constraints for normalizing of a feature vector f (G) = (x 1 , x 2 , . . . , x K ): An example of a tolerance is ε = 0.01.