Similar Supergraph Search Based on Graph Edit Distance

Masataka Yamada; Akihiro Inokuchi

doi:10.3390/a14080225

and

Graduate School of Science and Technology, Kwansei Gakuin University, 2-1 Gakuen, Sanda 669-1337, Japan

^*

Author to whom correspondence should be addressed.

Algorithms2021, 14(8), 225;https://doi.org/10.3390/a14080225

This article belongs to the Special Issue Scalable Graph Algorithms and Applications

Version Notes

Order Reprints

Abstract

Subgraph and supergraph search methods are promising techniques for the development of new drugs. For example, the chemical structure of favipiravir—an antiviral treatment for influenza—resembles the structure of some components of RNA. Represented as graphs, such compounds are similar to a subgraph of favipiravir. However, the existing supergraph search methods can only discover compounds that match exactly. We propose a novel problem, called similar supergraph search, and design an efficient algorithm to solve it. The problem is to identify all graphs in a database that are similar to any subgraph of a query graph, where similarity is defined as edit distance. Our algorithm represents the set of candidate subgraphs by a code tree, which it uses to efficiently compute edit distance. With a distance threshold of zero, our algorithm is equivalent to an existing efficient algorithm for exact supergraph search. Our experiments show that the computation time increased exponentially as the distance threshold increased, but increased sublinearly with the number of graphs in the database.

Keywords:

labeled graph; supergraph search; similarity graph; graph edit distance

1. Introduction

The coronavirus disease (COVID-19) has been spreading widely since 2019. In Japan, favipiravir (brand name: Avigan) has been examined as a promising antiviral medication against COVID-19. Favipiravir was initially developed as a medication to treat influenza. When Favipiravir is ribosylated, its structure resembles acadesine, which is a precursor to inosine. Inosine is the previous stage of guanosine and adenosine, which are components of RNA [1]. With this treatment, the RNA synthetase of a virus mistakenly incorporates favipiravir, instead of guanosine, into the replicating virus. This causes RNA synthesis to stop and limits the replication of viruses inside the body.

Figure 1 shows the chemical structures of guanosine, inosine, acadesine, and ribosylated favipiravir. The substructure

α

, drawn in red, is common to all of guanosine, inosine, and acadesine. Therefore, in order to find candidate compounds for a new drug, it may be useful to search for compounds containing a substructure, provided as a query, in a database that consist of many compounds (i.e., substructure search). Conversely, if many substructures relevant to new drug development are known, it can be useful to search for substructures contained in a newly developed chemical compound, provided as a query, in a database recording these substructures (i.e., superstructure search). This paper discusses methods related to the latter task. The substructure drawn in blue in Figure 1 is similar to the substructure

α

, but they do not match perfectly. In the situation shown, the substructure

α

, which contributes to anti-influenza agents, is registered in a database. The database contains various substructures for which its medicinal effects, side effects and so on are known. If ribosylated favipiravir is given as the query and the substructure

α

is the output as the search result, we may be able to discover that ribosylated favipiravir is anti-influenza. Therefore, such a search engine for chemical compounds would be very useful and effective.

Figure 1. Similar Chemical Compounds.

A compound can be represented as a graph, in which the vertices and edges correspond to the atoms and chemical bonds, respectively. The aforementioned substructure and superstructure search correspond to subgraph search [2,3,4,5,6,7,8,9,10,11,12,13,14] and supergraph search [15,16,17,18,19,20,21,22], respectively, which are actively being studied by researchers. A graph is a highly versatile data structure for representing information in many fields. It can be used to represent human relationship networks, protein interaction networks, resource description frameworks (RDF), computer-aided designs (CAD), and computer vision images; moreover, subgraph and supergraph search are used for many applications in addition to searching for compounds. However, there is no method for searching for graphs that are similar to subgraphs of a query graph and no method applicable to the resemblance between substructures shown in Figure 1. This paper proposes the problem of similar supergraph search and also proposes a method to solve it.

In this paper, we measure the similarity between two graphs g and

g^{'}

by the graph edit distance. The graph edit distance is the minimum number of edits required to transform g to another

g^{'}

, where an edit is either an insertion, deletion, or relabeling of a vertex or edge. The number of graphs obtained by editing the graph g

θ

times increases drastically, especially when the numbers of vertices and edges are large and the varieties of vertex and edge labels are wide. Such a drastic increase in edited graphs makes it difficult to find similar graphs. However, some real applications need to obtain a subset of graphs edited

θ

times. For example, editing the substructure in a broken ellipse of Figure 1 is not admissible, whereas editing the remainder of the blue substructure is admissible. By restricting an editable part in the structure, the number of graphs obtained by editing the graph g

θ

times is decreased. In our proposed method, users can search for their desired type of similar graphs in a database containing graphs and this search can be realized by simply rewriting only one function without the need to modify our proposed algorithm. Therefore, the method proposed in this paper is a general and flexible framework for searching for similar subgraphs of query graphs and is easily customizable for searching for the types of graphs desired by the user.

The remainder of the paper is organized as follows. Section 2 formalizes the novel problem of searching for graphs contained in a query graph in a database storing many graphs and Section 3 discusses related work. In Section 4 and Section 5, we propose straightforward methods for a simpler problem in which the searched graphs are constrained. In Section 6 and Section 7, we propose a novel method for solving the original problem by extending the above methods. In Section 8, we analyze the computational efficiency of the proposed method by using real-world datasets. Finally, we conclude the paper in Section 9.

2. Problem Definition

A labeled graph is expressed as

g = (V, E, Σ, ℓ)

, where V is a set of vertices,

E \subseteq V \times V

is a set of edges,

Σ

is a set of labels for the vertices and edges, and

ℓ : V \cup E \to Σ

is a function to assign the labels to the vertices and edges. We denote the set of all vertices in graph g as

V (g)

. A graph g is called a directed graph if the edges in g have directions; otherwise, g is called an undirected graph. A sequence of edges from

v_{1}

to

v_{2}

is called a path and a graph (for a directed graph, all edges in the graph are replaced with undirected edges.) and a path that exists between any two vertices is called connected.

Given graphs

g = (V, E, Σ, ℓ)

and

g^{'} = (V^{'}, E^{'}, Σ^{'}, ℓ^{'})

,

\forall v, v_{1}, v_{2} \in V^{'}

, when an injective function

ϕ : V^{'} \to V

fulfills the following conditions,

g^{'}

is called a subgraph of g, which is denoted as

g^{'} \subseteq g

:

$(ϕ (v_{1}), ϕ (v_{2})) \in E$ if $(v_{1}, v_{2}) \in E^{'}$ ;
$ℓ^{'} (v) = ℓ (ϕ (v))$ ;
$ℓ^{'} ((v_{1}, v_{2})) = ℓ ((ϕ (v_{1}), ϕ (v_{2})))$ .

In addition, when

(ϕ (v_{1}), ϕ (v_{2})) \in E

iff

(v_{1}, v_{2}) \in E^{'}

, we call

g^{'}

an induced subgraph of g, which is denoted as

g^{'} \subseteq_{i} g

. Although this paper focuses only on undirected and connected graphs, the methods proposed can easily be extended to directed or unconnected graphs.

The graph edit distance is the minimum number of edits required to transform a graph g to another graph

g^{'}

, where an edit is either an insertion, deletion, or relabeling of a vertex or edge. In this paper, we assume that the cost of each edit is 1. The number of edits between graphs

g = (V, E, Σ, ℓ)

and

g^{'} = (V^{'}, E^{'}, Σ, ℓ^{'})

is defined as follows:

d i s t (g, g^{'}, φ) = \sum_{i = 1}^{| V^{'} |} c_{i φ (i)} + \sum_{i = 1}^{| V^{'} | - 1} \sum_{j = i + 1}^{| V^{'} |} c ((v_{i}, v_{j}) \to (v_{φ (i)}, v_{φ (j)})),

(1)

where we assume that

| V | \leq | V^{'} |

and V is extended to

V^{+} = V \cup {ε_{1}, ε_{2}, \dots, ε_{| V^{'} | - | V |}}

[23]. In addition, when g is transformed to

g^{'}

, the mapping between V and

V^{'}

is expressed as a bijective mapping

φ : V^{+} \to V^{'}

.

c_{i j}

is the cost to edit the vertex

v_{i} \in V^{+}

to

v_{j} \in V^{'}

and

c ((v_{i}, v_{j}) \to (v_{i^{'}}, v_{j^{'}}))

is the cost to edit edge

(v_{i}, v_{j}) \in E

to

(v_{i^{'}}, v_{j^{'}}) \in E^{'}

. Editing

ε_{i} \in V^{+}

to

v_{j} \in V^{'}

means performing a vertex insertion in

g^{'}

. Therefore, the graph edit distance

e d (g, g^{'})

between the graphs g and

g^{'}

is defined as follows:

e d (g, g^{'}) = min_{φ \in Φ (| V^{'} |, | V^{'} |)} d i s t (g, g^{'}, φ),

(2)

where

Φ (α, β)

is a set of all possible

β

-permutations of the integers

1, 2, \dots, α

. The solution to Equation (2) is one of the quadratic assignment problems, which are known to be NP-complete.

Given a set of graphs

G = {g_{1}, g_{2}, \dots, g_{n}}

, a query graph q, and a threshold

θ

as input, the problem tackled in this paper is to output a set of graphs.

S = {g_{i} \in G ∣ \exists q^{'} \subseteq q s . t . e d (g_{i}, q^{'}) \leq θ} .

(3)

In Equation (3), note that

q^{'}

is a subgraph of q but not an induced subgraph of q. When

θ

is 0 in Equation (3), the problem is equivalent to the original supergraph search problem. Therefore, our problem, represented by Equation (3), is an extension of the supergraph search problem.

3. Related Work

In Section 2, we defined the problem tackled in this paper. Recently, much attention has been focused on searching for graphs satisfying some conditions in a database consisting of multiple graphs. Table 1 summarizes graph search problems and the publications that tackle each problem. The subgraph search problem is to find graphs that contain the query graph and can be expressed as follows:

{g_{i} \in G ∣ q \subseteq g_{i}},

whereas the supergraph search problem is to find graphs that the query graph contains and it can be expressed as follows.

{g_{i} \in G ∣ g_{i} \subseteq q} .

Table 1. Summary of graph search problems and references.

These problems contain the subgraph isomorphism problem, which is known to be NP-complete. The similar graph search problem is to find graphs that are similar to the query graph, and can be expressed as

{g_{i} \in G ∣ e d (g_{i}, q) \leq θ} .

This problem contains the graph edit distance problem, which is known to be NP-complete. Various methods have been proposed to solve these problems. The problem tackled in this paper is to find graphs that are similar to subgraphs of the query graph and contains both the subgraph isomorphism problem and graph edit distance problem. This problem is a novel one, to the best of our knowledge. Since the subgraph isomorphism problem and graph edit distance problem are NP-complete, a naive method for solving either of the problems for a given query graph and for each graph in the database would be unrealistic.

Most of the methods proposed in the articles cited in Table 1 are based on filtering and verification and an index in each of the methods has a set of patterns P extracted from the database in advance. For example, for the supergraph search problem, in the filtering phase,

G^{'} = G ∖ ⋃_{p \in P} {g_{i} \in G ∣ p ⊈ q, g_{i} \subseteq q}

is obtained. Subsequently, the verification phase checks whether

g \subseteq q

, for each graph

g \in G^{'}

. Figure 2a shows four graphs in a database and two patterns contained in an index. The index has the information that

p_{1} \subset g_{1}

,

p_{2} \subset g_{2}

,

p_{1} \subset g_{4}

, and

p_{2} \subset g_{4}

. Given the query shown in Figure 2c, supergraph search methods first check whether

p_{1} \subseteq q

and

p_{2} \subseteq q

and then obtain

G^{'} = {g_{1}, g_{3}}

in the filtering phase. In the verification phase, they test whether

g_{1} \subseteq q

and

g_{3} \subseteq q

. Since the number of vertices in

p_{2}

is less than one for

g_{2}

, the computation time to check whether

p_{2} \subseteq q

becomes less than one for

g_{2} \subseteq q

and we do not need to check whether

g_{2} \subseteq q

in the filtering phase. The patterns in P are either paths, trees, cycles, or graphs extracted from databases. The patterns are extracted by the enumeration or mining method. The enumeration method enumerates all subgraphs in graphs in a database, where the subgraphs are restricted to either paths, trees, cycles, or graphs. In contrast, the mining method enumerates all frequent patterns [24], which are defined as

F (G, σ^{'}) = {p ∣ p \subseteq g_{i}, g_{i} \in G, σ (p, G) \geq σ^{'}}

from graphs in a database, where

σ (p, G)

is defined as the number of graphs in G containing pattern p:

σ (p, G) = | {g_{i} \in G ∣ p \subseteq g_{i}} | .

Figure 2. Example of supergraph search.

The extracted patterns are stored in a tree or lattice structure in the index of the graph database. Enumerating patterns and mining patterns from a database are time-consuming processes.

CodeTree [17] is a method for the supergraph search problem. It does not distinguish between filtering and verification phases; it filters out graphs that are not contained in a given query graph while verifying unfiltered graphs. In addition, CodeTree does not enumerate or mine patterns in a database. All graphs in a database are stored in the form of one of the graph codes, such as AcGM [33] code or DFS [34] code, in the index in CodeTree. The AcGM and DFS codes are based on adjacency matrices and adjacency lists, which are alternatives to graph representations. The algorithms for constructing an index for a database and searching graphs in the index are independent of the graph codes and, thus, the method can be integrated and extended with various graph codes for representing graphs in accordance with the intended characteristics of those codes. For this reason, the method proposed in this paper is based on CodeTree.

In this paper, we focus on the problems in which each database consists of multiple graphs. Other subgraph search problems are to search for a database which consists of “a single large graph”. Since the number of graphs given as input is two, we need not distinguish the subgraph and supergraph searches. A typical problem of the graph search with a single large graph is to output all embeddings (or occurrences)

{(ϕ (v_{1}), ϕ (v_{2}), \dots, ϕ (v_{| V (q) |}))}

in a large graph g for a query q consisting of vertices

(v_{1}, v_{2}, \dots, v_{| V (q) |})

when the query q is a subgraph of the large graph g. This problem is applicable to RDFs (Resource Description Frameworks), where each subject or object and predicate corresponds to a vertex and edge in a graph, respectively; PPI (Protein-Protein Interaction) networks, where each protein and interaction correspond to a vertex and edge, respectively; and intrusion alert networks where each computer and possible attack such as Denial-of-Service and TCP Service Sweep correspond a vertex and edge, respectively. Some of the methods for solving these kinds of problems use not only exact matching [35,36,37] but also approximate matching [38,39,40,41] with various similarity measures.

The most representative metric for measuring similarity between two graphs is the graph edit distance. Most of methods for graph search problems use this graph edit distance to measure similarities among graphs. Other most representative measures are to use maximum common subgraphs or graph kernels [42,43] between two graphs. There are two types of problems called maximum common induced subgraph problem [44] and maximum common edge subgraph problem [45] and the problems for finding the maximum common subgraphs in two graphs is NP-complete. In contrast, graph kernels have been proposed to classify graphs in the machine learning domain. The random walk graph kernel [46] returns a high similarity of graphs if random walks on the graphs produce similar sequences of vertex and edge labels. The Weisfeiler–Lehman graph kernel [47] uses the Weisfeiler–Lehman test, which was proposed to solve the graph isomorphism problem. The kernel iteratively relabels each vertex v by concatenating a label of v and labels of v’s adjacent vertices and returns high similarity for two relabeled graphs if the graphs have many common labels.

4. Straightforward Method for Constrained Problem

In Section 2, we defined the problem tackled in this paper. Here, for the sake of simplicity, we first provide methods for outputting the following:

{g_{i} \in G ∣ \exists q^{'} \subseteq_{i} q s . t . e d (g_{i}, q^{'}) \leq θ, q^{'} is connected, | V (g_{i}) | \leq | V (q^{'}) |},

(4)

by constraining the original problem. In the subsequent sections, we explain how the method can be extended to solve the original problem.

Definition 1

(AcGM code [33]). Given a graph

g = (V, E, Σ, ℓ)

, we assign vertex identifiers

u_{1}, u_{2}, \dots, u_{| V |}

to vertices in g and assume that a subgraph of g induced by vertices

u_{1}, u_{2}, \dots, u_{i} (1 \leq i \leq | V |)

is connected. g is represented by an adjacency matrix of dimension

| V | \times | V |

for which its elements

x_{i, j}

are

ℓ ((u_{i}, u_{j}))

if

(u_{i}, u_{j}) \in E

or 0 otherwise. The AcGM code of graph g is defined by concatenating the elements in the upper-right part of the adjacency matrix, as follows:

c o d e (g, ⟨ u_{1}, u_{2}, \dots, u_{| V |} ⟩) = s_{1} s_{2} \dots s_{| V |},

(5)

where each code fragment

s_{i}

is defined as follows:

s_{i} = ℓ (u_{i}) x_{1, i} x_{2, i} \dots x_{i - 2, i} x_{i - 1, i} .

s_{1} s_{2} \dots s_{i}

is called a prefix of Equation (5). In addition,

c_{2} \subseteq c_{1}

means that a prefix of an AcGM code

c_{1}

is equal to another AcGM code

c_{2}

and a graph expressed by AcGM code c is denoted as

g (c)

.

Definition 2

(Linear ordering between AcGM codes). When two AcGM codes

α = a_{1} a_{2} \dots a_{k}

and

β = b_{1} b_{2} \dots b_{h}

fulfill either of the following conditions, we say that

β ⪯ α

:

$\exists t s . t . 1 \leq t \leq m i n (k, h)$ , $a_{q} = b_{q}$ for $q < t$ and $b_{t} ≺_{e} a_{t}$ ;
$β \subseteq α$ ;

where

≺_{e}

is a linear ordering between code fragments [34].

Example 1.

Given the graph with four vertices shown in Figure 3, we have

4! = 24

possible permutations of the four vertices. Among the permutations, according to Definition 1,

c o d e (g, ⟨ v_{2}, v_{3}, v_{4}, v_{1} ⟩)

does not generate an AcGM code because a subgraph induced by

⟨ v_{2}, v_{3} ⟩

is not connected. We can generate 12 AcGM codes from the graph and we show three AcGM codes as follows. Symbols on the left side of the following adjacent matrices are vertex IDs and the letters on the upper side of the matrices are vertex labels. In addition, letters in the matrices are edge labels and zeros in the matrices and they indicate that there are no edges. Algorithms 14 00225 i001

Figure 3. Example of a graph.

By concatenating the letters and zeros on sides of arrows in the matrices, three AcGM codes are generated as follows.

\begin{matrix} α & = & c o d e (g, ⟨ v_{1}, v_{2}, v_{3}, v_{4} ⟩) = A B a C b 0 B a 00, \\ β & = & c o d e (g, ⟨ v_{2}, v_{1}, v_{3}, v_{4} ⟩) = B A a C 0 b B 0 a 0, a n d \\ γ & = & c o d e (g, ⟨ v_{4}, v_{1}, v_{2}, v_{3} ⟩) = B A a B 0 a C 0 b 0 . \end{matrix}

The order between the codes is

α ≺ γ ≺ β

.

Lemma 1.

If two AcGM codes are the same, the graphs represented by the codes are isomorphic.

Proof.

If two adjacency matrices are the same, the graphs represented by the matrices are isomorphic. Since each AcGM code of a graph g corresponds one-to-one with an adjacency matrix of g, if two AcGM codes are the same, the graphs represented by the codes are isomorphic. □

In contrast, if two graphs are isomorphic, their AcGM codes are not necessarily the same. This means that there are multiple AcGM codes of g that fulfill Definition 1. Thus, we denote the set of all AcGM codes of graph g as

Ω (g)

.

Lemma 2.

When an AcGM code c is a prefix of another code

c^{'}

,

g (c)

is connected and is an induced subgraph of

g (c^{'})

.

Proof.

Given an AcGM code c of a graph g, let its corresponding adjacency matrix be A. All submatrices

A^{'}

of A obtained by iteratively deleting the last row and column of A represent induced and connected subgraphs of g. According to Definition 5, AcGM codes for the matrices

A^{'}

must be prefices of the AcGM code for A. □

From the above discussion, one straightforward method for outputting the subgraphs defined by Equation (4) is to compute the graph edit distance between a graph

g_{i} \in G

and a graph

g (c^{'})

, represented by a prefix

c^{'} \subseteq c

for

c \in Ω (q)

. However, computing the edit distance between

g_{i}

and

g (c^{'})

by Equation (2) is not feasible. Therefore, we discuss a method for computing the edit distance by using the codes of

g_{i}

and

g (c^{'})

.

Lemma 3.

Given the following AcGM codes:

\begin{matrix} c & = & ℓ (u_{1}) ℓ (u_{2}) x_{1, 2} \dots ℓ (u_{a}) x_{1, a} \dots x_{a - 1, a} \end{matrix}

(6)

\begin{matrix} c^{'} & = & ℓ (u_{1}^{'}) ℓ (u_{2}^{'}) x_{1, 2}^{'} \dots ℓ (u_{b}^{'}) x_{1, b}^{'} \dots x_{b - 1, b}^{'}, \end{matrix}

(7)

where a and b are the numbers of vertices in

g (c)

and

g (c^{'})

, respectively, and the i-th vertex in

g (c)

maps to the i-th vertex in

g (c^{'})

, the number of edits

e d (c, c^{'})

to transform

g (c)

to

g (c^{'})

is the following:

\begin{matrix} e d (c, c^{'}) & = & \sum_{1 \leq i \leq min {a, b}} δ^{'} (ℓ (u_{i}), ℓ (u_{i}^{'})) + \sum_{1 < j \leq min {a, b}} \sum_{1 \leq i < j} δ^{'} (x_{i, j}, x_{i, j}^{'}) \\ + | a - b | + \{\begin{matrix} \sum_{b < j \leq a} \sum_{1 \leq i < j} δ^{'} (x_{i, j}, 0) & (b < a) \\ 0 & (a = b) \\ \sum_{a < j \leq b} \sum_{1 \leq i < j} δ^{'} (0, x_{i, j}^{'}) & (a < b), \end{matrix} \end{matrix}

(8)

where δ is the Kronecker delta and

δ^{'} (α, β) = 1 - δ (α, β)

. The complexity of solving Equation (8) is

O ({(max {a, b})}^{2})

. The

e d (c, c^{'})

function is symmetric:

e d (c, c^{'}) = e d (c^{'}, c)

.

Proof.

The first term of Equation (8) is the sum of the numbers of edits that occur because of differences between the i-th (

1 \leq i \leq min {a, b}

) vertices in

g (c)

and

g (c^{'})

and its second term is the sum of the numbers of edits that occur because of differences between edges

(i, j)

(

1 \leq i < j \leq min {a, b}

) in

g (c)

and

g (c^{'})

. Its third term is the sum of the numbers of edits to insert vertices in

g (c)

or

g (c^{'})

and its fourth term is the sum of the numbers of edits to insert edges in

g (c)

or

g (c^{'})

. □

Since Equation (8) defines the number of edits between two graphs under the known mapping such that

φ (i) = i

for all

i (1 \leq i \leq max {a, b})

, it corresponds to Equation (1). Therefore, Equation (2) is formalized by using Equation (8) as the following lemmas.

Lemma 4.

Given the two AcGM codes c and

c^{'}

defined in Equations (6) and (7), we have the following.

e d (g (c), g (c^{'})) \leq e d (c, c^{'}) .

(9)

Proof.

According to Equation (2),

e d (g (c), g (c^{'})) \leq e d (c, c^{'})

is obvious. □

Lemma 5.

Given an arbitrary AcGM code

c^{'}

of a graph

g^{'}

, the graph edit distance between g and

g^{'}

is the following.

e d (g, g^{'}) \leq min_{c \in Ω (g)} e d (c, c^{'}) .

(10)

Proof.

Let

Ω^{+} (g)

be codes generated from all possible permutations of vertices in g. So

Ω^{+} (g) = {c o d e (g, ⟨ u_{1}, u_{2}, \dots, u_{| V (g) |} ⟩) ∣ ⟨ u_{1}, u_{2}, \dots, u_{| V (g) |} ⟩ \in Φ (| V (g) |, (| V (g) |)} \supseteq Ω (g)

. Some codes

c^{+}

in

Ω^{+} (g)

are not AcGM codes because a prefix of the permutation from which

c^{+}

is generated may not induce a connected subgraph of g. Since

Ω^{+} (g)

are codes generated from all possible permutations of vertices in g, we have

e d (g, g^{'}) = {min}_{c \in Ω^{+} (g)} e d (c, c^{'})

. In addition, because

Ω (g) \subseteq Ω^{+} (g)

, we have

{min}_{c \in Ω^{+} (g)} e d (c, c^{'}) \leq {min}_{c \in Ω (g)} e d (c, c^{'})

. Therefore,

e d (g, g^{'}) = min_{c \in Ω^{+} (g)} e d (c, c^{'}) \leq min_{c \in Ω (g)} e d (c, c^{'})

is satisfied. □

Example 2.

As shown in Figure 4, given two graphs g and q for which the edit distance is 3, let an arbitrary AcGM code of g be AAb.

Ω (q)

is

{A A a B 0 a, A A a B a 0, A B a A a 0, B A a A 0 a}

. When the code AAb is compared with

A A a B 0 a \in Ω (q)

, we need three edits which are indicated as × in Figure 4. By comparing AAb and all elements in

Ω (q)

, we obtain the graph edit distance between q and g according to Lemma 10.

Figure 4. Computation of Graph Edit Distance using AcGM codes.

From Lemma 5, we have the following corollary.

Collorary 1.

If

e d (c, c^{'}) \leq θ

for two AcGM codes c and

c^{'}

, then

e d (g (c), g (c^{'})) \leq θ

.

Algorithm 1 shows the pseudocode for solving the problem defined by Equation (4). In Line 3, an arbitrary AcGM code from

Ω (g_{i})

is generated for graph

g_{i}

in database G. After one of all possible AcGM codes for q is obtained in Line 4, a prefix

c^{'}

, which consists of j fragments

c_{i}

, is generated in Line 6. Since

c^{'}

is a prefix of c,

g (c^{'})

must be a connected and is an induced subgraph of q. Subsequently,

g_{i}

is added to the set of output graphs S if

e d (c_{i}, c^{'}) \leq θ

, which indicates that

g_{i}

is a solution in the set defined by Equation (4) according to Corollary 1. Once

g_{i}

is added to S, any procedures for

g_{i}

are stopped and procedures for the next graph

g_{i + 1}

are started. These steps are repeated for all of the graphs in G. The time complexity of Algorithm 1 is

O (| Ω (q) | \sum_{i = 1}^{| G |} | V (g_{i}) |^{3})

and the algorithm has some drawbacks, as follows.

(1): Although the algorithm produces $| V (g_{i}) |$ prefixes in Line 6 for each AcGM code c in $Ω (q)$ and computes Equation (8) $| V (g_{i}) |$ times in Line 7 for the prefixes, some of the repeated calculations of Equation (8) are redundant because two prefixes $c^{'}$ and $c^{″}$ produced from c satisfy $c^{'} \subset c^{''}$ . Using the result of computing $e d (c_{i}, c^{'})$ to compute $e d (c_{i}, c^{″})$ would render Algorithm 1 efficient.
(2): Let $G_{c} = {c o d e (g_{i}) ∣ g_{i} \in G}$ be the set of codes produced in Line 3. AcGM codes $c o d e (g_{i})$ and $c o d e (g_{j})$ , for which their prefixes are the same, are included in $G_{c}$ . The repeated calculations of Equation (8) between these AcGM codes and $c^{'}$ are redundant. If the common prefix of $c o d e (g_{i})$ and $c o d e (g_{j})$ is $c_{s}$ , using the result of computing $e d (c_{s}, c^{'})$ to compute $e d (c o d e (g_{i}), c^{'})$ and $e d (c o d e (g_{j}), c^{'})$ would make Algorithm 1 efficient.

Algorithm 1: Straightforward Algorithm for Searching (4)

5. Method for Traversing Prefix Tree for Constrained Problem

In order to overcome drawback (1), described in the previous section, we introduce the following lemma.

Lemma 6.

Given two AcGM codes c and

c^{'}

, we have the following:

e d (p r e (c, i), p r e (c^{'}, i)) \leq e d (p r e (c, j), p r e (c^{'}, j)) f o r i < j .

(11)

In the case that i is greater than the number of fragments in c, we assume that

p r e (c, i)

is equivalent to c itself.

Proof.

e d (p r e (c, i + 1), p r e (c^{'}, i + 1))

is the sum of

e d (p r e (c, i), p r e (c^{'}, i))

and the number of edits between the

i + 1

-th fragments of

p r e (c, i + 1)

and

p r e (c^{'}, i + 1))

. Since the number of edits is non-negative,

e d (p r e (c, i), p r e (c^{'}, i))

increases monotonically when i is increased. Therefore, for

i < j

, we have

e d (p r e (c, i), p r e (c^{'}, i)) \leq e d (p r e (c, j), p r e (c^{'}, j))

. □

Collorary 2.

According to Lemma 6, if

e d (p r e (c, i), p r e (c^{'}, i)) > θ

, then

e d (c, c^{'}) > θ

.

As a consequence of Lemma 6 and Corollary 11, when

e d (p r e (c, i), p r e (c^{'}, i)) > θ

, we can stop the computation in Line 7 of Algorithm 1.

Lemma 7.

Given the two following fragments:

\begin{matrix} s & = & ℓ (u_{i}) x_{1, i} \dots x_{i - 1, i} \end{matrix}

(12)

\begin{matrix} s^{'} & = & ℓ (u_{i}^{'}) x_{1, i}^{'} \dots x_{i - 1, i}^{'}, \end{matrix}

(13)

for which their lengths are equal, the number of edits

e d (s, s^{'})

to equalize the fragments is as follows:

e d (s, s^{'}) = \{\begin{matrix} 1 + \sum_{1 \leq j < i} δ^{'} (0, x_{j, i}^{'}) & (s = n u l l) \\ 1 + \sum_{1 \leq j < i} δ^{'} (x_{j, i}, 0) & (s^{'} = n u l l) \\ δ^{'} (ℓ (u_{i}), ℓ (u_{i}^{'})) + \sum_{1 \leq j < i} δ^{'} (x_{j, i}, x_{j, i}^{'}) & (o t h e r w i s e) . \end{matrix}

(14)

Since

e d (s, s^{'})

is a symmetric function,

e d (s, s^{'}) = e d (s^{'}, s)

.

Proof.

Please see the proof for Lemma 3. □

Algorithm 2 shows the pseudocode for solving the problem defined by (4), which overcomes the aforementioned drawback (1). After obtaining the j-th fragments of AcGM codes

c_{i}

and c in Line 7 of the algorithm, the number of edits between the fragments is calculated according to Lemma 7. The number of edits is then added to the accumulated edit count

a e

. According to Corollary 2, the algorithm stops the calculation of the number of edits between

c_{i}

and one of the AcGM codes of q in Line 8. However, because

g_{i}

is a member of the set defined in Equation (4) if the edited distance between

c_{i}

and at least one AcGM code in

Ω (q)

is less than

θ

, these steps are repeated for all AcGM codes of q. The time complexity of Algorithm 2 is

O (| Ω (q) | \sum_{i = 1} | G | | V (g_{i}) |^{2})

.

Algorithm 2: Straightforward Algorithm 2 for Searching (4)

In order to overcome drawback (2), described in the previous section, we use a prefix tree for the AcGM codes for G.

Definition 3

(Code Tree [17] (The detailed algorithm for constructing a code tree from a graph database G is given in [17].)). A code tree T is defined as the triplet

(⊤, N, B)

, where

⊤ \in N

is the root node of the tree, N is a set of the nodes of the tree, and B is a set of the branches of the tree. In addition, each node is associated with a code fragment and a set of graph identifiers. Let

s (n)

be the concatenation of the fragments associated with nodes on the path from the root to node n. When one of the AcGM codes of

g_{i} \in G

is the same as

s (n)

, the set of graph identifiers for node n contains i, which is the identifier of graph

g_{i} \in G

.

The code fragment and graph identifiers associated with node n are denoted as

f r (n)

and

I D (n)

, respectively. For the root node,

f r (⊤) = n u l l

and

I D (⊤) = \emptyset

. We present an example of a code tree below.

Example 3.

A graph database

G = {g_{1}, g_{2}, g_{3}, g_{4}}

consists of four graphs, as shown in Figure 5. When one of the AcGM codes for each graph in G is generated, the AcGM codes for the four graphs are as follows.

\begin{matrix} c o d e (g_{1}, ⟨ v_{1}, v_{2}, v_{3} ⟩) & = & A B a C 0 b, \\ c o d e (g_{2}, ⟨ v_{1}, v_{2}, v_{3} ⟩) & = & A B a D 0 b, \\ c o d e (g_{3}, ⟨ v_{1}, v_{2} ⟩) & = & B B a, a n d \\ c o d e (g_{4}, ⟨ v_{1}, v_{2}, v_{3} ⟩) & = & B B a C 0 b . \end{matrix}

Figure 5. Code tree.

From these AcGM codes, a code tree is constructed, as shown in the right part of Figure 5. Node

n_{6}

of the tree is associated with the code fragment

C 0 b

and the set of graph identifiers

I D (n) = {1}

. The concatenation of fragments associated with nodes on the path from the root to node

n_{6}

is

A B a C 0 b

, which represents graph

g_{1}

in G.

Algorithm 3 shows the pseudocode for searching for members of the set defined in Equation (4) by traversing the code tree. At node n, Algorithm 3 adds vertex w of q to

⟨ w_{1}, w_{2}, \dots, w_{h} ⟩

, from which the fragment s is generated, with the addition of the prefix

c o d e (q, ⟨ w_{1}, w_{2}, \dots, w_{h} ⟩)

of an AcGM code of q. The algorithm then recursively traverses the children of n that have fragments similar to fragment s. In contrast to Algorithm 2, which iteratively checks whether each individual graph in the graph database is a solution, Algorithm 3 checks whether multiple graphs in the database are simultaneously solutions. This is because each node in the code tree is associated with the common prefix of the AcGM codes of multiple graphs in the database. This simultaneous checking renders our search very efficient. The worst case time complexity of Algorithm 3 is the same as the time complexity of Algorithm 2. When AcGM codes of graphs in the database do not have common prefixes as one another, the time complexity of Algorithm 3 becomes worst. However, the codes usually have common prefixes as one another and the computation time that this algorithm needs is proportional to the number of nodes that Algorithm 3 traverses. In addition, at each node, Algorithm 3 needs

O (h | V (q) | | N |)

.

Algorithm 3: Code Tree Search for Finding Solutions to Equation (4)

6. Method for Traversing Prefix Tree for Original Problem

At node n, for which its depth is h in the code tree, Algorithm 3 retains the accumulated edit count

a e

for two concatenations c and

c^{'}

of fragments of length h. c and

c^{'}

have the following restrictions:

c is a concatenation of fragments associated with nodes on the path from the root of the code tree to n and is an AcGM code of a connected and induced graph that is a common subgraph of multiple graphs in the graph database;
$c^{'}$ is a prefix of one of the AcGM codes in $Ω (q)$ . According to the definition of the AcGM code, $g (c^{'})$ must be connected and is an induced subgraph of q.

In order to solve the original problem for searching for solutions to Equation (3), we relax the restrictions as follows. The restriction that

g (c^{'})

is connected is a consequence of the fact that

c o d e (q, ⟨ w_{1}, w_{2}, \dots, w_{h} ⟩)

is a prefix of the AcGM codes

Ω (q)

of q. Therefore, in Line 4 of Algorithm 3., we relax this restriction by allowing

⟨ w_{1}, w_{2}, \dots, w_{h} ⟩

to be all possible h-permutations of

v e r t i c e s {v_{1}, v_{2}, \dots, v_{| V (q) |}}

in q. Codes

c o d e (q, ⟨ w_{1}, w_{2}, \dots, w_{h} ⟩)

by some of the permutations correspond to unconnected graphs.

Next, we relax the restriction that

g (c^{'})

is an induced subgraph of q. Algorithm 1 produces prefixes of the AcGM codes of q to obtain connected and induced subgraphs of q. Suppose that we would like to obtain, in addition to the connected and induced subgraphs, all possible unconnected and non-induced subgraphs of q. In this case, we may need, in addition to the prefixes of the AcGM codes of q, all possible codes generated by replacing some elements

x_{i, j}

in the prefixes by 0. This requires us to solve both the permutation problem among the vertices of q and the combination problem among the edges of q. However, we do not need to solve the latter problem for the following reasons. We consider a graph q, its induced subgraph

q^{'}

, and two graphs

g_{1}

and

g_{2}

, shown in Figure 6. We refer to two vertices in

q^{'}

as

v_{i}

and

v_{j}

.

Figure 6. Inclusion of graphs and similarity.

$e d (q^{'}, g_{1}) = 2$ . Given a graph $q^{″}$ obtained by removing an edge from $q^{'}$ , we have $e d (q^{'}, g_{1}) < e d (q^{″}, g_{1}) = 3$ . In the problem of finding solutions to Equation (3), we check whether there exists a subgraph of q such that $e d (q^{'}, g_{i}) \leq θ$ and this subgraph need not be an induced subgraph of q. Therefore, in the case that there is an edge between two vertices $v_{i}$ and $v_{j}$ in $q^{'}$ and there is also an edge in $g_{i}$ between the two corresponding vertices, we do not need to consider the graph $q^{″}$ obtained by removing an edge from $q^{'}$ . That is, we do not need to replace any elements $x_{i, j}$ in prefixes of the AcGM codes of $q^{'}$ by 0.
$e d (q^{'}, g_{2}) = 1$ . For the above $q^{″}$ , we have $e d (q^{'}, g_{2}) > e d (q^{″}, g_{2}) = 0$ . Therefore, in the case that there is an edge between two vertices $v_{i}$ and $v_{j}$ in $q^{'}$ and there is no edge in $g_{i}$ between the two corresponding vertices, we do not need to consider the graph that does not have an edge corresponding to edge $(v_{i}, v_{j})$ in $q^{'}$ .
In all other cases, there is no edge between $v_{i}$ and $v_{j}$ in $q^{'}$ . Since $x_{i, j} = 0$ , we do not need to replace $x_{i, j}$ by 0.

From the above discussion, we rewrite Lemma 7 as follows.

Lemma 8.

e d (s, s^{'})

= \{\begin{matrix} 1 + \sum_{1 \leq j < i} δ^{'} (x_{j, i}, 0) & (s^{'} = n u l l) \\ 1 & (s = n u l l) \\ δ^{'} (ℓ (v_{i}), ℓ (v_{i}^{'})) + \sum_{\begin{matrix} 1 \leq j < i \\ x_{i, j} \neq 0 \end{matrix}} δ^{'} (x_{j, i}, x_{j, i}^{'}) & (s \neq n u l l \land s^{'} \neq n u l l) . \end{matrix}

(15)

This function is asymmetric.

Since we consider all possible subgraphs of q by the function defined in Equation (8), we need to produce only the prefixes underlined above.

Algorithm 4 shows the pseudocode for searching for members of the set defined in Equation (3) by traversing the code tree. The part from Lines 4–9 searches

{g_{i} \in G ∣ q^{'} \subseteq q s . t . e d (q^{'}, g_{i}) \leq θ, | V (g_{i}) | \leq | V (q^{'}) |}

, whereas the part from Lines 10–13 searches

{g_{i} \in G ∣ q^{'} \subseteq q s . t . e d (q^{'}, g_{i}) \leq θ, | V (g_{i}) | > | V (q^{'}) |}

. The other difference between Algorithm 3 and Algorithm 4 is Line 4, which produces codes from one of all possible h-permutations of vertices of the query graph q. The codes produced are not limited to AcGM codes, but are AGM codes [24]. In Lines 7 and 11, Equation (15) is used to calculate the similarity between two fragments. Since the codes produced in Line 4 are limited to prefixes of the AGM codes of q, we do not need to solve the combination problem among the edges of q, which enables our proposed method to be very efficient. The time complexity of Algorithm 4 is the same as one of Algorithm 3; nevertheless, the problem that Algorithm 3 solves is a constrained problem of the problem that Algorithm 4 solves.

Algorithm 4: Code Tree Search for Finding Solutions to Equation (3)

7. Customization of Solutions

In the previous section, we proposed a method for searching for the set of graphs S defined by Equation (3). The number of graphs reachable by editing graph

q^{'}

θ

times in Equation (3) increases exponentially as

θ

is increased. Therefore, for a large

θ

, the computation time required to search for the graphs in S becomes excessively long. However, some real applications need to obtain the following sets of constrained graphs rather than obtaining all the graphs in S.

$⟨ 1 ⟩$: Editing graphs is constrained: for example, the relabeling of vertices or edges is admissible in $e d (q^{'}, g_{i})$ of Equation (3), whereas insertions and deletions are not admissible. That is, when converting a labeled graph g to an unlabeled graph $u n (g)$ by removing label information from vertices and edges in g, $u n (g_{i})$ and $u n (q^{'})$ in Equation (3) are isomorphic.
$⟨ 2 ⟩$: Editing graphs is constrained: for example, editing some specific vertices and edges in a query graph q is admissible, whereas editing other vertices and edges is not admissible. For example, in the substructure drawn with blue lines and labels in Figure 1, editing its ring structure is admissible, whereas editing the remainder of the blue substructure is not admissible.

The problem for the above constraint

⟨ 1 ⟩

is formalized as follows.

{g_{i} \in G ∣ \exists q^{'} \subseteq q s . t . e d (g_{i}, q^{'}) \leq θ \land u n (g_{i}) = u n (q^{'})} .

(16)

In order to solve this problem, Equation (15) is rewritten as follows.

\begin{matrix} e d (s, s^{'}) \\ = & \{\begin{matrix} δ^{'} (ℓ (v_{i}), ℓ (v_{i}^{'})) + \sum_{\begin{matrix} 1 \leq j < i \\ \land x_{i, j} \neq 0 \\ \land x_{i, j}^{'} \neq 0 \end{matrix}} δ^{'} (x_{j, i}, x_{j, i}^{'}) + \sum_{\begin{matrix} 1 \leq j < i \\ \land (x_{i, j} = 0 \\ \lor x_{i, j}^{'} = 0) \end{matrix}} \infty & (s \neq n u l l \land s^{'} \neq n u l l) \\ \infty & (o t h e r w i s e) . \end{matrix} \end{matrix}

(17)

For the case of

s \neq n u l l \land s^{'} \neq n u l l

, the first and second terms in Equation (17) are the numbers of edits that occur by relabeling the i-th vertices and edges

(i, j) (1 \leq j < i)

, respectively, in

g \in G

and

q^{'}

. The third term of Equation (17) is the sum of the numbers of edits that occur by inserting or deleting edges. For this case, in containing such insertion or deletion operations, the

e d (s, s^{'})

function returns values larger than

θ

and Algorithm 4 (in Line 7) backtracks the code tree.

For the case of the above constraint

⟨ 2 ⟩

, let some specific vertices in query graph q be

V_{q} \subseteq V (q)

. When Algorithm 4 visits node n in the code tree, it generates various codes by

⟨ w_{1}, w_{2}, \dots, w_{h}, w ⟩

, in Line 4. Some vertices in

⟨ w_{1}, w_{2}, \dots, w_{h}, w ⟩

are included in

V_{q}

, but others are not. In this case, Equation (15) is rewritten as follows.

\begin{matrix} e d (s, s^{'}) \\ = & \{\begin{matrix} δ^{'} (ℓ (v_{i}), ℓ (v_{i}^{'})) + \sum_{\begin{matrix} w_{j} \in V_{q} \\ \land 1 \leq j < i \\ \land x_{i, j} \neq 0 \end{matrix}} δ^{'} (x_{j, i}, x_{j, i}^{'}) + \sum_{\begin{matrix} w_{j} \notin V_{q} \\ \land 1 \leq j < i \\ \land x_{i, j} \neq 0 \end{matrix}} \infty & (s, s^{'} \neq n u l l \land w_{i} \in V_{q}) \\ \infty & (o t h e r w i s e) . \end{matrix} \end{matrix}

(18)

Thus, for the case of

s \neq n u l l \land s^{'} \neq n u l l \land w_{i} \in V_{q}

, the first and second terms of Equation (18) are the numbers of edits that occur by relabeling the i-th vertices and edges

(i, j) (1 \leq j < i)

, respectively, in

g \in G

and

q^{'}

, where

w_{i}

and

w_{j}

are elements in

V_{q}

. The third term of Equation (18) is the sum of the numbers of edits that occur by editing the other vertices and edges.

In order to summarize the above discussion, the advantages of our proposed method are the following:

Users can search for their desired type of similar graphs in a database containing graphs;
This search can be realized by simply rewriting Equation (15), without the need to modify Algorithm 4.

Therefore, the method proposed in this paper is a general and flexible framework for searching for similar subgraphs of query graphs and is easily customizable for searching for the types of graphs desired by the user.

8. Experimental Evaluation

For our experimental evaluation, we used the three real-world datasets used in [18] which were downloadable from https://github.com/SNUCSE-CTA/IDAR on 1 March 2021. The datasets consisting of chemical compounds are called AIDS, NCI, and PubChem. Each atom, chemical bond, atom type, and bond type in a chemical compound corresponds to a vertex, edge, vertex label, and edge label, respectively. Sets of queries and databases containing graphs were generated from each of the datasets. One hundred query graphs, each containing more than 100 vertices, were selected from each of the datasets. Each generated database contained between 10,000 and 100,000 graphs and each of the graphs was generated by one of two different methods [18]. In the first method, each graph in a database consisted of vertices and edges on a random walk on a graph randomly selected from one of the datasets. This type of database is called rand. In the second method, each graph in a database was a frequent subgraph with a minimum support value of 0.1% [24] that is enumerated from one of the datasets. This type of database is called freq. In the experiments, vertices corresponding to hydrogen were removed, which is similar to the experiments in [2,14,17,30]. Table 2 summaries graphs in databses and queries in our experiments.

Table 2. Summary of datasets used in our experiments.

Since the problem of similar supergraph search is a novel problem proposed in this paper, there are no methods for solving this problem other than our proposed method to the best of our knowledge. Although we cannot compare our proposed method with other methods, our method is based on CodeTree (We use “CodeTree” to name the method proposed in [17] and “code tree” to represent the index used in CodeTree). This is proposed in [17] and our method with

θ = 0

is equivalent to CodeTree. Our proposed method was implemented in Java and the experiments reported in this section were conducted on a workstation with an AMD Ryzen Threadripper 3970X 3.7 GHz CPU and 64 GB main memory.

First, we conducted experiments for Algorithm 4 with Equation (17). Figure 7, Figure 8 and Figure 9 show the average computation time (query processing time) t, the average number of nodes n in the code tree that our method traversed, and the average number of solutions

| S |

, respectively, for various numbers of graphs

| G |

in a database and various thresholds

θ

. The computation time increased as the number of graphs in a database increased, as shown in Figure 7. Although the time complexity of Algorithms 1 and 2 is linearly related to the number of graphs in the database, the computation time of our method was sublinear for the number of graphs in the database because Algorithm 4 maintains the correspondence between a vertex of a query graph and vertices of multiple graphs in the database simultaneously. In contrast, the computation time increased exponentially as

θ

was increased, as shown in Figure 7. This is because the number of graphs reachable by editing a graph

q^{'} \subseteq q

θ

times increases exponentially. Although the computation time increased as the database size increased for freq databases, the rate of increase was less than for rand databases. This is because the numbers of nodes in code trees for freq databases are smaller than those for rand databases and the former numbers are relatively unchanged by the increase in database size, as shown in Figure 10. The reason why the numbers of nodes in code trees are largely unaffected by database size is that frequent subgraphs enumerated from the datasets are similar to each other and the AcGM codes of similar frequent subgraphs have common prefixes. This is because a subgraph of a frequent subgraph is also a frequent subgraph, which is a consequence of the anti-monotonic property of the support value [24].

Figure 7. Average computation times t for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 8. Average numbers of nodes n in the code tree that our method traversed for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 9. Number of solutions

| S |

for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 10. Number of nodes in code trees.

As shown in Figure 8, the number of nodes in the code tree that Algorithm 4 traversed also increased exponentially as

θ

increased. The shapes of the curves plotted in Figure 8 are similar to those in Figure 7 and the trend for the numbers

| G |

of graphs in databases and thresholds

θ

in Figure 8 are also similar to those in Figure 7. Figure 11 shows

t / n

for various numbers of graphs

| G |

in a database and various thresholds

θ

. The figure shows that

t / n

is almost constant and unaffected by the number of graphs in a database, which indicates that the computation time of our method is proportional to the number of nodes in the code tree that Algorithm 4 traverses.

Figure 11. Computation time per node

t / n

for various numbers of graphs

| G |

in a database and various thresholds

θ

.

As shown in Figure 9, the number of solutions found by Algorithm 4 was proportional to the number of graphs in the databases. Because there is a limit to the number of graphs in the database, the number of solutions did not increase exponentially as

| G |

increased. The rectangle in Figure 12 shows the space in which graphs in a database are distributed and each point in the rectangle shows a graph in the database. Points inside the circle are solutions for a given query q and threshold

θ

. If the number of graphs in database G is doubled to

G^{'}

, such that

2 | G | = | G^{'} |

, the number of points surrounded by the circle doubles, if the distribution is not changed such that

P (G) = P (G^{'})

. Thus, the number of solutions found by Algorithm 4 is proportional to the numbers of graphs in the databases.

Figure 12. Distribution of graphs in a database.

Figure 13 shows

t / | S |

for various numbers of graphs

| G |

in a database and various thresholds

θ

; the figure indicates the average computation time to output each solution. As the numbers of graphs in the databases increased,

t / | S |

decreased exponentially. This is because

| S |

is proportional to the number of graphs in a database, whereas the overall computation time to search for solutions is sublinear.

Figure 13. Computation time per solution

t / | S |

for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 14 shows that average computation time and the number of nodes that Algorithm 4 traversed for various average numbers of vertices in graphs in a database and various thresholds

θ

. When the average number of vertices in graphs in a database is increased, the average computation time of Algorithm 4 is largely unaltered because the computation time of Algorithm 4 is proportional to the number of nodes that Algorithm 4 traverses, as mentioned after Algorithm 3. The number is also largely unaltered with respect to the change of the average number of the vertices. The reason why the number of nodes that Algorithm 4 traversed is largely unaltered is because of the use of the function

e d (s, s^{'})

shown in Equation (17). By using Equation (17), the relabeling of vertices or edges is admissible, whereas insertions and deletions of the vertices and edges are not admissible. Therefore, nodes that Algorithm 4 traverses are restricted. As mentioned in Section 7, one of the advantages of our proposed method is the customizability for searching for the types of graphs desired by users. By customizing the search for the types of graphs, the users can reduce not only the number of solutions to be outputted but also computation time of our proposed method.

Figure 14. Average computation time and the average number of nodes that Algorithm 4 traversed for various numbers of vertices in graphs in a database and various thresholds

θ

.

Next, we conducted experiments for Algorithm 4 with Equations (15) and (17). Figure 15 shows the computation times for various numbers of graphs in the databases and various thresholds. The computation times with Equation (17) were much shorter than those with Equation (15). This is because the number of graphs reachable by editing a graph

q^{'} \subseteq q

θ

times increases exponentially and editing graphs with Equation (17) is limited to the relabeling of vertices and edges, which is faster than the insertion and deletion of vertices and edges that are required with Equation (15).

Figure 15. Computation times of Algorithm 4 with Equation (15) and Equation (17) for various numbers of graphs

| G |

in a database and various thresholds

θ

.

9. Conclusions

In this paper, we proposed a novel problem, which is called similar supergraph search, and designed an efficient algorithm to solve the problem. The problem is to identify all graphs in a database that are similar to any subgraph of a query graph, where similarity is defined as the edit distance. The proposed algorithm has three advantages for searching a database consisting of graphs for similar graphs. The first advantage is that the algorithm checks whether multiple graphs in the database are simultaneously solutions, because each node in the code tree is associated with the common prefix of the AcGM codes of multiple graphs in the database. The second advantage is that the codes produced by our proposed algorithm are limited to prefixes of the AGM codes of q and we do not need to solve the combination problem among the edges of q. The third advantage is the customization that enables users to search for their desired type of similar graphs in a database containing graphs. Therefore, the algorithm is a general and flexible framework for searching for similar subgraphs of query graphs and it is easily customizable for searching for the types of graphs desired by the user. Our experiments showed that the computation time increased exponentially as the distance threshold increased, but increased sublinearly with the number of graphs in the database.

Author Contributions

Conceptualization, A.I.; methodology, A.I.; software, M.Y. and A.I.; validation, M.Y. and A.I.; writing—original draft, A.I.; writing—review and editing, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Number JP20K11835.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not available.

Conflicts of Interest

The authors declare no conflict of interest.

References

Shiraki, K. Characteristics of a candidate of an antiviral medication against COVID-19. Jpn. Med. J. 2020, 5005, 25–31. (In Japanese) [Google Scholar]
Bonnici, V.; Ferro, A.; Giugno, R.; Pulvirenti, A.; Shasha, D.E. Enhancing Graph Database Indexing by Suffix Tree Structure. In Proceedings of the IAPR International Conference on Pattern Recognition in Bioinformatics, Nijmegen, The Netherlands, 22–24 September 2010; pp. 195–203. [Google Scholar]
Cheng, J.; Ke, Y.; Ng, W.; Lu, A. FG-Index: Towards Verification-Free Query Processing on Graph Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 11–14 June 2007; pp. 857–872. [Google Scholar]
Cheng, J.; Ke, Y.; Ng, W. Efficient Query Processing on Graph Databases. ACM Trans. Database Syst. 2009, 2, 48. [Google Scholar] [CrossRef]
Klein, K.; Kriege, N.M.; Mutzel, P. CT-Index: Fingerprint-based Graph Indexing Combining Cycles and Trees. In Proceedings of the IEEE International Conference on Data Engineering, Hannover, Germany, 11–16 April 2011; pp. 1115–1126. [Google Scholar]
Shang, H.; Zhang, Y.; Lin, X.; Yu, J.X. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. Proc. Vldb Endow. 2008, 1, 364–375. [Google Scholar] [CrossRef]
Sun, S.; Luo, Q. Scaling Up Subgraph Query Processing with Efficient Subgraph Matching. In Proceedings of the IEEE International Conference on Data Engineering, Paris, France, 16–19 April 2019; pp. 220–231. [Google Scholar]
Williams, D.W.; Huan, J.; Wang, W. Graph Database Indexing Using Structured Graph Decomposition. In Proceedings of the IEEE International Conference on Data Engineering, Istanbul, Turkey, 17–20 April 2007; pp. 976–985. [Google Scholar]
Xie, Y.; Yu, P.S. CP-Index: On the Efficient Indexing of Large Graphs. In Proceedings of the ACM Conference on Information and Knowledge Management, Glasgow, UK, 24–28 October 2011; pp. 1795–1804. [Google Scholar]
Yan, X.; Yu, P.S.; Han, J. Graph Indexing: A Frequent Structure-based Approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; pp. 335–346. [Google Scholar]
Yuan, D.; Mitra, P. Lindex: A Lattice-based Index for Graph Databases. VLDB J. 2013, 22, 229–252. [Google Scholar] [CrossRef]
Zhang, S.; Hu, M.; Yang, J. TreePi: A Novel Graph Indexing Method. In Proceedings of the IEEE International Conference on Data Engineering, Istanbul, Turkey, 17–20 April 2007; pp. 966–975. [Google Scholar]
Zhao, P.; Yu, J.X.; Yu, P.S. Graph Indexing: Tree + Delta >= Graph. In Proceedings of the International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 938–949. [Google Scholar]
Zou, L.; Chen, L.; Yu, J.X.; Lu, Y. A Novel Spectral Coding in a Large Graph Database. In Proceedings of the International Conference on Extending Database Technology, Nantes, France, 25–29 March 2008; pp. 181–192. [Google Scholar]
Chen, C.; Yan, X.; Yu, P.S.; Han, J.; Zhang, D.; Gu, X. Towards Graph Containment Search and Indexing. In Proceedings of the International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 926–937. [Google Scholar]
Cheng, J.; Ke, Y.; Fu, A.W.; Yu, J.X. Fast Graph Query Processing with a Low-Cost Index. VLDB J. 2011, 20, 521–539. [Google Scholar] [CrossRef][Green Version]
Imai, S.; Inokuchi, A. Efficient Supergraph Search Using Graph Coding. IEICE Trans. Inf. Syst. 2020, 103-D, 130–141. [Google Scholar] [CrossRef]
Kim, H.; Min, S.; Park, K.; Lin, X.; Hong, S.; Han, W. IDAR: Fast Supergraph Search Using DAG Integration. Proc. Vldb Endow. 2020, 13, 1456–1468. [Google Scholar] [CrossRef]
Lyu, B.; Qin, L.; Lin, X.; Chang, L.; Yu, J.X. Scalable Supergraph Search in Large Graph Databases. In Proceedings of the IEEE International Conference on Data Engineering, Helsinki, Finland, 16–20 May 2016; pp. 157–168. [Google Scholar]
Yuan, D.; Mitra, P.; Giles, C.L. Mining and Indexing Graphs for Supergraph Search. Proc. Vldb Endow. 2013, 6, 829–840. [Google Scholar] [CrossRef]
Zhang, S.; Li, J.; Gao, H.; Zou, Z. A Novel Approach for Efficient Supergraph Query Processing on Graph Databases. In Proceedings of the International Conference on Extending Database Technology, Saint-Petersburg, Russia, 24–26 March 2009; pp. 204–215. [Google Scholar]
Zhu, G.; Lin, X.; Zhang, W.; Wang, W.; Shang, H. PrefIndex: An Efficient Supergraph Containment Search Technique. In Proceedings of the International Conference on Scientific and Statistical Database Management, Heidelberg, Germany, 30 June–2 July 2010; pp. 360–378. [Google Scholar]
Riesen, K. Structural Pattern Recognition with Graph Edit Distance—Approximation Algorithms and Applications. In Advances in Computer Vision and Pattern Recognition; Springer: Berlin, Germany, 2015. [Google Scholar]
Inokuchi, A.; Washio, T.; Motoda, H. An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 13–16 September 2000; pp. 13–23. [Google Scholar]
Chang, L.; Feng, X.; Lin, X.; Qin, L.; Zhang, W.; Ouyang, D. Speeding Up GED Verification for Graph Similarity Search. In Proceedings of the IEEE International Conference on Data Engineering, Dallas, TX, USA, 20–24 April 2020; pp. 793–804. [Google Scholar]
Gouda, K.; Hassaan, M. CS_GED: An Efficient Approach for Graph Edit Similarity Computation. In Proceedings of the IEEE International Conference on Data Engineering, Helsinki, Finland, 16–20 May 2016; pp. 265–276. [Google Scholar]
Kim, J.; Choi, D.; Li, C. Inves: Incremental Partitioning-based Verification for Graph Similarity Search. In Proceedings of the International Conference on Extending Database Technology, Lisbon, Portugal, 26–29 March 2019; pp. 229–240. [Google Scholar]
Liang, Y.; Zhao, P. Similarity Search in Graph Databases: A Multi-Layered Indexing Approach. In Proceedings of the IEEE International Conference on Data Engineering, San Diego, CA, USA, 19–22 April 2017; pp. 783–794. [Google Scholar]
Wang, X.; Ding, X.; Tung, A.K.H.; Ying, S.; Jin, H. An Efficient Graph Indexing Method. In Proceedings of the IEEE International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; pp. 210–222. [Google Scholar]
Zhao, X.; Xiao, C.; Lin, X.; Wang, W.; Ishikawa, Y. Efficient Processing of Graph Similarity Queries with Edit Distance Constraints. VLDB J. 2013, 22, 727–752. [Google Scholar] [CrossRef]
Zhao, X.; Xiao, C.; Lin, X.; Zhang, W.; Wang, Y. Efficient Structure Similarity Searches: A Partition-based Approach. VLDB J. 2018, 27, 53–78. [Google Scholar] [CrossRef]
Zheng, W.; Zou, L.; Lian, X.; Wang, D.; Zhao, D. Efficient Graph Similarity Search Over Large Graph Databases. IEEE Trans. Knowl. Data Eng. 2015, 27, 964–978. [Google Scholar] [CrossRef]
Inokuchi, A.; Washio, T.; Nishimura, Y.; Motoda, H. A Fast Algorithm for Mining Frequent Connected Subgraphs; IBM Research: Yorktown Heights, NY, USA, 2002. [Google Scholar]
Yan, X.; Han, J. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 721–724. [Google Scholar]
Bi, F.; Chang, L.; Lin, X.; Qin, L.; Zhang, W. Efficient Subgraph Matching by Postponing Cartesian Products. In Proceedings of the International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1199–1214. [Google Scholar]
Sun, Z.; Wang, H.; Wang, H.; Shao, B.; Li, J. Efficient Subgraph Matching on Billion Node Graphs. Proc. Vldb Endow. 2012, 5, 788–799. [Google Scholar] [CrossRef]
Zhang, S.; Li, S.; Yang, J. GADDI: Distance Index based Subgraph Matching in Biological Networks. In Proceedings of the International Conference on Extending Database Technology, Saint Petersburg, Russia, 24–26 March 2009; pp. 192–203. [Google Scholar]
Khan, A.; Li, N.; Yan, X.; Guan, Z.; Chakraborty, S.; Tao, S. Neighborhood based Fast Graph Search in Large Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12–16 June 2011; pp. 901–912. [Google Scholar]
Khan, A.; Wu, Y.; Aggarwal, C.C.; Yan, X. NeMa: Fast Graph Search with Label Similarity. Proc. Vldb Endow. 2013, 181–192. [Google Scholar] [CrossRef]
Tian, Y.; McEachin, R.C.; Santos, C.; States, D.J.; Patel, J.M. SAGA: A Subgraph Matching Tool for Biological Graphs. Bioinformatics 2007, 23, 232–239. [Google Scholar] [CrossRef] [PubMed]
Zhang, S.; Yang, J.; Jin, W. SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs. Proc. Vldb Endow. 2010, 3, 1185–1194. [Google Scholar] [CrossRef]
Borgwardt, K.M.; Ghisu, M.E.; Llinares-López, F.; O’Bray, L.; Rieck, B. Graph Kernels: State-of-the-Art and Future Challenges. Found. Trends Mach. Learn. 2002, 13, 531–712. [Google Scholar] [CrossRef]
Wang, X.; Smalter, A.M.; Huan, J.; Lushington, G.H. G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases. In Proceedings of the International Conference on Extending Database Technology, Nantes, France, 25–29 March 2008; pp. 472–480. [Google Scholar]
Raymond, J.W.; Willett, P. Maximum Common Subgraph Isomorphism Algorithms for the Matching of Chemical Structures. J. Comput. Aided Mol. Des. 2002, 16, 521–533. [Google Scholar] [CrossRef] [PubMed]
Bahiense, L.; Manic, G.; Piva, B.; de Souza, C.C. The Maximum Common Edge Subgraph Problem: A Polyhedral Investigation. Discret. Appl. Math. 2012, 160, 2523–2541. [Google Scholar] [CrossRef]
Kashima, H.; Tsuda, K.; Inokuchi, A. Marginalized Kernels Between Labeled Graphs. In Proceedings of the International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 321–328. [Google Scholar]
Shervashidze, N.; Schweitzer, P.; van Leeuwen, E.J.; Mehlhorn, K.; Borgwardt, K.M. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. 2011, 12, 2539–2561. [Google Scholar]

Figure 1. Similar Chemical Compounds.

Figure 2. Example of supergraph search.

Figure 3. Example of a graph.

Figure 4. Computation of Graph Edit Distance using AcGM codes.

Figure 5. Code tree.

Figure 6. Inclusion of graphs and similarity.

Figure 7. Average computation times t for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 8. Average numbers of nodes n in the code tree that our method traversed for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 9. Number of solutions

| S |

for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 10. Number of nodes in code trees.

Figure 11. Computation time per node

t / n

for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 12. Distribution of graphs in a database.

Figure 13. Computation time per solution

t / | S |

for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Figure 14. Average computation time and the average number of nodes that Algorithm 4 traversed for various numbers of vertices in graphs in a database and various thresholds

θ

.

Figure 15. Computation times of Algorithm 4 with Equation (15) and Equation (17) for various numbers of graphs

| G |

in a database and various thresholds

θ

.

Table 1. Summary of graph search problems and references.

	Complete Matching	Similar Matching
graph search	${g_{i} \in G ∣ q = g_{i}}$	${g_{i} \in G ∣ e d (g_{i}, q) \leq θ}$ [25,26,27,28,29,30,31,32]
supergraph search	${g_{i} \in G ∣ g_{i} \subseteq q}$ [15,16,17,18,19,20,21,22]	${g_{i} \in G ∣ \exists q^{'} \subseteq q s . t . e d (g_{i}, q^{'}) \leq θ}$ This paper (novel problem).
subgraph search	${g_{i} \in G ∣ q \subseteq g_{i}}$ [2,3,4,5,6,7,8,9,10,11,12,13,14]	${g_{i} \in G ∣ \exists g \subseteq g_{i} s . t . e d (g, q) \leq θ}$ There are no papers to our best knowledge.

Table 2. Summary of datasets used in our experiments.

	AIDS		NCI		PubChem
	Rand	Freq	Rand	Freq	Rand	Freq
# of vertex labels	31	10	42	15	16	9
# of edge labels	3	3	3	3	3	2
graphs in databases
# of graphs	100,000	100,000	100,000	100,000	100,000	100,000
avg. # of vertices	29.0	21.4	28.2	16.9	27.6	24.1
max. # of vertices	100	26	79	27	84	32
min. # of vertices	1	10	1	5	1	11
avg. # of edges	30.0	20.4	29.2	16.1	28.2	23.1
query graphs
# of graphs	100		100		100
avg. # of vertices	71.0		63.9		64.9
max. # of vertices	222		132		175
min. # of vertices	42		40		34
avg. # of edges	74.9		68.9		67.8

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Similar Supergraph Search Based on Graph Edit Distance

Abstract

1. Introduction

2. Problem Definition

3. Related Work

4. Straightforward Method for Constrained Problem

5. Method for Traversing Prefix Tree for Constrained Problem

6. Method for Traversing Prefix Tree for Original Problem

7. Customization of Solutions

8. Experimental Evaluation

9. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics