Similar Supergraph Search Based on Graph Edit Distance

: Subgraph and supergraph search methods are promising techniques for the development of new drugs. For example, the chemical structure of favipiravir—an antiviral treatment for inﬂuenza—resembles the structure of some components of RNA. Represented as graphs, such compounds are similar to a subgraph of favipiravir. However, the existing supergraph search meth-ods can only discover compounds that match exactly. We propose a novel problem, called similar supergraph search, and design an efﬁcient algorithm to solve it. The problem is to identify all graphs in a database that are similar to any subgraph of a query graph, where similarity is deﬁned as edit distance. Our algorithm represents the set of candidate subgraphs by a code tree, which it uses to efﬁciently compute edit distance. With a distance threshold of zero, our algorithm is equivalent to an existing efﬁcient algorithm for exact supergraph search. Our experiments show that the computation time increased exponentially as the distance threshold increased, but increased sublinearly with the number of graphs in the database.


Introduction
The coronavirus disease (COVID-19) has been spreading widely since 2019. In Japan, favipiravir (brand name: Avigan) has been examined as a promising antiviral medication against COVID-19. Favipiravir was initially developed as a medication to treat influenza. When Favipiravir is ribosylated, its structure resembles acadesine, which is a precursor to inosine. Inosine is the previous stage of guanosine and adenosine, which are components of RNA [1]. With this treatment, the RNA synthetase of a virus mistakenly incorporates favipiravir, instead of guanosine, into the replicating virus. This causes RNA synthesis to stop and limits the replication of viruses inside the body. Figure 1 shows the chemical structures of guanosine, inosine, acadesine, and ribosylated favipiravir. The substructure α, drawn in red, is common to all of guanosine, inosine, and acadesine. Therefore, in order to find candidate compounds for a new drug, it may be useful to search for compounds containing a substructure, provided as a query, in a database that consist of many compounds (i.e., substructure search). Conversely, if many substructures relevant to new drug development are known, it can be useful to search for substructures contained in a newly developed chemical compound, provided as a query, in a database recording these substructures (i.e., superstructure search). This paper discusses methods related to the latter task. The substructure drawn in blue in Figure 1 is similar to the substructure α, but they do not match perfectly. In the situation shown, the substructure α, which contributes to anti-influenza agents, is registered in a database. The database contains various substructures for which its medicinal effects, side effects and so on are known. If ribosylated favipiravir is given as the query and the substructure α is the output as the search result, we may be able to discover that ribosylated favipiravir is anti-influenza. Therefore, such a search engine for chemical compounds would be very useful and effective.  A compound can be represented as a graph, in which the vertices and edges correspond to the atoms and chemical bonds, respectively. The aforementioned substructure and superstructure search correspond to subgraph search [2][3][4][5][6][7][8][9][10][11][12][13][14] and supergraph search [15][16][17][18][19][20][21][22], respectively, which are actively being studied by researchers. A graph is a highly versatile data structure for representing information in many fields. It can be used to represent human relationship networks, protein interaction networks, resource description frameworks (RDF), computer-aided designs (CAD), and computer vision images; moreover, subgraph and supergraph search are used for many applications in addition to searching for compounds. However, there is no method for searching for graphs that are similar to subgraphs of a query graph and no method applicable to the resemblance between substructures shown in Figure 1. This paper proposes the problem of similar supergraph search and also proposes a method to solve it.
In this paper, we measure the similarity between two graphs g and g by the graph edit distance. The graph edit distance is the minimum number of edits required to transform g to another g , where an edit is either an insertion, deletion, or relabeling of a vertex or edge. The number of graphs obtained by editing the graph g θ times increases drastically, especially when the numbers of vertices and edges are large and the varieties of vertex and edge labels are wide. Such a drastic increase in edited graphs makes it difficult to find similar graphs. However, some real applications need to obtain a subset of graphs edited θ times. For example, editing the substructure in a broken ellipse of Figure 1 is not admissible, whereas editing the remainder of the blue substructure is admissible. By restricting an editable part in the structure, the number of graphs obtained by editing the graph g θ times is decreased. In our proposed method, users can search for their desired type of similar graphs in a database containing graphs and this search can be realized by simply rewriting only one function without the need to modify our proposed algorithm. Therefore, the method proposed in this paper is a general and flexible framework for searching for similar subgraphs of query graphs and is easily customizable for searching for the types of graphs desired by the user.
The remainder of the paper is organized as follows. Section 2 formalizes the novel problem of searching for graphs contained in a query graph in a database storing many graphs and Section 3 discusses related work. In Sections 4 and 5, we propose straightforward methods for a simpler problem in which the searched graphs are constrained. In Sections 6 and 7, we propose a novel method for solving the original problem by extending the above methods. In Section 8, we analyze the computational efficiency of the proposed method by using real-world datasets. Finally, we conclude the paper in Section 9.

Problem Definition
A labeled graph is expressed as g = (V, E, Σ, ), where V is a set of vertices, E ⊆ V × V is a set of edges, Σ is a set of labels for the vertices and edges, and : V ∪ E → Σ is a function to assign the labels to the vertices and edges. We denote the set of all vertices in graph g as V(g). A graph g is called a directed graph if the edges in g have directions; otherwise, g is called an undirected graph. A sequence of edges from v 1 to v 2 is called a path and a graph (for a directed graph, all edges in the graph are replaced with undirected edges.) and a path that exists between any two vertices is called connected.
Given graphs g = (V, E, Σ, ) and g = (V , E , Σ , ), ∀v, v 1 , v 2 ∈ V , when an injective function φ : V → V fulfills the following conditions, g is called a subgraph of g, which is denoted as g ⊆ g: we call g an induced subgraph of g, which is denoted as g ⊆ i g. Although this paper focuses only on undirected and connected graphs, the methods proposed can easily be extended to directed or unconnected graphs.
The graph edit distance is the minimum number of edits required to transform a graph g to another graph g , where an edit is either an insertion, deletion, or relabeling of a vertex or edge. In this paper, we assume that the cost of each edit is 1. The number of edits between graphs g = (V, E, Σ, ) and g = (V , E , Σ, ) is defined as follows: where we assume that |V| ≤ |V | and V is extended to V + = V ∪ {ε 1 , ε 2 , . . . , ε |V |−|V| } [23]. In addition, when g is transformed to g , the mapping between V and V is expressed as a bijective mapping ϕ : to v j ∈ V means performing a vertex insertion in g . Therefore, the graph edit distance ed(g, g ) between the graphs g and g is defined as follows: where Φ(α, β) is a set of all possible β-permutations of the integers 1, 2, . . . , α. The solution to Equation (2) is one of the quadratic assignment problems, which are known to be NP-complete. Given a set of graphs G = {g 1 , g 2 , . . . , g n }, a query graph q, and a threshold θ as input, the problem tackled in this paper is to output a set of graphs.
In Equation (3), note that q is a subgraph of q but not an induced subgraph of q. When θ is 0 in Equation (3), the problem is equivalent to the original supergraph search problem. Therefore, our problem, represented by Equation (3), is an extension of the supergraph search problem.

Related Work
In Section 2, we defined the problem tackled in this paper. Recently, much attention has been focused on searching for graphs satisfying some conditions in a database consisting of multiple graphs. Table 1 summarizes graph search problems and the publications that tackle each problem. The subgraph search problem is to find graphs that contain the query graph and can be expressed as follows: whereas the supergraph search problem is to find graphs that the query graph contains and it can be expressed as follows.
These problems contain the subgraph isomorphism problem, which is known to be NP-complete. The similar graph search problem is to find graphs that are similar to the query graph, and can be expressed as This problem contains the graph edit distance problem, which is known to be NPcomplete. Various methods have been proposed to solve these problems. The problem tackled in this paper is to find graphs that are similar to subgraphs of the query graph and contains both the subgraph isomorphism problem and graph edit distance problem. This problem is a novel one, to the best of our knowledge. Since the subgraph isomorphism problem and graph edit distance problem are NP-complete, a naive method for solving either of the problems for a given query graph and for each graph in the database would be unrealistic.
Most of the methods proposed in the articles cited in Table 1 are based on filtering and verification and an index in each of the methods has a set of patterns P extracted from the database in advance. For example, for the supergraph search problem, in the filtering phase, is obtained. Subsequently, the verification phase checks whether g ⊆ q, for each graph g ∈ G . Figure 2a shows four graphs in a database and two patterns contained in an index. The index has the information that p 1 ⊂ g 1 , p 2 ⊂ g 2 , p 1 ⊂ g 4 , and p 2 ⊂ g 4 . Given the query shown in Figure 2c, supergraph search methods first check whether p 1 ⊆ q and p 2 ⊆ q and then obtain G = {g 1 , g 3 } in the filtering phase. In the verification phase, they test whether g 1 ⊆ q and g 3 ⊆ q. Since the number of vertices in p 2 is less than one for g 2 , the computation time to check whether p 2 ⊆ q becomes less than one for g 2 ⊆ q and we do not need to check whether g 2 ⊆ q in the filtering phase. The patterns in P are either paths, trees, cycles, or graphs extracted from databases. The patterns are extracted by the enumeration or mining method. The enumeration method enumerates all subgraphs in graphs in a database, where the subgraphs are restricted to either paths, trees, cycles, or graphs. In contrast, the mining method enumerates all frequent patterns [24], which are defined as from graphs in a database, where σ(p, G) is defined as the number of graphs in G containing pattern p: The extracted patterns are stored in a tree or lattice structure in the index of the graph database. Enumerating patterns and mining patterns from a database are timeconsuming processes.

Complete Matching
Similar Matching [25][26][27][28][29][30][31][32] supergraph [15][16][17][18][19][20][21][22] This paper (novel problem). [2][3][4][5][6][7][8][9][10][11][12][13][14] There are no papers to our best knowledge. CodeTree [17] is a method for the supergraph search problem. It does not distinguish between filtering and verification phases; it filters out graphs that are not contained in a given query graph while verifying unfiltered graphs. In addition, CodeTree does not enumerate or mine patterns in a database. All graphs in a database are stored in the form of one of the graph codes, such as AcGM [33] code or DFS [34] code, in the index in CodeTree. The AcGM and DFS codes are based on adjacency matrices and adjacency lists, which are alternatives to graph representations. The algorithms for constructing an index for a database and searching graphs in the index are independent of the graph codes and, thus, the method can be integrated and extended with various graph codes for representing graphs in accordance with the intended characteristics of those codes. For this reason, the method proposed in this paper is based on CodeTree.
In this paper, we focus on the problems in which each database consists of multiple graphs. Other subgraph search problems are to search for a database which consists of "a single large graph". Since the number of graphs given as input is two, we need not distinguish the subgraph and supergraph searches. A typical problem of the graph search with a single large graph is to output all embeddings (or occurrences) {(φ(v 1 ), φ(v 2 ), . . . , φ(v |V(q)| ))} in a large graph g for a query q consisting of vertices (v 1 , v 2 , . . . , v |V(q)| ) when the query q is a subgraph of the large graph g. This problem is applicable to RDFs (Resource Description Frameworks), where each subject or object and predicate corresponds to a vertex and edge in a graph, respectively; PPI (Protein-Protein Interaction) networks, where each protein and interaction correspond to a vertex and edge, respectively; and intrusion alert networks where each computer and possible attack such as Denial-of-Service and TCP Service Sweep correspond a vertex and edge, respectively. Some of the methods for solving these kinds of problems use not only exact matching [35][36][37] but also approximate matching [38][39][40][41] with various similarity measures.
The most representative metric for measuring similarity between two graphs is the graph edit distance. Most of methods for graph search problems use this graph edit distance to measure similarities among graphs. Other most representative measures are to use maximum common subgraphs or graph kernels [42,43] between two graphs. There are two types of problems called maximum common induced subgraph problem [44] and maximum common edge subgraph problem [45] and the problems for finding the maximum common subgraphs in two graphs is NP-complete. In contrast, graph kernels have been proposed to classify graphs in the machine learning domain. The random walk graph kernel [46] returns a high similarity of graphs if random walks on the graphs produce similar sequences of vertex and edge labels. The Weisfeiler-Lehman graph kernel [47] uses the Weisfeiler-Lehman test, which was proposed to solve the graph isomorphism problem. The kernel iteratively relabels each vertex v by concatenating a label of v and labels of v's adjacent vertices and returns high similarity for two relabeled graphs if the graphs have many common labels.

Straightforward Method for Constrained Problem
In Section 2, we defined the problem tackled in this paper. Here, for the sake of simplicity, we first provide methods for outputting the following: by constraining the original problem. In the subsequent sections, we explain how the method can be extended to solve the original problem.
Definition 1 (AcGM code [33]). Given a graph g = (V, E, Σ, ), we assign vertex identifiers u 1 , u 2 , . . . , u |V| to vertices in g and assume that a subgraph of g induced by vertices The AcGM code of graph g is defined by concatenating the elements in the upper-right part of the adjacency matrix, as follows: where each code fragment s i is defined as follows: s 1 s 2 · · · s i is called a prefix of Equation (5). In addition, c 2 ⊆ c 1 means that a prefix of an AcGM code c 1 is equal to another AcGM code c 2 and a graph expressed by AcGM code c is denoted as g(c).
Definition 2 (Linear ordering between AcGM codes). When two AcGM codes α = a 1 a 2 · · · a k and β = b 1 b 2 · · · b h fulfill either of the following conditions, we say that β α: where ≺ e is a linear ordering between code fragments [34]. Figure 3, we have 4! = 24 possible permutations of the four vertices. Among the permutations, according to Definition 1, code(g, v 2 , v 3 , v 4 , v 1 ) does not generate an AcGM code because a subgraph induced by v 2 , v 3 is not connected. We can generate 12 AcGM codes from the graph and we show three AcGM codes as follows. Symbols on the left side of the following adjacent matrices are vertex IDs and the letters on the upper side of the matrices are vertex labels. In addition, letters in the matrices are edge labels and zeros in the matrices and they indicate that there are no edges.

Example 1. Given the graph with four vertices shown in
? ?
By concatenating the letters and zeros on sides of arrows in the matrices, three AcGM codes are generated as follows.
The order between the codes is α ≺ γ ≺ β. Proof. If two adjacency matrices are the same, the graphs represented by the matrices are isomorphic. Since each AcGM code of a graph g corresponds one-to-one with an adjacency matrix of g, if two AcGM codes are the same, the graphs represented by the codes are isomorphic.
In contrast, if two graphs are isomorphic, their AcGM codes are not necessarily the same. This means that there are multiple AcGM codes of g that fulfill Definition 1. Thus, we denote the set of all AcGM codes of graph g as Ω(g).

Lemma 2.
When an AcGM code c is a prefix of another code c , g(c) is connected and is an induced subgraph of g(c ).
Proof. Given an AcGM code c of a graph g, let its corresponding adjacency matrix be A. All submatrices A of A obtained by iteratively deleting the last row and column of A represent induced and connected subgraphs of g. According to Definition 5, AcGM codes for the matrices A must be prefices of the AcGM code for A.
From the above discussion, one straightforward method for outputting the subgraphs defined by Equation (4) is to compute the graph edit distance between a graph g i ∈ G and a graph g(c ), represented by a prefix c ⊆ c for c ∈ Ω(q). However, computing the edit distance between g i and g(c ) by Equation (2) is not feasible. Therefore, we discuss a method for computing the edit distance by using the codes of g i and g(c ). Lemma 3. Given the following AcGM codes: where a and b are the numbers of vertices in g(c) and g(c ), respectively, and the i-th vertex in g(c) maps to the i-th vertex in g(c ), the number of edits ed(c, c ) to transform g(c) to g(c ) is the following: where δ is the Kronecker delta and δ (α, β) = 1 − δ(α, β). The complexity of solving Equation (8) is O((max{a, b}) 2 ). The ed(c, c ) function is symmetric: ed(c, c ) = ed(c , c).
Proof. The first term of Equation (8) is the sum of the numbers of edits that occur because of differences between the i-th (1 ≤ i ≤ min{a, b}) vertices in g(c) and g(c ) and its second term is the sum of the numbers of edits that occur because of differences between edges (i, j) (1 ≤ i < j ≤ min{a, b}) in g(c) and g(c ). Its third term is the sum of the numbers of edits to insert vertices in g(c) or g(c ) and its fourth term is the sum of the numbers of edits to insert edges in g(c) or g(c ).

Lemma 5.
Given an arbitrary AcGM code c of a graph g , the graph edit distance between g and g is the following. Proof.
Let Ω + (g) be codes generated from all possible permutations of vertices in g. So . Some codes c + in Ω + (g) are not AcGM codes because a prefix of the permutation from which c + is generated may not induce a connected subgraph of g.
Since Ω + (g) are codes generated from all possible permutations of vertices in g, we have ed(g, g ) = min c∈Ω + (g) ed(c, c ). In addition, because Ω(g) ⊆ Ω + (g), we have min c∈Ω + (g) ed(c, c ) ≤ min c∈Ω(g) ed(c, c ). Therefore, is satisfied. Figure 4, given two graphs g and q for which the edit distance is 3, let an arbitrary AcGM code of g be AAb. Ω(q) is {AAaB0a, AAaBa0, ABaAa0, BAaA0a}. When the code AAb is compared with AAaB0a ∈ Ω(q), we need three edits which are indicated as × in Figure 4. By comparing AAb and all elements in Ω(q), we obtain the graph edit distance between q and g according to Lemma 10. From Lemma 5, we have the following corollary.
Algorithm 1 shows the pseudocode for solving the problem defined by Equation (4). In Line 3, an arbitrary AcGM code from Ω(g i ) is generated for graph g i in database G. After one of all possible AcGM codes for q is obtained in Line 4, a prefix c , which consists of j fragments c i , is generated in Line 6. Since c is a prefix of c, g(c ) must be a connected and is an induced subgraph of q. Subsequently, g i is added to the set of output graphs S if ed(c i , c ) ≤ θ, which indicates that g i is a solution in the set defined by Equation (4) according to Corollary 1. Once g i is added to S, any procedures for g i are stopped and procedures for the next graph g i+1 are started. These steps are repeated for all of the graphs ) and the algorithm has some drawbacks, as follows.
(1) Although the algorithm produces |V(g i )| prefixes in Line 6 for each AcGM code c in Ω(q) and computes Equation (8) |V(g i )| times in Line 7 for the prefixes, some of the repeated calculations of Equation (8) are redundant because two prefixes c and c produced from c satisfy c ⊂ c . Using the result of computing ed(c i , c ) to compute ed(c i , c ) would render Algorithm 1 efficient.
be the set of codes produced in Line 3. AcGM codes code(g i ) and code(g j ), for which their prefixes are the same, are included in G c . The repeated calculations of Equation (8) between these AcGM codes and c are redundant. If the common prefix of code(g i ) and code(g j ) is c s , using the result of computing ed(c s , c ) to compute ed(code(g i ), c ) and ed(code(g j ), c ) would make Algorithm 1 efficient.

Method for Traversing Prefix Tree for Constrained Problem
In order to overcome drawback (1), described in the previous section, we introduce the following lemma. Lemma 6. Given two AcGM codes c and c , we have the following: ed(pre(c, i), pre(c , i)) ≤ ed(pre(c, j), pre(c , j)) for i < j.
In the case that i is greater than the number of fragments in c, we assume that pre(c, i) is equivalent to c itself.
Proof. ed(pre(c, i + 1), pre(c , i + 1)) is the sum of ed(pre(c, i), pre(c , i)) and the number of edits between the i + 1-th fragments of pre(c, i + 1) and pre(c , i + 1)). Since the number of edits is non-negative, ed(pre(c, i), pre(c , i)) increases monotonically when i is increased. Therefore, for i < j, we have ed(pre(c, i), pre(c , i)) ≤ ed(pre(c, j), pre(c , j)).
As a consequence of Lemma 6 and Corollary 11, when ed(pre(c, i), pre(c , i)) > θ, we can stop the computation in Line 7 of Algorithm 1.

Lemma 7.
Given the two following fragments: for which their lengths are equal, the number of edits ed(s, s ) to equalize the fragments is as follows: Since ed(s, s ) is a symmetric function, ed(s, s ) = ed(s , s).
Proof. Please see the proof for Lemma 3.
Algorithm 2 shows the pseudocode for solving the problem defined by (4), which overcomes the aforementioned drawback (1). After obtaining the j-th fragments of AcGM codes c i and c in Line 7 of the algorithm, the number of edits between the fragments is calculated according to Lemma 7. The number of edits is then added to the accumulated edit count ae. According to Corollary 2, the algorithm stops the calculation of the number of edits between c i and one of the AcGM codes of q in Line 8. However, because g i is a member of the set defined in Equation (4) if the edited distance between c i and at least one AcGM code in Ω(q) is less than θ, these steps are repeated for all AcGM codes of q. The time complexity of Algorithm 2 is O(|Ω(q)| ∑ i=1 |G||V(g i )| 2 ). In order to overcome drawback (2), described in the previous section, we use a prefix tree for the AcGM codes for G. [17] (The detailed algorithm for constructing a code tree from a graph database G is given in [17].)). A code tree T is defined as the triplet ( , N, B), where ∈ N is the root node of the tree, N is a set of the nodes of the tree, and B is a set of the branches of the tree. In addition, each node is associated with a code fragment and a set of graph identifiers. Let s(n) be the concatenation of the fragments associated with nodes on the path from the root to node n. When one of the AcGM codes of g i ∈ G is the same as s(n), the set of graph identifiers for node n contains i, which is the identifier of graph g i ∈ G.

Definition 3 (Code Tree
The code fragment and graph identifiers associated with node n are denoted as f r(n) and ID(n), respectively. For the root node, f r( ) = null and ID( ) = ∅. We present an example of a code tree below.

Example 3.
A graph database G = {g 1 , g 2 , g 3 , g 4 } consists of four graphs, as shown in Figure 5. When one of the AcGM codes for each graph in G is generated, the AcGM codes for the four graphs are as follows.
From these AcGM codes, a code tree is constructed, as shown in the right part of Figure 5. Node n 6 of the tree is associated with the code fragment C0b and the set of graph identifiers ID(n) = {1}. The concatenation of fragments associated with nodes on the path from the root to node n 6 is ABaC0b, which represents graph g 1 in G.
%D^` Algorithm 3 shows the pseudocode for searching for members of the set defined in Equation (4) by traversing the code tree. At node n, Algorithm 3 adds vertex w of q to w 1 , w 2 , . . . , w h , from which the fragment s is generated, with the addition of the prefix code(q, w 1 , w 2 , . . . , w h ) of an AcGM code of q. The algorithm then recursively traverses the children of n that have fragments similar to fragment s. In contrast to Algorithm 2, which iteratively checks whether each individual graph in the graph database is a solution, Algorithm 3 checks whether multiple graphs in the database are simultaneously solutions. This is because each node in the code tree is associated with the common prefix of the AcGM codes of multiple graphs in the database. This simultaneous checking renders our search very efficient. The worst case time complexity of Algorithm 3 is the same as the time complexity of Algorithm 2. When AcGM codes of graphs in the database do not have common prefixes as one another, the time complexity of Algorithm 3 becomes worst. However, the codes usually have common prefixes as one another and the computation time that this algorithm needs is proportional to the number of nodes that Algorithm 3 traverses. In addition, at each node, Algorithm 3 needs O(h|V(q)||N|).

Method for Traversing Prefix Tree for Original Problem
At node n, for which its depth is h in the code tree, Algorithm 3 retains the accumulated edit count ae for two concatenations c and c of fragments of length h. c and c have the following restrictions: • c is a concatenation of fragments associated with nodes on the path from the root of the code tree to n and is an AcGM code of a connected and induced graph that is a common subgraph of multiple graphs in the graph database; • c is a prefix of one of the AcGM codes in Ω(q). According to the definition of the AcGM code, g(c ) must be connected and is an induced subgraph of q.
In order to solve the original problem for searching for solutions to Equation (3), we relax the restrictions as follows. The restriction that g(c ) is connected is a consequence of the fact that code(q, w 1 , w 2 , . . . , w h ) is a prefix of the AcGM codes Ω(q) of q. Therefore, in Line 4 of Algorithm 3., we relax this restriction by allowing w 1 , w 2 , . . . , w h to be all possible h-permutations of vertices{v 1 , v 2 , . . . , v |V(q)| } in q. Codes code(q, w 1 , w 2 , . . . , w h ) by some of the permutations correspond to unconnected graphs.
Next, we relax the restriction that g(c ) is an induced subgraph of q. Algorithm 1 produces prefixes of the AcGM codes of q to obtain connected and induced subgraphs of q. Suppose that we would like to obtain, in addition to the connected and induced subgraphs, all possible unconnected and non-induced subgraphs of q. In this case, we may need, in addition to the prefixes of the AcGM codes of q, all possible codes generated by replacing some elements x i,j in the prefixes by 0. This requires us to solve both the permutation problem among the vertices of q and the combination problem among the edges of q. However, we do not need to solve the latter problem for the following reasons. We consider a graph q, its induced subgraph q , and two graphs g 1 and g 2 , shown in Figure 6. We refer to two vertices in q as v i and v j .
• ed(q , g 1 ) = 2. Given a graph q obtained by removing an edge from q , we have ed(q , g 1 ) < ed(q , g 1 ) = 3. In the problem of finding solutions to Equation (3), we check whether there exists a subgraph of q such that ed(q , g i ) ≤ θ and this subgraph need not be an induced subgraph of q. Therefore, in the case that there is an edge between two vertices v i and v j in q and there is also an edge in g i between the two corresponding vertices, we do not need to consider the graph q obtained by removing an edge from q . That is, we do not need to replace any elements x i,j in prefixes of the AcGM codes of q by 0. • ed(q , g 2 ) = 1. For the above q , we have ed(q , g 2 ) > ed(q , g 2 ) = 0. Therefore, in the case that there is an edge between two vertices v i and v j in q and there is no edge in g i between the two corresponding vertices, we do not need to consider the graph that does not have an edge corresponding to edge (v i , v j ) in q .
• In all other cases, there is no edge between v i and v j in q . Since x i,j = 0, we do not need to replace x i,j by 0. Figure 6. Inclusion of graphs and similarity.
From the above discussion, we rewrite Lemma 7 as follows.
This function is asymmetric.
Since we consider all possible subgraphs of q by the function defined in Equation (8), we need to produce only the prefixes underlined above.
Algorithm 4 shows the pseudocode for searching for members of the set defined in Equation (3) by traversing the code tree. The part from Lines 4-9 searches {g i ∈ G | q ⊆ q s.t. ed(q , g i ) ≤ θ, |V(g i )| ≤ |V(q )|}, whereas the part from Lines 10-13 searches {g i ∈ G | q ⊆ q s.t. ed(q , g i ) ≤ θ, |V(g i )| > |V(q )|}. The other difference between Algorithm 3 and Algorithm 4 is Line 4, which produces codes from one of all possible h-permutations of vertices of the query graph q. The codes produced are not limited to AcGM codes, but are AGM codes [24]. In Lines 7 and 11, Equation (15) is used to calculate the similarity between two fragments. Since the codes produced in Line 4 are limited to prefixes of the AGM codes of q, we do not need to solve the combination problem among the edges of q, which enables our proposed method to be very efficient. The time complexity of Algorithm 4 is the same as one of Algorithm 3; nevertheless, the problem that Algorithm 3 solves is a constrained problem of the problem that Algorithm 4 solves.

Algorithm 4: Code Tree Search for Finding Solutions to Equation (3)
Data: The set of graphs S, query q, threshold θ, current node n, w 1 , w 2 , . . . , w h , and accumulated edit count ae.

Customization of Solutions
In the previous section, we proposed a method for searching for the set of graphs S defined by Equation (3). The number of graphs reachable by editing graph q θ times in Equation (3) increases exponentially as θ is increased. Therefore, for a large θ, the computation time required to search for the graphs in S becomes excessively long. However, some real applications need to obtain the following sets of constrained graphs rather than obtaining all the graphs in S.
1 Editing graphs is constrained: for example, the relabeling of vertices or edges is admissible in ed(q , g i ) of Equation (3), whereas insertions and deletions are not admissible. That is, when converting a labeled graph g to an unlabeled graph un(g) by removing label information from vertices and edges in g, un(g i ) and un(q ) in Equation (3) are isomorphic. 2 Editing graphs is constrained: for example, editing some specific vertices and edges in a query graph q is admissible, whereas editing other vertices and edges is not admissible. For example, in the substructure drawn with blue lines and labels in Figure 1, editing its ring structure is admissible, whereas editing the remainder of the blue substructure is not admissible.
The problem for the above constraint 1 is formalized as follows.
In order to solve this problem, Equation (15) is rewritten as follows.
ed(s, s ) For the case of s = null ∧ s = null, the first and second terms in Equation (17) are the numbers of edits that occur by relabeling the i-th vertices and edges (i, j) (1 ≤ j < i), respectively, in g ∈ G and q . The third term of Equation (17) is the sum of the numbers of edits that occur by inserting or deleting edges. For this case, in containing such insertion or deletion operations, the ed(s, s ) function returns values larger than θ and Algorithm 4 (in Line 7) backtracks the code tree.
For the case of the above constraint 2 , let some specific vertices in query graph q be V q ⊆ V(q). When Algorithm 4 visits node n in the code tree, it generates various codes by w 1 , w 2 , . . . , w h , w , in Line 4. Some vertices in w 1 , w 2 , . . . , w h , w are included in V q , but others are not. In this case, Equation (15) is rewritten as follows.
Thus, for the case of s = null ∧ s = null ∧ w i ∈ V q , the first and second terms of Equation (18) are the numbers of edits that occur by relabeling the i-th vertices and edges (i, j) (1 ≤ j < i), respectively, in g ∈ G and q , where w i and w j are elements in V q . The third term of Equation (18) is the sum of the numbers of edits that occur by editing the other vertices and edges.
In order to summarize the above discussion, the advantages of our proposed method are the following: • Users can search for their desired type of similar graphs in a database containing graphs; • This search can be realized by simply rewriting Equation (15), without the need to modify Algorithm 4.
Therefore, the method proposed in this paper is a general and flexible framework for searching for similar subgraphs of query graphs and is easily customizable for searching for the types of graphs desired by the user.

Experimental Evaluation
For our experimental evaluation, we used the three real-world datasets used in [18] which were downloadable from https://github.com/SNUCSE-CTA/IDAR on 1 March 2021. The datasets consisting of chemical compounds are called AIDS, NCI, and PubChem. Each atom, chemical bond, atom type, and bond type in a chemical compound corresponds to a vertex, edge, vertex label, and edge label, respectively. Sets of queries and databases containing graphs were generated from each of the datasets. One hundred query graphs, each containing more than 100 vertices, were selected from each of the datasets. Each generated database contained between 10,000 and 100,000 graphs and each of the graphs was generated by one of two different methods [18]. In the first method, each graph in a database consisted of vertices and edges on a random walk on a graph randomly selected from one of the datasets. This type of database is called rand. In the second method, each graph in a database was a frequent subgraph with a minimum support value of 0.1% [24] that is enumerated from one of the datasets. This type of database is called freq.
In the experiments, vertices corresponding to hydrogen were removed, which is similar to the experiments in [2,14,17,30]. Table 2 summaries graphs in databses and queries in our experiments. Since the problem of similar supergraph search is a novel problem proposed in this paper, there are no methods for solving this problem other than our proposed method to the best of our knowledge. Although we cannot compare our proposed method with other methods, our method is based on CodeTree (We use "CodeTree" to name the method proposed in [17] and "code tree" to represent the index used in CodeTree). This is proposed in [17] and our method with θ = 0 is equivalent to CodeTree. Our proposed method was implemented in Java and the experiments reported in this section were conducted on a workstation with an AMD Ryzen Threadripper 3970X 3.7 GHz CPU and 64 GB main memory.
First, we conducted experiments for Algorithm 4 with Equation (17). Figures 7-9 show the average computation time (query processing time) t, the average number of nodes n in the code tree that our method traversed, and the average number of solutions |S|, respectively, for various numbers of graphs |G| in a database and various thresholds θ. The computation time increased as the number of graphs in a database increased, as shown in Figure 7. Although the time complexity of Algorithms 1 and 2 is linearly related to the number of graphs in the database, the computation time of our method was sublinear for the number of graphs in the database because Algorithm 4 maintains the correspondence between a vertex of a query graph and vertices of multiple graphs in the database simultaneously. In contrast, the computation time increased exponentially as θ was increased, as shown in Figure 7. This is because the number of graphs reachable by editing a graph q ⊆ q θ times increases exponentially. Although the computation time increased as the database size increased for freq databases, the rate of increase was less than for rand databases. This is because the numbers of nodes in code trees for freq databases are smaller than those for rand databases and the former numbers are relatively unchanged by the increase in database size, as shown in Figure 10. The reason why the numbers of nodes in code trees are largely unaffected by database size is that frequent subgraphs enumerated from the datasets are similar to each other and the AcGM codes of similar frequent subgraphs have common prefixes. This is because a subgraph of a frequent subgraph is also a frequent subgraph, which is a consequence of the anti-monotonic property of the support value [24]. Figure 8, the number of nodes in the code tree that Algorithm 4 traversed also increased exponentially as θ increased. The shapes of the curves plotted in Figure 8 are similar to those in Figure 7 and the trend for the numbers |G| of graphs in databases and thresholds θ in Figure 8 are also similar to those in Figure 7. Figure 11 shows t/n for various numbers of graphs |G| in a database and various thresholds θ. The figure shows that t/n is almost constant and unaffected by the number of graphs in a database, which indicates that the computation time of our method is proportional to the number of nodes in the code tree that Algorithm 4 traverses.  As shown in Figure 9, the number of solutions found by Algorithm 4 was proportional to the number of graphs in the databases. Because there is a limit to the number of graphs in the database, the number of solutions did not increase exponentially as |G| increased. The rectangle in Figure 12 shows the space in which graphs in a database are distributed and each point in the rectangle shows a graph in the database. Points inside the circle are solutions for a given query q and threshold θ. If the number of graphs in database G is doubled to G , such that 2|G| = |G |, the number of points surrounded by the circle doubles, if the distribution is not changed such that P (G) = P (G ). Thus, the number of solutions found by Algorithm 4 is proportional to the numbers of graphs in the databases.      Figure 13 shows t/|S| for various numbers of graphs |G| in a database and various thresholds θ; the figure indicates the average computation time to output each solution. As the numbers of graphs in the databases increased, t/|S| decreased exponentially. This is because |S| is proportional to the number of graphs in a database, whereas the overall computation time to search for solutions is sublinear.  Figure 13. Computation time per solution t/|S| for various numbers of graphs |G| in a database and various thresholds θ. Figure 14 shows that average computation time and the number of nodes that Algorithm 4 traversed for various average numbers of vertices in graphs in a database and various thresholds θ. When the average number of vertices in graphs in a database is increased, the average computation time of Algorithm 4 is largely unaltered because the computation time of Algorithm 4 is proportional to the number of nodes that Algorithm 4 traverses, as mentioned after Algorithm 3. The number is also largely unaltered with respect to the change of the average number of the vertices. The reason why the number of nodes that Algorithm 4 traversed is largely unaltered is because of the use of the function ed(s, s ) shown in Equation (17). By using Equation (17), the relabeling of vertices or edges is admissible, whereas insertions and deletions of the vertices and edges are not admissible. Therefore, nodes that Algorithm 4 traverses are restricted. As mentioned in Section 7, one of the advantages of our proposed method is the customizability for searching for the types of graphs desired by users. By customizing the search for the types of graphs, the users can reduce not only the number of solutions to be outputted but also computation time of our proposed method.  Next, we conducted experiments for Algorithm 4 with Equations (15) and (17). Figure 15 shows the computation times for various numbers of graphs in the databases and various thresholds. The computation times with Equation (17) were much shorter than those with Equation (15). This is because the number of graphs reachable by editing a graph q ⊆ q θ times increases exponentially and editing graphs with Equation (17) is limited to the relabeling of vertices and edges, which is faster than the insertion and deletion of vertices and edges that are required with Equation (15).  Figure 15. Computation times of Algorithm 4 with Equation (15) and Equation (17) for various numbers of graphs |G| in a database and various thresholds θ.

Conclusions
In this paper, we proposed a novel problem, which is called similar supergraph search, and designed an efficient algorithm to solve the problem. The problem is to identify all graphs in a database that are similar to any subgraph of a query graph, where similarity is defined as the edit distance. The proposed algorithm has three advantages for searching a database consisting of graphs for similar graphs. The first advantage is that the algorithm checks whether multiple graphs in the database are simultaneously solutions, because each node in the code tree is associated with the common prefix of the AcGM codes of multiple graphs in the database. The second advantage is that the codes produced by our proposed algorithm are limited to prefixes of the AGM codes of q and we do not need to solve the combination problem among the edges of q. The third advantage is the customization that enables users to search for their desired type of similar graphs in a database containing graphs. Therefore, the algorithm is a general and flexible framework for searching for similar subgraphs of query graphs and it is easily customizable for searching for the types of graphs desired by the user. Our experiments showed that the computation time increased exponentially as the distance threshold increased, but increased sublinearly with the number of graphs in the database.