1. Introduction
The coronavirus disease (COVID-19) has been spreading widely since 2019. In Japan, favipiravir (brand name: Avigan) has been examined as a promising antiviral medication against COVID-19. Favipiravir was initially developed as a medication to treat influenza. When Favipiravir is ribosylated, its structure resembles acadesine, which is a precursor to inosine. Inosine is the previous stage of guanosine and adenosine, which are components of RNA [
1]. With this treatment, the RNA synthetase of a virus mistakenly incorporates favipiravir, instead of guanosine, into the replicating virus. This causes RNA synthesis to stop and limits the replication of viruses inside the body.
Figure 1 shows the chemical structures of guanosine, inosine, acadesine, and ribosylated favipiravir. The substructure 
, drawn in red, is common to all of guanosine, inosine, and acadesine. Therefore, in order to find candidate compounds for a new drug, it may be useful to search for compounds containing a substructure, provided as a query, in a database that consist of many compounds (i.e., substructure search). Conversely, if many substructures relevant to new drug development are known, it can be useful to search for substructures contained in a newly developed chemical compound, provided as a query, in a database recording these substructures (i.e., superstructure search). This paper discusses methods related to the latter task. The substructure drawn in blue in 
Figure 1 is similar to the substructure 
, but they do not match perfectly. In the situation shown, the substructure 
, which contributes to anti-influenza agents, is registered in a database. The database contains various substructures for which its medicinal effects, side effects and so on are known. If ribosylated favipiravir is given as the query and the substructure 
 is the output as the search result, we may be able to discover that ribosylated favipiravir is anti-influenza. Therefore, such a search engine for chemical compounds would be very useful and effective.
 A compound can be represented as a graph, in which the vertices and edges correspond to the atoms and chemical bonds, respectively. The aforementioned substructure and superstructure search correspond to subgraph search [
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14] and supergraph search [
15,
16,
17,
18,
19,
20,
21,
22], respectively, which are actively being studied by researchers. A graph is a highly versatile data structure for representing information in many fields. It can be used to represent human relationship networks, protein interaction networks, resource description frameworks (RDF), computer-aided designs (CAD), and computer vision images; moreover, subgraph and supergraph search are used for many applications in addition to searching for compounds. However, there is no method for searching for graphs that are similar to subgraphs of a query graph and no method applicable to the resemblance between substructures shown in 
Figure 1. This paper proposes the problem of similar supergraph search and also proposes a method to solve it.
In this paper, we measure the similarity between two graphs 
g and 
 by the graph edit distance. The graph edit distance is the minimum number of edits required to transform g to another 
, where an edit is either an insertion, deletion, or relabeling of a vertex or edge. The number of graphs obtained by editing the graph 
g times increases drastically, especially when the numbers of vertices and edges are large and the varieties of vertex and edge labels are wide. Such a drastic increase in edited graphs makes it difficult to find similar graphs. However, some real applications need to obtain a subset of graphs edited 
 times. For example, editing the substructure in a broken ellipse of 
Figure 1 is not admissible, whereas editing the remainder of the blue substructure is admissible. By restricting an editable part in the structure, the number of graphs obtained by editing the graph 
g times is decreased. In our proposed method, users can search for their desired type of similar graphs in a database containing graphs and this search can be realized by simply rewriting only one function without the need to modify our proposed algorithm. Therefore, the method proposed in this paper is a general and flexible framework for searching for similar subgraphs of query graphs and is easily customizable for searching for the types of graphs desired by the user.
The remainder of the paper is organized as follows. 
Section 2 formalizes the novel problem of searching for graphs contained in a query graph in a database storing many graphs and 
Section 3 discusses related work. In 
Section 4 and 
Section 5, we propose straightforward methods for a simpler problem in which the searched graphs are constrained. In 
Section 6 and 
Section 7, we propose a novel method for solving the original problem by extending the above methods. In 
Section 8, we analyze the computational efficiency of the proposed method by using real-world datasets. Finally, we conclude the paper in 
Section 9.
  2. Problem Definition
A labeled graph is expressed as , where V is a set of vertices,  is a set of edges,  is a set of labels for the vertices and edges, and  is a function to assign the labels to the vertices and edges. We denote the set of all vertices in graph g as . A graph g is called a directed graph if the edges in g have directions; otherwise, g is called an undirected graph. A sequence of edges from  to  is called a path and a graph (for a directed graph, all edges in the graph are replaced with undirected edges.) and a path that exists between any two vertices is called connected.
Given graphs  and , , when an injective function  fulfills the following conditions,  is called a subgraph of g, which is denoted as :
-  if ; 
- ; 
- . 
In addition, when  iff , we call  an induced subgraph of g, which is denoted as . Although this paper focuses only on undirected and connected graphs, the methods proposed can easily be extended to directed or unconnected graphs.
The graph edit distance is the minimum number of edits required to transform a graph 
g to another graph 
, where an edit is either an insertion, deletion, or relabeling of a vertex or edge. In this paper, we assume that the cost of each edit is 1. The number of edits between graphs 
 and 
 is defined as follows:
      where we assume that 
 and 
V is extended to 
 [
23]. In addition, when 
g is transformed to 
, the mapping between 
V and 
 is expressed as a bijective mapping 
. 
 is the cost to edit the vertex 
 to 
 and 
 is the cost to edit edge 
 to 
. Editing 
 to 
 means performing a vertex insertion in 
. Therefore, the graph edit distance 
 between the graphs 
g and 
 is defined as follows:
      where 
 is a set of all possible 
-permutations of the integers 
. The solution to Equation (
2) is one of the quadratic assignment problems, which are known to be NP-complete.
Given a set of graphs 
, a query graph 
q, and a threshold 
 as input, the problem tackled in this paper is to output a set of graphs.
      
In Equation (
3), note that 
 is a subgraph of 
q but not an induced subgraph of 
q. When 
 is 0 in Equation (
3), the problem is equivalent to the original supergraph search problem. Therefore, our problem, represented by Equation (
3), is an extension of the supergraph search problem.
  3. Related Work
In 
Section 2, we defined the problem tackled in this paper. Recently, much attention has been focused on searching for graphs satisfying some conditions in a database consisting of multiple graphs. 
Table 1 summarizes graph search problems and the publications that tackle each problem. The subgraph search problem is to find graphs that contain the query graph and can be expressed as follows:
      whereas the supergraph search problem is to find graphs that the query graph contains and it can be expressed as follows.
      
These problems contain the subgraph isomorphism problem, which is known to be NP-complete. The similar graph search problem is to find graphs that are similar to the query graph, and can be expressed as
      
This problem contains the graph edit distance problem, which is known to be NP-complete. Various methods have been proposed to solve these problems. The problem tackled in this paper is to find graphs that are similar to subgraphs of the query graph and contains both the subgraph isomorphism problem and graph edit distance problem. This problem is a novel one, to the best of our knowledge. Since the subgraph isomorphism problem and graph edit distance problem are NP-complete, a naive method for solving either of the problems for a given query graph and for each graph in the database would be unrealistic.
Most of the methods proposed in the articles cited in 
Table 1 are based on filtering and verification and an index in each of the methods has a set of patterns 
P extracted from the database in advance. For example, for the supergraph search problem, in the filtering phase,
      
      is obtained. Subsequently, the verification phase checks whether 
, for each graph 
. 
Figure 2a shows four graphs in a database and two patterns contained in an index. The index has the information that 
, 
, 
, and 
. Given the query shown in 
Figure 2c, supergraph search methods first check whether 
 and 
 and then obtain 
 in the filtering phase. In the verification phase, they test whether 
 and 
. Since the number of vertices in 
 is less than one for 
, the computation time to check whether 
 becomes less than one for 
 and we do not need to check whether 
 in the filtering phase. The patterns in 
P are either paths, trees, cycles, or graphs extracted from databases. The patterns are extracted by the enumeration or mining method. The enumeration method enumerates all subgraphs in graphs in a database, where the subgraphs are restricted to either paths, trees, cycles, or graphs. In contrast, the mining method enumerates all frequent patterns [
24], which are defined as
      
      from graphs in a database, where 
 is defined as the number of graphs in 
G containing pattern 
p:
The extracted patterns are stored in a tree or lattice structure in the index of the graph database. Enumerating patterns and mining patterns from a database are time-consuming processes.
CodeTree [
17] is a method for the supergraph search problem. It does not distinguish between filtering and verification phases; it filters out graphs that are not contained in a given query graph while verifying unfiltered graphs. In addition, CodeTree does not enumerate or mine patterns in a database. All graphs in a database are stored in the form of one of the graph codes, such as AcGM [
33] code or DFS [
34] code, in the index in CodeTree. The AcGM and DFS codes are based on adjacency matrices and adjacency lists, which are alternatives to graph representations. The algorithms for constructing an index for a database and searching graphs in the index are independent of the graph codes and, thus, the method can be integrated and extended with various graph codes for representing graphs in accordance with the intended characteristics of those codes. For this reason, the method proposed in this paper is based on CodeTree.
In this paper, we focus on the problems in which each database consists of multiple graphs. Other subgraph search problems are to search for a database which consists of “a single large graph”. Since the number of graphs given as input is two, we need not distinguish the subgraph and supergraph searches. A typical problem of the graph search with a single large graph is to output all embeddings (or occurrences) 
 in a large graph 
g for a query 
q consisting of vertices 
 when the query 
q is a subgraph of the large graph 
g. This problem is applicable to RDFs (Resource Description Frameworks), where each subject or object and predicate corresponds to a vertex and edge in a graph, respectively; PPI (Protein-Protein Interaction) networks, where each protein and interaction correspond to a vertex and edge, respectively; and intrusion alert networks where each computer and possible attack such as Denial-of-Service and TCP Service Sweep correspond a vertex and edge, respectively. Some of the methods for solving these kinds of problems use not only exact matching [
35,
36,
37] but also approximate matching [
38,
39,
40,
41] with various similarity measures.
The most representative metric for measuring similarity between two graphs is the graph edit distance. Most of methods for graph search problems use this graph edit distance to measure similarities among graphs. Other most representative measures are to use maximum common subgraphs or graph kernels [
42,
43] between two graphs. There are two types of problems called maximum common induced subgraph problem [
44] and maximum common edge subgraph problem [
45] and the problems for finding the maximum common subgraphs in two graphs is NP-complete. In contrast, graph kernels have been proposed to classify graphs in the machine learning domain. The random walk graph kernel [
46] returns a high similarity of graphs if random walks on the graphs produce similar sequences of vertex and edge labels. The Weisfeiler–Lehman graph kernel [
47] uses the Weisfeiler–Lehman test, which was proposed to solve the graph isomorphism problem. The kernel iteratively relabels each vertex 
v by concatenating a label of 
v and labels of 
v’s adjacent vertices and returns high similarity for two relabeled graphs if the graphs have many common labels.
  4. Straightforward Method for Constrained Problem
In 
Section 2, we defined the problem tackled in this paper. Here, for the sake of simplicity, we first provide methods for outputting the following:
      by constraining the original problem. In the subsequent sections, we explain how the method can be extended to solve the original problem.
Definition 1 (AcGM code [
33]). 
Given a graph , we assign vertex identifiers  to vertices in g and assume that a subgraph of g induced by vertices  is connected. g is represented by an adjacency matrix of dimension  for which its elements  are  if  or 0 otherwise. The AcGM code of graph g is defined by concatenating the elements in the upper-right part of the adjacency matrix, as follows:where each code fragment  is defined as follows:  is called a prefix of Equation (
5). In addition, 
 means that a prefix of an AcGM code 
 is equal to another AcGM code 
 and a graph expressed by AcGM code 
c is denoted as 
.
      
Definition 2 (Linear ordering between AcGM codes). When two AcGM codes  and  fulfill either of the following conditions, we say that :
 where  is a linear ordering between code fragments [34]. Example 1. Given the graph with four vertices shown in Figure 3, we have  possible permutations of the four vertices. Among the permutations, according to Definition 1,  does not generate an AcGM code because a subgraph induced by  is not connected. We can generate 12 AcGM codes from the graph and we show three AcGM codes as follows. Symbols on the left side of the following adjacent matrices are vertex IDs and the letters on the upper side of the matrices are vertex labels. In addition, letters in the matrices are edge labels and zeros in the matrices and they indicate that there are no edges.![Algorithms 14 00225 i001 Algorithms 14 00225 i001]()
 By concatenating the letters and zeros on sides of arrows in the matrices, three AcGM codes are generated as follows. The order between the codes is .
 Lemma 1. If two AcGM codes are the same, the graphs represented by the codes are isomorphic.
 Proof.  If two adjacency matrices are the same, the graphs represented by the matrices are isomorphic. Since each AcGM code of a graph g corresponds one-to-one with an adjacency matrix of g, if two AcGM codes are the same, the graphs represented by the codes are isomorphic. □
 In contrast, if two graphs are isomorphic, their AcGM codes are not necessarily the same. This means that there are multiple AcGM codes of g that fulfill Definition 1. Thus, we denote the set of all AcGM codes of graph g as .
Lemma 2. When an AcGM code c is a prefix of another code ,  is connected and is an induced subgraph of .
 Proof.  Given an AcGM code 
c of a graph 
g, let its corresponding adjacency matrix be 
A. All submatrices 
 of 
A obtained by iteratively deleting the last row and column of 
A represent induced and connected subgraphs of 
g. According to Definition 
5, AcGM codes for the matrices 
 must be prefices of the AcGM code for 
A. □
 From the above discussion, one straightforward method for outputting the subgraphs defined by Equation (
4) is to compute the graph edit distance between a graph 
 and a graph 
, represented by a prefix 
 for 
. However, computing the edit distance between 
 and 
 by Equation (
2) is not feasible. Therefore, we discuss a method for computing the edit distance by using the codes of 
 and 
.
Lemma 3. Given the following AcGM codes:where a and b are the numbers of vertices in  and , respectively, and the i-th vertex in  maps to the i-th vertex in , the number of edits  to transform  to  is the following:where δ is the Kronecker delta and . The complexity of solving Equation (8) is . The  function is symmetric: .  Proof.  The first term of Equation (
8) is the sum of the numbers of edits that occur because of differences between the 
i-th (
) vertices in 
 and 
 and its second term is the sum of the numbers of edits that occur because of differences between edges 
 (
) in 
 and 
. Its third term is the sum of the numbers of edits to insert vertices in 
 or 
 and its fourth term is the sum of the numbers of edits to insert edges in 
 or 
. □
 Since Equation (
8) defines the number of edits between two graphs under the known mapping such that 
 for all 
, it corresponds to Equation (
1). Therefore, Equation (
2) is formalized by using Equation (
8) as the following lemmas.
Lemma 4. Given the two AcGM codes c and  defined in Equations (6) and (7), we have the following.  Proof.  According to Equation (
2), 
 is obvious. □
 Lemma 5. Given an arbitrary AcGM code  of a graph , the graph edit distance between g and  is the following.  Proof.  Let 
 be codes generated from all possible permutations of vertices in 
g. So 
. Some codes 
 in 
 are not AcGM codes because a prefix of the permutation from which 
 is generated may not induce a connected subgraph of 
g. Since 
 are codes generated from all possible permutations of vertices in 
g, we have 
. In addition, because 
, we have 
. Therefore,
        
        is satisfied. □
 Example 2. As shown in Figure 4, given two graphs g and q for which the edit distance is 3, let an arbitrary AcGM code of g be AAb.  is . When the code AAb is compared with , we need three edits which are indicated as × in Figure 4. By comparing AAb and all elements in , we obtain the graph edit distance between q and g according to Lemma 10.  From Lemma 5, we have the following corollary.
Collorary 1. If  for two AcGM codes c and , then .
 Algorithm 1 shows the pseudocode for solving the problem defined by Equation (
4). In Line 3, an arbitrary AcGM code from 
 is generated for graph 
 in database 
G. After one of all possible AcGM codes for 
q is obtained in Line 4, a prefix 
, which consists of 
j fragments 
, is generated in Line 6. Since 
 is a prefix of 
c, 
 must be a connected and is an induced subgraph of 
q. Subsequently, 
 is added to the set of output graphs 
S if 
, which indicates that 
 is a solution in the set defined by Equation (
4) according to Corollary 1. Once 
 is added to 
S, any procedures for 
 are stopped and procedures for the next graph 
 are started. These steps are repeated for all of the graphs in 
G. The time complexity of Algorithm 1 is 
 and the algorithm has some drawbacks, as follows.
- (1)
- Although the algorithm produces  -  prefixes in Line 6 for each AcGM code  c-  in  -  and computes Equation ( 8- )  -  times in Line 7 for the prefixes, some of the repeated calculations of Equation ( 8- ) are redundant because two prefixes  -  and  -  produced from  c-  satisfy  - . Using the result of computing  -  to compute  -  would render Algorithm 1 efficient. 
- (2)
- Let  -  be the set of codes produced in Line 3. AcGM codes  -  and  - , for which their prefixes are the same, are included in  - . The repeated calculations of Equation ( 8- ) between these AcGM codes and  -  are redundant. If the common prefix of  -  and  -  is  - , using the result of computing  -  to compute  -  and  -  would make Algorithm 1 efficient. 
| Algorithm 1: Straightforward Algorithm for Searching (4) | 
| ![Algorithms 14 00225 i002 Algorithms 14 00225 i002]() | 
  5. Method for Traversing Prefix Tree for Constrained Problem
In order to overcome drawback (1), described in the previous section, we introduce the following lemma.
Lemma 6. Given two AcGM codes c and , we have the following: In the case that i is greater than the number of fragments in c, we assume that  is equivalent to c itself.
 Proof.   is the sum of  and the number of edits between the -th fragments of  and . Since the number of edits is non-negative,  increases monotonically when i is increased. Therefore, for , we have . □
 Collorary 2. According to Lemma 6, if , then .
 As a consequence of Lemma 6 and Corollary 11, when , we can stop the computation in Line 7 of Algorithm 1.
Lemma 7. Given the two following fragments:for which their lengths are equal, the number of edits  to equalize the fragments is as follows: Since  is a symmetric function, .
 Proof.  Please see the proof for Lemma 3. □
 Algorithm 2 shows the pseudocode for solving the problem defined by (
4), which overcomes the aforementioned drawback (1). After obtaining the 
j-th fragments of AcGM codes 
 and 
c in Line 7 of the algorithm, the number of edits between the fragments is calculated according to Lemma 7. The number of edits is then added to the accumulated edit count 
. According to Corollary 2, the algorithm stops the calculation of the number of edits between 
 and one of the AcGM codes of 
q in Line 8. However, because 
 is a member of the set defined in Equation (
4) if the edited distance between 
 and at least one AcGM code in 
 is less than 
, these steps are repeated for all AcGM codes of 
q. The time complexity of Algorithm 2 is 
.
      
| Algorithm 2: Straightforward Algorithm 2 for Searching (4) | 
| ![Algorithms 14 00225 i003 Algorithms 14 00225 i003]() | 
In order to overcome drawback (2), described in the previous section, we use a prefix tree for the AcGM codes for G.
Definition 3 (Code Tree [
17] (The detailed algorithm for constructing a code tree from a graph database 
G is given in [
17].)). 
A code tree T is defined as the triplet , where  is the root node of the tree, N is a set of the nodes of the tree, and B is a set of the branches of the tree. In addition, each node is associated with a code fragment and a set of graph identifiers. Let  be the concatenation of the fragments associated with nodes on the path from the root to node n. When one of the AcGM codes of  is the same as , the set of graph identifiers for node n contains i, which is the identifier of graph . The code fragment and graph identifiers associated with node n are denoted as  and , respectively. For the root node,  and . We present an example of a code tree below.
Example 3. A graph database  consists of four graphs, as shown in Figure 5. When one of the AcGM codes for each graph in G is generated, the AcGM codes for the four graphs are as follows. From these AcGM codes, a code tree is constructed, as shown in the right part of Figure 5. Node  of the tree is associated with the code fragment  and the set of graph identifiers . The concatenation of fragments associated with nodes on the path from the root to node  is , which represents graph  in G.  Algorithm 3 shows the pseudocode for searching for members of the set defined in Equation (
4) by traversing the code tree. At node 
n, Algorithm 3 adds vertex 
w of 
q to 
, from which the fragment 
s is generated, with the addition of the prefix 
 of an AcGM code of 
q. The algorithm then recursively traverses the children of 
n that have fragments similar to fragment 
s. In contrast to Algorithm 2, which iteratively checks whether each individual graph in the graph database is a solution, Algorithm 3 checks whether multiple graphs in the database are simultaneously solutions. This is because each node in the code tree is associated with the common prefix of the AcGM codes of multiple graphs in the database. This simultaneous checking renders our search very efficient. The worst case time complexity of Algorithm 3 is the same as the time complexity of Algorithm 2. When AcGM codes of graphs in the database do not have common prefixes as one another, the time complexity of Algorithm 3 becomes worst. However, the codes usually have common prefixes as one another and the computation time that this algorithm needs is proportional to the number of nodes that Algorithm 3 traverses. In addition, at each node, Algorithm 3 needs 
.
      
| Algorithm 3: Code Tree Search for Finding Solutions to Equation (4) | 
| ![Algorithms 14 00225 i004 Algorithms 14 00225 i004]() | 
  6. Method for Traversing Prefix Tree for Original Problem
At node n, for which its depth is h in the code tree, Algorithm 3 retains the accumulated edit count  for two concatenations c and  of fragments of length h. c and  have the following restrictions:
- c is a concatenation of fragments associated with nodes on the path from the root of the code tree to n and is an AcGM code of a connected and induced graph that is a common subgraph of multiple graphs in the graph database; 
-  is a prefix of one of the AcGM codes in . According to the definition of the AcGM code,  must be connected and is an induced subgraph of q. 
In order to solve the original problem for searching for solutions to Equation (
3), we relax the restrictions as follows. The restriction that 
 is connected is a consequence of the fact that 
 is a prefix of the AcGM codes 
 of 
q. Therefore, in Line 4 of Algorithm 3., we relax this restriction by allowing 
 to be all possible 
h-permutations of 
 in 
q. Codes 
 by some of the permutations correspond to unconnected graphs.
Next, we relax the restriction that 
 is an induced subgraph of 
q. Algorithm 1 produces prefixes of the AcGM codes of 
q to obtain connected and induced subgraphs of 
q. Suppose that we would like to obtain, in addition to the connected and induced subgraphs, all possible unconnected and non-induced subgraphs of 
q. In this case, we may need, in addition to the prefixes of the AcGM codes of 
q, all possible codes generated by replacing some elements 
 in the prefixes by 0. This requires us to solve both the permutation problem among the vertices of 
q and the combination problem among the edges of 
q. However, we do not need to solve the latter problem for the following reasons. We consider a graph 
q, its induced subgraph 
, and two graphs 
 and 
, shown in 
Figure 6. We refer to two vertices in 
 as 
 and 
.
- . Given a graph  -  obtained by removing an edge from  - , we have  - . In the problem of finding solutions to Equation ( 3- ), we check whether there exists a subgraph of  q-  such that  -  and this subgraph need not be an induced subgraph of  q- . Therefore, in the case that there is an edge between two vertices  -  and  -  in  -  and there is also an edge in  -  between the two corresponding vertices, we do not need to consider the graph  -  obtained by removing an edge from  - . That is, we do not need to replace any elements  -  in prefixes of the AcGM codes of  -  by 0. 
- . For the above , we have . Therefore, in the case that there is an edge between two vertices  and  in  and there is no edge in  between the two corresponding vertices, we do not need to consider the graph that does not have an edge corresponding to edge  in . 
- In all other cases, there is no edge between  and  in . Since , we do not need to replace  by 0. 
From the above discussion, we rewrite Lemma 7 as follows.
Lemma 8. This function is asymmetric.  Since we consider all possible subgraphs of 
q by the function defined in Equation (
8), we need to produce only the prefixes underlined above.
Algorithm 4 shows the pseudocode for searching for members of the set defined in Equation (
3) by traversing the code tree. The part from Lines 4–9 searches 
, whereas the part from Lines 10–13 searches 
. The other difference between Algorithm 3 and Algorithm 4 is Line 4, which produces codes from one of all possible 
h-permutations of vertices of the query graph 
q. The codes produced are not limited to AcGM codes, but are AGM codes [
24]. In Lines 7 and 11, Equation (
15) is used to calculate the similarity between two fragments. Since the codes produced in Line 4 are limited to prefixes of the AGM codes of 
q, we do not need to solve the combination problem among the edges of 
q, which enables our proposed method to be very efficient. The time complexity of Algorithm 4 is the same as one of Algorithm 3; nevertheless, the problem that Algorithm 3 solves is a constrained problem of the problem that Algorithm 4 solves.
      
| Algorithm 4: Code Tree Search for Finding Solutions to Equation (3) | 
| ![Algorithms 14 00225 i005 Algorithms 14 00225 i005]() | 
  7. Customization of Solutions
In the previous section, we proposed a method for searching for the set of graphs 
S defined by Equation (
3). The number of graphs reachable by editing graph 
 times in Equation (
3) increases exponentially as 
 is increased. Therefore, for a large 
, the computation time required to search for the graphs in 
S becomes excessively long. However, some real applications need to obtain the following sets of constrained graphs rather than obtaining all the graphs in 
S.
- Editing graphs is constrained: for example, the relabeling of vertices or edges is admissible in  -  of Equation ( 3- ), whereas insertions and deletions are not admissible. That is, when converting a labeled graph  g-  to an unlabeled graph  -  by removing label information from vertices and edges in  g- ,  -  and  -  in Equation ( 3- ) are isomorphic. 
- Editing graphs is constrained: for example, editing some specific vertices and edges in a query graph  q-  is admissible, whereas editing other vertices and edges is not admissible. For example, in the substructure drawn with blue lines and labels in  Figure 1- , editing its ring structure is admissible, whereas editing the remainder of the blue substructure is not admissible. 
The problem for the above constraint 
 is formalized as follows.
      
In order to solve this problem, Equation (
15) is rewritten as follows.
      
For the case of 
, the first and second terms in Equation (
17) are the numbers of edits that occur by relabeling the 
i-th vertices and edges 
, respectively, in 
 and 
. The third term of Equation (
17) is the sum of the numbers of edits that occur by inserting or deleting edges. For this case, in containing such insertion or deletion operations, the 
 function returns values larger than 
 and Algorithm 4 (in Line 7) backtracks the code tree.
For the case of the above constraint 
, let some specific vertices in query graph 
q be 
. When Algorithm 4 visits node 
n in the code tree, it generates various codes by 
, in Line 4. Some vertices in 
 are included in 
, but others are not. In this case, Equation (
15) is rewritten as follows.
      
Thus, for the case of 
, the first and second terms of Equation (
18) are the numbers of edits that occur by relabeling the 
i-th vertices and edges 
, respectively, in 
 and 
, where 
 and 
 are elements in 
. The third term of Equation (
18) is the sum of the numbers of edits that occur by editing the other vertices and edges.
In order to summarize the above discussion, the advantages of our proposed method are the following:
Therefore, the method proposed in this paper is a general and flexible framework for searching for similar subgraphs of query graphs and is easily customizable for searching for the types of graphs desired by the user.
  8. Experimental Evaluation
For our experimental evaluation, we used the three real-world datasets used in [
18] which were downloadable from 
https://github.com/SNUCSE-CTA/IDAR on 1 March 2021. The datasets consisting of chemical compounds are called AIDS, NCI, and PubChem. Each atom, chemical bond, atom type, and bond type in a chemical compound corresponds to a vertex, edge, vertex label, and edge label, respectively. Sets of queries and databases containing graphs were generated from each of the datasets. One hundred query graphs, each containing more than 100 vertices, were selected from each of the datasets. Each generated database contained between 10,000 and 100,000 graphs and each of the graphs was generated by one of two different methods [
18]. In the first method, each graph in a database consisted of vertices and edges on a random walk on a graph randomly selected from one of the datasets. This type of database is called 
rand. In the second method, each graph in a database was a frequent subgraph with a minimum support value of 0.1% [
24] that is enumerated from one of the datasets. This type of database is called 
freq. In the experiments, vertices corresponding to hydrogen were removed, which is similar to the experiments in [
2,
14,
17,
30]. 
Table 2 summaries graphs in databses and queries in our experiments.
Since the problem of similar supergraph search is a novel problem proposed in this paper, there are no methods for solving this problem other than our proposed method to the best of our knowledge. Although we cannot compare our proposed method with other methods, our method is based on CodeTree (We use “CodeTree” to name the method proposed in [
17] and “code tree” to represent the index used in CodeTree). This is proposed in [
17] and our method with 
 is equivalent to CodeTree. Our proposed method was implemented in Java and the experiments reported in this section were conducted on a workstation with an AMD Ryzen Threadripper 3970X 3.7 GHz CPU and 64 GB main memory.
First, we conducted experiments for Algorithm 4 with Equation (
17). 
Figure 7, 
Figure 8 and 
Figure 9 show the average computation time (query processing time) 
t, the average number of nodes 
n in the code tree that our method traversed, and the average number of solutions 
, respectively, for various numbers of graphs 
 in a database and various thresholds 
. The computation time increased as the number of graphs in a database increased, as shown in 
Figure 7. Although the time complexity of Algorithms 1 and 2 is linearly related to the number of graphs in the database, the computation time of our method was sublinear for the number of graphs in the database because Algorithm 4 maintains the correspondence between a vertex of a query graph and vertices of multiple graphs in the database simultaneously. In contrast, the computation time increased exponentially as 
 was increased, as shown in 
Figure 7. This is because the number of graphs reachable by editing a graph 
 times increases exponentially. Although the computation time increased as the database size increased for 
freq databases, the rate of increase was less than for 
rand databases. This is because the numbers of nodes in code trees for 
freq databases are smaller than those for 
rand databases and the former numbers are relatively unchanged by the increase in database size, as shown in 
Figure 10. The reason why the numbers of nodes in code trees are largely unaffected by database size is that frequent subgraphs enumerated from the datasets are similar to each other and the AcGM codes of similar frequent subgraphs have common prefixes. This is because a subgraph of a frequent subgraph is also a frequent subgraph, which is a consequence of the anti-monotonic property of the support value [
24].
As shown in 
Figure 8, the number of nodes in the code tree that Algorithm 4 traversed also increased exponentially as 
 increased. The shapes of the curves plotted in 
Figure 8 are similar to those in 
Figure 7 and the trend for the numbers 
 of graphs in databases and thresholds 
 in 
Figure 8 are also similar to those in 
Figure 7. 
Figure 11 shows 
 for various numbers of graphs 
 in a database and various thresholds 
. The figure shows that 
 is almost constant and unaffected by the number of graphs in a database, which indicates that the computation time of our method is proportional to the number of nodes in the code tree that Algorithm 4 traverses.
As shown in 
Figure 9, the number of solutions found by Algorithm 4 was proportional to the number of graphs in the databases. Because there is a limit to the number of graphs in the database, the number of solutions did not increase exponentially as 
 increased. The rectangle in 
Figure 12 shows the space in which graphs in a database are distributed and each point in the rectangle shows a graph in the database. Points inside the circle are solutions for a given query 
q and threshold 
. If the number of graphs in database 
G is doubled to 
, such that 
, the number of points surrounded by the circle doubles, if the distribution is not changed such that 
. Thus, the number of solutions found by Algorithm 4 is proportional to the numbers of graphs in the databases.
Figure 13 shows 
 for various numbers of graphs 
 in a database and various thresholds 
; the figure indicates the average computation time to output each solution. As the numbers of graphs in the databases increased, 
 decreased exponentially. This is because 
 is proportional to the number of graphs in a database, whereas the overall computation time to search for solutions is sublinear.
 Figure 14 shows that average computation time and the number of nodes that Algorithm 4 traversed for various average numbers of vertices in graphs in a database and various thresholds 
. When the average number of vertices in graphs in a database is increased, the average computation time of Algorithm 4 is largely unaltered because the computation time of Algorithm 4 is proportional to the number of nodes that Algorithm 4 traverses, as mentioned after Algorithm 3. The number is also largely unaltered with respect to the change of the average number of the vertices. The reason why the number of nodes that Algorithm 4 traversed is largely unaltered is because of the use of the function 
 shown in Equation (
17). By using Equation (
17), the relabeling of vertices or edges is admissible, whereas insertions and deletions of the vertices and edges are not admissible. Therefore, nodes that Algorithm 4 traverses are restricted. As mentioned in 
Section 7, one of the advantages of our proposed method is the customizability for searching for the types of graphs desired by users. By customizing the search for the types of graphs, the users can reduce not only the number of solutions to be outputted but also computation time of our proposed method.
 Next, we conducted experiments for Algorithm 4 with Equations (
15) and (
17). 
Figure 15 shows the computation times for various numbers of graphs in the databases and various thresholds. The computation times with Equation (
17) were much shorter than those with Equation (
15). This is because the number of graphs reachable by editing a graph 
 times increases exponentially and editing graphs with Equation (
17) is limited to the relabeling of vertices and edges, which is faster than the insertion and deletion of vertices and edges that are required with Equation (
15).