# Similar Supergraph Search Based on Graph Edit Distance


## Abstract


## 1. Introduction

## 2. Problem Definition

- $(\varphi \left({v}_{1}\right),\varphi \left({v}_{2}\right))\in E$ if $({v}_{1},{v}_{2})\in {E}^{\prime}$;
- ${\ell}^{\prime}\left(v\right)=\ell \left(\varphi \left(v\right)\right)$;
- ${\ell}^{\prime}\left(({v}_{1},{v}_{2})\right)=\ell \left((\varphi \left({v}_{1}\right),\varphi \left({v}_{2}\right))\right)$.
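The three conditions above can be checked mechanically for a given vertex mapping $\varphi$. The following is a minimal sketch, not the authors' implementation: the `LabeledGraph` class and the function name are hypothetical, graphs are assumed to be undirected and loop-free, and $\varphi$ is additionally assumed to be injective (the usual requirement for subgraph isomorphism, which the list above does not restate).

```python
class LabeledGraph:
    """Undirected labeled graph (hypothetical helper, loop-free by assumption)."""

    def __init__(self, vertex_labels, edge_labels):
        self.vertex_labels = dict(vertex_labels)
        # Store each undirected edge under an order-independent key.
        self.edge_labels = {frozenset(e): lab for e, lab in edge_labels.items()}


def is_subgraph_embedding(sub, graph, phi):
    """Return True if phi maps `sub` into `graph` preserving edges and labels."""
    # phi must be defined on every vertex of `sub` and be injective (assumption).
    if set(phi) != set(sub.vertex_labels):
        return False
    if len(set(phi.values())) != len(phi):
        return False
    # Vertex labels must be preserved: l'(v) = l(phi(v)).
    for v, lab in sub.vertex_labels.items():
        if graph.vertex_labels.get(phi[v], object()) != lab:
            return False
    # Every edge of `sub` must map to an edge of `graph` with the same label.
    for e, lab in sub.edge_labels.items():
        u, v = tuple(e)
        image = frozenset((phi[u], phi[v]))
        if image not in graph.edge_labels or graph.edge_labels[image] != lab:
            return False
    return True
```

Note that this only verifies a given mapping; finding such a mapping is the hard part of the problem and is not attempted here.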

## 3. Related Work

## 4. Straightforward Method for Constrained Problem

**Definition 1.**

**Definition 2.**

- $\exists t \text{ s.t. } 1\le t\le \min(k,h)$, ${a}_{q}={b}_{q}$ for $q<t$ and ${b}_{t}{\prec}_{e}{a}_{t}$;
- $\beta \subseteq \alpha $;
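The two conditions above can be sketched as a comparison between two sequences $\alpha =({a}_{1},\dots ,{a}_{k})$ and $\beta =({b}_{1},\dots ,{b}_{h})$. This sketch rests on several assumptions: that the relation holds when either condition is true, that $\beta \subseteq \alpha$ means $\beta$ is a prefix of $\alpha$, and that the element order ${\prec}_{e}$ can be stood in for by Python's built-in `<`. The function name is hypothetical.

```python
def order_condition_holds(alpha, beta):
    """Return True if either listed condition holds for sequences alpha, beta."""
    # Second condition (assumed reading): beta is a prefix of alpha.
    if len(beta) <= len(alpha) and tuple(alpha[:len(beta)]) == tuple(beta):
        return True
    # First condition: at the first position t (t <= min(k, h)) where the
    # sequences differ, the element of beta precedes that of alpha.
    for a, b in zip(alpha, beta):
        if a != b:
            return b < a  # stand-in for the element order on fragments
    # No difference within the common length and beta is not a prefix of alpha.
    return False
```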

**Example 1.**

**Lemma 1.**

**Proof.**

**Lemma 2.**

**Proof.**

**Lemma 3.**

**Proof.**

**Lemma 4.**

**Proof.**

**Lemma 5.**

**Proof.**

**Example 2.**

**Corollary 1.**

- (1) Although the algorithm produces $\left|V({g}_{i})\right|$ prefixes in Line 6 for each AcGM code $c$ in $\mathsf{\Omega}\left(q\right)$ and evaluates Equation (8) $\left|V({g}_{i})\right|$ times in Line 7 for these prefixes, some of these evaluations are redundant because two prefixes ${c}^{\prime}$ and ${c}^{\prime \prime}$ produced from $c$ satisfy ${c}^{\prime}\subset {c}^{\prime \prime}$. Reusing the result of $ed({c}_{i},{c}^{\prime})$ when computing $ed({c}_{i},{c}^{\prime \prime})$ would make Algorithm 1 more efficient.
- (2) Let ${G}_{c}=\{code\left({g}_{i}\right)\mid {g}_{i}\in G\}$ be the set of codes produced in Line 3. ${G}_{c}$ may contain AcGM codes $code\left({g}_{i}\right)$ and $code\left({g}_{j}\right)$ that share a common prefix, in which case evaluating Equation (8) separately between each of these codes and ${c}^{\prime}$ repeats work. If the common prefix of $code\left({g}_{i}\right)$ and $code\left({g}_{j}\right)$ is ${c}_{s}$, reusing the result of $ed({c}_{s},{c}^{\prime})$ when computing $ed(code\left({g}_{i}\right),{c}^{\prime})$ and $ed(code\left({g}_{j}\right),{c}^{\prime})$ would make Algorithm 1 more efficient.
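Observation (2) suggests storing the codes in a prefix tree so that a shared prefix is represented, and therefore processed, only once. The following is a minimal sketch under stated assumptions: each code is modeled as a tuple of hashable tokens standing in for AcGM fragments, and `CodeTreeNode`, `build_code_tree`, and `count_nodes` are hypothetical names rather than the paper's data structures.

```python
class CodeTreeNode:
    """One node of a prefix tree over codes (hypothetical structure)."""

    def __init__(self):
        self.children = {}   # fragment -> child node
        self.graph_ids = []  # graphs whose complete code ends at this node


def build_code_tree(codes):
    """Insert every code (a sequence of fragments) into one prefix tree."""
    root = CodeTreeNode()
    for graph_id, code in codes.items():
        node = root
        for fragment in code:
            # Reuse the existing child when the prefix is already present.
            node = node.children.setdefault(fragment, CodeTreeNode())
        node.graph_ids.append(graph_id)
    return root


def count_nodes(node):
    """Count non-root nodes, i.e. distinct prefixes stored in the tree."""
    return sum(1 + count_nodes(child) for child in node.children.values())
```

For example, the codes `("a", "b", "c")`, `("a", "b", "d")`, and `("a", "e")` contain eight fragments in total but need only five tree nodes, so any per-prefix computation carried down the tree is shared by the graphs below that prefix.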

Algorithm 1: Straightforward Algorithm for Searching (4)

## 5. Method for Traversing Prefix Tree for Constrained Problem

**Lemma 6.**

**Proof.**

**Corollary 2.**

**Lemma 7.**

**Proof.**

Algorithm 2: Straightforward Algorithm 2 for Searching (4)

**Definition 3.**

**Example 3.**

Algorithm 3: Code Tree Search for Finding Solutions to Equation (4)

## 6. Method for Traversing Prefix Tree for Original Problem

- c is a concatenation of fragments associated with nodes on the path from the root of the code tree to n and is an AcGM code of a connected and induced graph that is a common subgraph of multiple graphs in the graph database;
- ${c}^{\prime}$ is a prefix of one of the AcGM codes in $\mathsf{\Omega}\left(q\right)$. According to the definition of the AcGM code, $g\left({c}^{\prime}\right)$ must be connected and is an induced subgraph of q.

- $ed({q}^{\prime},{g}_{1})=2$. Given a graph ${q}^{\prime \prime}$ obtained by removing an edge from ${q}^{\prime}$, we have $ed({q}^{\prime},{g}_{1})<ed({q}^{\prime \prime},{g}_{1})=3$. In the problem of finding solutions to Equation (3), we check whether there exists a subgraph ${q}^{\prime}$ of $q$ such that $ed({q}^{\prime},{g}_{i})\le \theta $, and this subgraph need not be an induced subgraph of $q$. Therefore, in the case that there is an edge between two vertices ${v}_{i}$ and ${v}_{j}$ in ${q}^{\prime}$ and there is also an edge in ${g}_{i}$ between the two corresponding vertices, we do not need to consider the graph ${q}^{\prime \prime}$ obtained by removing an edge from ${q}^{\prime}$. That is, we do not need to replace any elements ${x}_{i,j}$ in prefixes of the AcGM codes of ${q}^{\prime}$ by 0.
- $ed({q}^{\prime},{g}_{2})=1$. For the above ${q}^{\prime \prime}$, we have $ed({q}^{\prime},{g}_{2})>ed({q}^{\prime \prime},{g}_{2})=0$. Therefore, in the case that there is an edge between two vertices ${v}_{i}$ and ${v}_{j}$ in ${q}^{\prime}$ and there is no edge in ${g}_{i}$ between the two corresponding vertices, we do not need to consider the graph that does not have an edge corresponding to edge $({v}_{i},{v}_{j})$ in ${q}^{\prime}$.
- In all other cases, there is no edge between ${v}_{i}$ and ${v}_{j}$ in ${q}^{\prime}$. Since ${x}_{i,j}=0$, we do not need to replace ${x}_{i,j}$ by 0.
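One way to read the three cases is as a per-vertex-pair edge cost that never charges for an edge of ${q}^{\prime}$ that ${g}_{i}$ lacks, because dropping that edge still leaves a subgraph of $q$. The sketch below illustrates this reading only. It assumes unit edit costs and that the value 0 encodes an absent edge (as ${x}_{i,j}=0$ does in the text), and the function names are hypothetical. It does not reproduce the paper's actual cost equation.

```python
NO_EDGE = 0  # assumed encoding of "no edge", as with x_{i,j} = 0


def edge_cost_induced(x_q, x_g):
    """Constrained problem (assumed unit cost): any mismatch costs 1."""
    return 0 if x_q == x_g else 1


def edge_cost_relaxed(x_q, x_g):
    """Original problem: q' need not be an induced subgraph of q."""
    # Edge present in q' but absent in g_i: the edge can simply be left out
    # of q', so the pair contributes nothing.
    if x_q != NO_EDGE and x_g == NO_EDGE:
        return 0
    # Edges in both graphs (possibly with different labels), or an edge only
    # in g_i: same cost as in the constrained problem.
    return 0 if x_q == x_g else 1
```

Under this reading the relaxed cost is never larger than the constrained cost for the same pair of elements, which matches the two numeric comparisons given in the list.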

**Lemma 8.**

Algorithm 4: Code Tree Search for Finding Solutions to Equation (3)

## 7. Customization of Solutions

- $\langle 1\rangle $ Editing graphs is constrained: for example, the relabeling of vertices or edges is admissible in $ed({q}^{\prime},{g}_{i})$ of Equation (3), whereas insertions and deletions are not admissible. That is, when converting a labeled graph $g$ to an unlabeled graph $un\left(g\right)$ by removing label information from vertices and edges in $g$, $un\left({g}_{i}\right)$ and $un\left({q}^{\prime}\right)$ in Equation (3) are isomorphic.
- $\langle 2\rangle $ Editing graphs is constrained: for example, editing some specific vertices and edges in a query graph $q$ is admissible, whereas editing other vertices and edges is not admissible. For example, in the substructure drawn with blue lines and labels in Figure 1, editing its ring structure is admissible, whereas editing the remainder of the blue substructure is not admissible.

- Users can search for their desired type of similar graphs in a database containing graphs;
- This search can be realized by simply rewriting Equation (15), without the need to modify Algorithm 4.
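The idea of changing only the cost definition while leaving the search untouched can be illustrated with interchangeable cost functions. This is a minimal sketch for a fixed candidate ${q}^{\prime}$ and a fixed vertex correspondence. It assumes unit costs for admissible edits, uses an infinite cost to mark inadmissible edits, and assumes 0 encodes an absent edge. All names are hypothetical, and the paper's actual Equation (15) is not reproduced.

```python
import math

NO_EDGE = 0  # assumed encoding of "no edge"


def relabel_only_cost(x_q, x_g):
    """Constraint <1>: relabeling costs 1; insertion and deletion are barred."""
    if x_q == x_g:
        return 0
    if x_q == NO_EDGE or x_g == NO_EDGE:
        return math.inf  # structural change, hence inadmissible
    return 1


def protected_cost(x_q, x_g, editable):
    """Constraint <2>: only elements flagged as editable may be changed."""
    if x_q == x_g:
        return 0
    return 1 if editable else math.inf


def within_threshold(pairs, cost, theta):
    """Sum a pluggable cost over aligned element pairs and compare to theta."""
    return sum(cost(x_q, x_g) for x_q, x_g in pairs) <= theta
```

Because an inadmissible edit contributes an infinite cost, any alignment that needs one exceeds every finite threshold $\theta $, so the same threshold test serves all variants.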

## 8. Experimental Evaluation

## 9. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Shiraki, K. Characteristics of a candidate of an antiviral medication against COVID-19. Jpn. Med. J. **2020**, 5005, 25–31. (In Japanese)
2. Bonnici, V.; Ferro, A.; Giugno, R.; Pulvirenti, A.; Shasha, D.E. Enhancing Graph Database Indexing by Suffix Tree Structure. In Proceedings of the IAPR International Conference on Pattern Recognition in Bioinformatics, Nijmegen, The Netherlands, 22–24 September 2010; pp. 195–203.
3. Cheng, J.; Ke, Y.; Ng, W.; Lu, A. FG-Index: Towards Verification-Free Query Processing on Graph Databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 11–14 June 2007; pp. 857–872.
4. Cheng, J.; Ke, Y.; Ng, W. Efficient Query Processing on Graph Databases. ACM Trans. Database Syst. **2009**, 2, 48.
5. Klein, K.; Kriege, N.M.; Mutzel, P. CT-Index: Fingerprint-based Graph Indexing Combining Cycles and Trees. In Proceedings of the IEEE International Conference on Data Engineering, Hannover, Germany, 11–16 April 2011; pp. 1115–1126.
6. Shang, H.; Zhang, Y.; Lin, X.; Yu, J.X. Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism. Proc. VLDB Endow. **2008**, 1, 364–375.
7. Sun, S.; Luo, Q. Scaling Up Subgraph Query Processing with Efficient Subgraph Matching. In Proceedings of the IEEE International Conference on Data Engineering, Paris, France, 16–19 April 2019; pp. 220–231.
8. Williams, D.W.; Huan, J.; Wang, W. Graph Database Indexing Using Structured Graph Decomposition. In Proceedings of the IEEE International Conference on Data Engineering, Istanbul, Turkey, 17–20 April 2007; pp. 976–985.
9. Xie, Y.; Yu, P.S. CP-Index: On the Efficient Indexing of Large Graphs. In Proceedings of the ACM Conference on Information and Knowledge Management, Glasgow, UK, 24–28 October 2011; pp. 1795–1804.
10. Yan, X.; Yu, P.S.; Han, J. Graph Indexing: A Frequent Structure-based Approach. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; pp. 335–346.
11. Yuan, D.; Mitra, P. Lindex: A Lattice-based Index for Graph Databases. VLDB J. **2013**, 22, 229–252.
12. Zhang, S.; Hu, M.; Yang, J. TreePi: A Novel Graph Indexing Method. In Proceedings of the IEEE International Conference on Data Engineering, Istanbul, Turkey, 17–20 April 2007; pp. 966–975.
13. Zhao, P.; Yu, J.X.; Yu, P.S. Graph Indexing: Tree + Delta >= Graph. In Proceedings of the International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 938–949.
14. Zou, L.; Chen, L.; Yu, J.X.; Lu, Y. A Novel Spectral Coding in a Large Graph Database. In Proceedings of the International Conference on Extending Database Technology, Nantes, France, 25–29 March 2008; pp. 181–192.
15. Chen, C.; Yan, X.; Yu, P.S.; Han, J.; Zhang, D.; Gu, X. Towards Graph Containment Search and Indexing. In Proceedings of the International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 926–937.
16. Cheng, J.; Ke, Y.; Fu, A.W.; Yu, J.X. Fast Graph Query Processing with a Low-Cost Index. VLDB J. **2011**, 20, 521–539.
17. Imai, S.; Inokuchi, A. Efficient Supergraph Search Using Graph Coding. IEICE Trans. Inf. Syst. **2020**, 103-D, 130–141.
18. Kim, H.; Min, S.; Park, K.; Lin, X.; Hong, S.; Han, W. IDAR: Fast Supergraph Search Using DAG Integration. Proc. VLDB Endow. **2020**, 13, 1456–1468.
19. Lyu, B.; Qin, L.; Lin, X.; Chang, L.; Yu, J.X. Scalable Supergraph Search in Large Graph Databases. In Proceedings of the IEEE International Conference on Data Engineering, Helsinki, Finland, 16–20 May 2016; pp. 157–168.
20. Yuan, D.; Mitra, P.; Giles, C.L. Mining and Indexing Graphs for Supergraph Search. Proc. VLDB Endow. **2013**, 6, 829–840.
21. Zhang, S.; Li, J.; Gao, H.; Zou, Z. A Novel Approach for Efficient Supergraph Query Processing on Graph Databases. In Proceedings of the International Conference on Extending Database Technology, Saint-Petersburg, Russia, 24–26 March 2009; pp. 204–215.
22. Zhu, G.; Lin, X.; Zhang, W.; Wang, W.; Shang, H. PrefIndex: An Efficient Supergraph Containment Search Technique. In Proceedings of the International Conference on Scientific and Statistical Database Management, Heidelberg, Germany, 30 June–2 July 2010; pp. 360–378.
23. Riesen, K. Structural Pattern Recognition with Graph Edit Distance—Approximation Algorithms and Applications. In Advances in Computer Vision and Pattern Recognition; Springer: Berlin, Germany, 2015.
24. Inokuchi, A.; Washio, T.; Motoda, H. An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Lyon, France, 13–16 September 2000; pp. 13–23.
25. Chang, L.; Feng, X.; Lin, X.; Qin, L.; Zhang, W.; Ouyang, D. Speeding Up GED Verification for Graph Similarity Search. In Proceedings of the IEEE International Conference on Data Engineering, Dallas, TX, USA, 20–24 April 2020; pp. 793–804.
26. Gouda, K.; Hassaan, M. CS_GED: An Efficient Approach for Graph Edit Similarity Computation. In Proceedings of the IEEE International Conference on Data Engineering, Helsinki, Finland, 16–20 May 2016; pp. 265–276.
27. Kim, J.; Choi, D.; Li, C. Inves: Incremental Partitioning-based Verification for Graph Similarity Search. In Proceedings of the International Conference on Extending Database Technology, Lisbon, Portugal, 26–29 March 2019; pp. 229–240.
28. Liang, Y.; Zhao, P. Similarity Search in Graph Databases: A Multi-Layered Indexing Approach. In Proceedings of the IEEE International Conference on Data Engineering, San Diego, CA, USA, 19–22 April 2017; pp. 783–794.
29. Wang, X.; Ding, X.; Tung, A.K.H.; Ying, S.; Jin, H. An Efficient Graph Indexing Method. In Proceedings of the IEEE International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; pp. 210–222.
30. Zhao, X.; Xiao, C.; Lin, X.; Wang, W.; Ishikawa, Y. Efficient Processing of Graph Similarity Queries with Edit Distance Constraints. VLDB J. **2013**, 22, 727–752.
31. Zhao, X.; Xiao, C.; Lin, X.; Zhang, W.; Wang, Y. Efficient Structure Similarity Searches: A Partition-based Approach. VLDB J. **2018**, 27, 53–78.
32. Zheng, W.; Zou, L.; Lian, X.; Wang, D.; Zhao, D. Efficient Graph Similarity Search Over Large Graph Databases. IEEE Trans. Knowl. Data Eng. **2015**, 27, 964–978.
33. Inokuchi, A.; Washio, T.; Nishimura, Y.; Motoda, H. A Fast Algorithm for Mining Frequent Connected Subgraphs; IBM Research: Yorktown Heights, NY, USA, 2002.
34. Yan, X.; Han, J. gSpan: Graph-Based Substructure Pattern Mining. In Proceedings of the IEEE International Conference on Data Mining, Maebashi City, Japan, 9–12 December 2002; pp. 721–724.
35. Bi, F.; Chang, L.; Lin, X.; Qin, L.; Zhang, W. Efficient Subgraph Matching by Postponing Cartesian Products. In Proceedings of the International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1199–1214.
36. Sun, Z.; Wang, H.; Wang, H.; Shao, B.; Li, J. Efficient Subgraph Matching on Billion Node Graphs. Proc. VLDB Endow. **2012**, 5, 788–799.
37. Zhang, S.; Li, S.; Yang, J. GADDI: Distance Index based Subgraph Matching in Biological Networks. In Proceedings of the International Conference on Extending Database Technology, Saint Petersburg, Russia, 24–26 March 2009; pp. 192–203.
38. Khan, A.; Li, N.; Yan, X.; Guan, Z.; Chakraborty, S.; Tao, S. Neighborhood based Fast Graph Search in Large Networks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12–16 June 2011; pp. 901–912.
39. Khan, A.; Wu, Y.; Aggarwal, C.C.; Yan, X. NeMa: Fast Graph Search with Label Similarity. Proc. VLDB Endow. **2013**, 181–192.
40. Tian, Y.; McEachin, R.C.; Santos, C.; States, D.J.; Patel, J.M. SAGA: A Subgraph Matching Tool for Biological Graphs. Bioinformatics **2007**, 23, 232–239.
41. Zhang, S.; Yang, J.; Jin, W. SAPPER: Subgraph Indexing and Approximate Matching in Large Graphs. Proc. VLDB Endow. **2010**, 3, 1185–1194.
42. Borgwardt, K.M.; Ghisu, M.E.; Llinares-López, F.; O’Bray, L.; Rieck, B. Graph Kernels: State-of-the-Art and Future Challenges. Found. Trends Mach. Learn. **2020**, 13, 531–712.
43. Wang, X.; Smalter, A.M.; Huan, J.; Lushington, G.H. G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases. In Proceedings of the International Conference on Extending Database Technology, Nantes, France, 25–29 March 2008; pp. 472–480.
44. Raymond, J.W.; Willett, P. Maximum Common Subgraph Isomorphism Algorithms for the Matching of Chemical Structures. J. Comput. Aided Mol. Des. **2002**, 16, 521–533.
45. Bahiense, L.; Manic, G.; Piva, B.; de Souza, C.C. The Maximum Common Edge Subgraph Problem: A Polyhedral Investigation. Discret. Appl. Math. **2012**, 160, 2523–2541.
46. Kashima, H.; Tsuda, K.; Inokuchi, A. Marginalized Kernels Between Labeled Graphs. In Proceedings of the International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003; pp. 321–328.
47. Shervashidze, N.; Schweitzer, P.; van Leeuwen, E.J.; Mehlhorn, K.; Borgwardt, K.M. Weisfeiler-Lehman Graph Kernels. J. Mach. Learn. Res. **2011**, 12, 2539–2561.

**Figure 7.** Average computation times $t$ for various numbers of graphs $\left|G\right|$ in a database and various thresholds $\theta $.

**Figure 8.** Average numbers of nodes $n$ in the code tree that our method traversed for various numbers of graphs $\left|G\right|$ in a database and various thresholds $\theta $.

**Figure 9.** Number of solutions $\left|S\right|$ for various numbers of graphs $\left|G\right|$ in a database and various thresholds $\theta $.

**Figure 11.** Computation time per node $t/n$ for various numbers of graphs $\left|G\right|$ in a database and various thresholds $\theta $.

**Figure 13.** Computation time per solution $t/\left|S\right|$ for various numbers of graphs $\left|G\right|$ in a database and various thresholds $\theta $.

**Figure 14.** Average computation time and the average number of nodes that Algorithm 4 traversed for various numbers of vertices in graphs in a database and various thresholds $\theta $.

| | Complete Matching | Similar Matching |
|---|---|---|
| graph search | $\{{g}_{i}\in G\mid q={g}_{i}\}$ | $\{{g}_{i}\in G\mid ed({g}_{i},q)\le \theta \}$ [25,26,27,28,29,30,31,32] |
| supergraph search | $\{{g}_{i}\in G\mid {g}_{i}\subseteq q\}$ [15,16,17,18,19,20,21,22] | $\{{g}_{i}\in G\mid \exists {q}^{\prime}\subseteq q \text{ s.t. } ed({g}_{i},{q}^{\prime})\le \theta \}$ This paper (novel problem). |
| subgraph search | $\{{g}_{i}\in G\mid q\subseteq {g}_{i}\}$ [2,3,4,5,6,7,8,9,10,11,12,13,14] | $\{{g}_{i}\in G\mid \exists g\subseteq {g}_{i} \text{ s.t. } ed(g,q)\le \theta \}$ No papers, to the best of our knowledge. |

| | AIDS (Rand) | AIDS (Freq) | NCI (Rand) | NCI (Freq) | PubChem (Rand) | PubChem (Freq) |
|---|---|---|---|---|---|---|
| # of vertex labels | 31 | 10 | 42 | 15 | 16 | 9 |
| # of edge labels | 3 | 3 | 3 | 3 | 3 | 2 |
| **Graphs in databases** | | | | | | |
| # of graphs | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 |
| avg. # of vertices | 29.0 | 21.4 | 28.2 | 16.9 | 27.6 | 24.1 |
| max. # of vertices | 100 | 26 | 79 | 27 | 84 | 32 |
| min. # of vertices | 1 | 10 | 1 | 5 | 1 | 11 |
| avg. # of edges | 30.0 | 20.4 | 29.2 | 16.1 | 28.2 | 23.1 |

| Query graphs | AIDS | NCI | PubChem |
|---|---|---|---|
| # of graphs | 100 | 100 | 100 |
| avg. # of vertices | 71.0 | 63.9 | 64.9 |
| max. # of vertices | 222 | 132 | 175 |
| min. # of vertices | 42 | 40 | 34 |
| avg. # of edges | 74.9 | 68.9 | 67.8 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yamada, M.; Inokuchi, A. Similar Supergraph Search Based on Graph Edit Distance. *Algorithms* **2021**, *14*, 225.
https://doi.org/10.3390/a14080225
