# Optimized Distributed Subgraph Matching Algorithm Based on Partition Replication


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Problem Definition

**Definition 1 (Labeled graph).** A labeled graph is a quadruple $G(V,E,L,F)$, where $V$ is a set of vertices, $E$ is a set of edges $e({u}_{i},{u}_{j})$, $L$ is a set of labels on vertices and edges, and $F:V\cup E\to L$ is a labeling function that assigns a label to each vertex and edge.
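As a minimal sketch of Definition 1, the quadruple can be held in two dictionaries that together play the role of $F$. The class name `LabeledGraph`, its methods, and the choice of undirected edges stored as `frozenset` pairs are assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Illustrative container for G(V, E, L, F); edges are assumed undirected."""
    vertex_labels: dict = field(default_factory=dict)  # v -> F(v)
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> F(e)

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # An edge may only connect vertices that already carry a label.
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be labeled vertices")
        self.edge_labels[frozenset((u, v))] = label

    def labels(self):
        """L: every label used on a vertex or an edge."""
        return set(self.vertex_labels.values()) | set(self.edge_labels.values())


g = LabeledGraph()
g.add_vertex("u1", "A")
g.add_vertex("u2", "B")
g.add_edge("u1", "u2", "x")
```

Keeping vertex and edge labels in separate maps makes the later label-based checks simple dictionary lookups.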

**Definition 2 (Subgraph).** Given graphs ${G}_{1}<{V}_{1},{E}_{1},{L}_{1},{F}_{1}>$ and ${G}_{2}<{V}_{2},{E}_{2},{L}_{2},{F}_{2}>$, graph ${G}_{1}$ is a subgraph of graph ${G}_{2}$ if and only if:

- (1) ${V}_{1}\subseteq {V}_{2}$, ${E}_{1}\subseteq {E}_{2}$, ${L}_{1}\subseteq {L}_{2}$;
- (2) $\forall v\in {V}_{1}$, ${F}_{1}\left(v\right)={F}_{2}\left(v\right)$;
- (3) $\forall e({v}_{1},{v}_{2})\in {E}_{1}$, ${F}_{1}\left(e\right)={F}_{2}\left(e\right)$.
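The three conditions of Definition 2 can be checked directly. The sketch below assumes each graph is given as two dictionaries (vertex to label, and undirected edge as a `frozenset` to label) and that the label set $L$ is the set of labels actually used; the function name `is_subgraph` is hypothetical.

```python
def is_subgraph(v1, e1, v2, e2):
    """Check Definition 2 for G1 = (v1, e1) against G2 = (v2, e2).

    v*: dict vertex -> label; e*: dict frozenset({u, v}) -> label.
    """
    labels1 = set(v1.values()) | set(e1.values())
    labels2 = set(v2.values()) | set(e2.values())
    # Condition (1): containment of vertices, edges and labels.
    if not (v1.keys() <= v2.keys() and e1.keys() <= e2.keys() and labels1 <= labels2):
        return False
    # Condition (2): vertex labels agree.
    if any(v1[v] != v2[v] for v in v1):
        return False
    # Condition (3): edge labels agree.
    return all(e1[e] == e2[e] for e in e1)
```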

**Definition 3 (Subgraph matching).** Given a query graph ${G}_{q}<{V}_{q},{E}_{q},{L}_{q},{F}_{q}>$ and a graph database $D=\{{G}_{1},{G}_{2},\cdots ,{G}_{n}\}$, the subgraph matching problem (also called the subgraph isomorphism problem) is to find all data graphs in $D$, or subgraphs of them, that are isomorphic to the query graph ${G}_{q}$.

**Definition 4 (Graph isomorphism).** Given graphs ${G}_{1}<{V}_{1},{E}_{1},{L}_{1},{F}_{1}>$ and ${G}_{2}<{V}_{2},{E}_{2},{L}_{2},{F}_{2}>$, ${G}_{1}$ and ${G}_{2}$ are isomorphic if and only if there is an injective function $f:{V}_{1}\to {V}_{2}$ such that the following conditions hold:

- (1) $\forall v\in {V}_{1}$, ${F}_{1}\left(v\right)={F}_{2}\left(f\left(v\right)\right)$;
- (2) $\forall {e}_{1}({v}_{1},{v}_{2})\in {E}_{1}$, $\exists {e}_{2}(f\left({v}_{1}\right),f\left({v}_{2}\right))\in {E}_{2}$, ${F}_{1}\left({e}_{1}\right)={F}_{2}\left({e}_{2}\right)$.
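To make the two conditions of Definition 4 concrete, the sketch below tests a candidate mapping $f$ and, for small graphs only, enumerates all mappings by brute force. It is not the PR-Match algorithm; the names `is_match` and `find_matches`, the dictionary-based graph representation, and the assumption of undirected edges without self-loops are all illustrative choices.

```python
from itertools import permutations


def is_match(f, v1, e1, v2, e2):
    """Return True if f (dict V1 -> V2) is injective and satisfies (1) and (2)."""
    if set(f) != set(v1) or len(set(f.values())) != len(f):
        return False
    # Condition (1): each vertex keeps its label under f.
    if any(f[v] not in v2 or v1[v] != v2[f[v]] for v in v1):
        return False
    # Condition (2): each edge maps to an edge with the same label.
    for edge, label in e1.items():
        a, b = edge  # edges are 2-element frozensets (no self-loops assumed)
        image = frozenset((f[a], f[b]))
        if image not in e2 or e2[image] != label:
            return False
    return True


def find_matches(v1, e1, v2, e2):
    """Exhaustive search; exponential, so only suitable for tiny examples."""
    query_vertices = list(v1)
    matches = []
    for image in permutations(v2, len(query_vertices)):
        f = dict(zip(query_vertices, image))
        if is_match(f, v1, e1, v2, e2):
            matches.append(f)
    return matches
```

The exhaustive search is only meant to show what a correct answer looks like; practical matchers prune candidates long before enumerating full mappings.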

## 4. Proposed PR-Match Algorithm

#### 4.1. Graph Data Partition

**Definition 5 (Distributed graph).** A distributed graph $G<V,E,L,F>$ consists of a set of partitions $F=\{{F}_{1},{F}_{2},\cdots ,{F}_{k}\}$, where each ${F}_{i}$ is specified by $<{V}_{c}^{i}\cup {V}_{e}^{i},{E}_{c}^{i}\cup {E}_{e}^{i},{L}_{i},{F}_{i}>$ $(i\in \{1,2,\cdots ,k\})$ such that:

- (1) $\{{V}_{c}^{1},{V}_{c}^{2},\cdots ,{V}_{c}^{k}\}$ is a partition of $V$, that is, $\forall i,j\in \{1,2,\cdots ,k\}$ with $i\ne j$, ${V}_{c}^{i}\cap {V}_{c}^{j}=\varnothing $ and $\bigcup_{i\in \{1,2,\cdots ,k\}}{V}_{c}^{i}=V$; the vertices in ${V}_{c}^{i}$ are called the core vertices of ${F}_{i}$;
- (2) $e({v}_{1},{v}_{2})\in {E}_{c}^{i}$, where ${v}_{1}\in {V}_{c}^{i}$ and ${v}_{2}\in {V}_{c}^{i}$; the edges in ${E}_{c}^{i}$ are called the core edges of ${F}_{i}$;
- (3) ${E}_{e}^{i}$ is the set of crossing edges between ${F}_{i}$ and other partitions; the edges in ${E}_{e}^{i}$ are called the extended edges of ${F}_{i}$;
- (4) for every $e({v}_{1},{v}_{2})\in {E}_{e}^{i}$, where ${v}_{1}\in {V}_{c}^{i}$, ${v}_{2}\in {V}_{c}^{j}$ and $i\ne j$, the endpoint ${v}_{2}$ belongs to ${V}_{e}^{i}$; the vertices in ${V}_{e}^{i}$ are called the extended vertices of ${F}_{i}$.
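The four sets in Definition 5 follow mechanically once every vertex has been assigned to a partition. The sketch below assumes such an assignment is already available as a dictionary `owner` (how the assignment is chosen is the job of the partitioning algorithm and is not modeled here); the function name and the dictionary layout are hypothetical, and labels are omitted for brevity.

```python
def build_partitions(edges, owner):
    """Derive core/extended vertices and edges for each partition.

    edges: iterable of (u, v) pairs; owner: dict vertex -> partition id.
    """
    parts = {}
    for v, i in owner.items():
        part = parts.setdefault(i, {
            "core_vertices": set(), "extended_vertices": set(),
            "core_edges": set(), "extended_edges": set(),
        })
        part["core_vertices"].add(v)
    for u, v in edges:
        i, j = owner[u], owner[v]
        if i == j:
            # Both endpoints are core vertices of the same partition.
            parts[i]["core_edges"].add((u, v))
        else:
            # A crossing edge is replicated on both sides, together with
            # the remote endpoint as an extended vertex.
            parts[i]["extended_edges"].add((u, v))
            parts[i]["extended_vertices"].add(v)
            parts[j]["extended_edges"].add((u, v))
            parts[j]["extended_vertices"].add(u)
    return parts
```

Replicating each crossing edge on both partitions is what lets a partition evaluate matches that reach one hop beyond its own core vertices without contacting another machine.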

Algorithm 1: Graph Data Partition Algorithm

#### 4.2. Query Decomposition

**Definition 6 (Hop number).** Given a graph $G<V,E,L,F>$, the hop number between vertex ${v}_{i}$ and vertex ${v}_{j}$, denoted $hop({v}_{i},{v}_{j})$, is the minimum distance between ${v}_{i}$ and ${v}_{j}$ in the graph. Similarly, the hop number between a vertex $v$ and an edge $e({v}_{i},{v}_{j})$, denoted $hop(v,e({v}_{i},{v}_{j}))$, is $min(hop(v,{v}_{i}),hop(v,{v}_{j}))+1$, that is, the minimum number of edges crossed for $v$ to reach $e$, counting $e$ itself.
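Both hop numbers in Definition 6 can be computed with a breadth-first search. The sketch assumes an unweighted graph given as an adjacency dictionary; the function names `hop` and `hop_to_edge` and the use of `None` for unreachable vertices are illustrative choices.

```python
from collections import deque


def hop(adj, src, dst):
    """Minimum number of edges between src and dst, or None if unreachable."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        v, dist = queue.popleft()
        for w in adj.get(v, ()):
            if w == dst:
                return dist + 1
            if w not in seen:
                seen.add(w)
                queue.append((w, dist + 1))
    return None


def hop_to_edge(adj, v, edge):
    """hop(v, e(vi, vj)) = min(hop(v, vi), hop(v, vj)) + 1."""
    vi, vj = edge
    dists = [d for d in (hop(adj, v, vi), hop(adj, v, vj)) if d is not None]
    return min(dists) + 1 if dists else None
```

For the path a, b, c, d, `hop(adj, "a", "d")` is 3 and `hop_to_edge(adj, "a", ("a", "b"))` is 1, since the edge is incident to the starting vertex.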

**Definition 7 (Star graph).** A graph $G<V,E,L,F>$ is called a star graph if and only if:

- (1) $\exists {v}_{0}\in V$ such that $\forall v\in V$ with $v\ne {v}_{0}$, $hop({v}_{0},v)=1$; ${v}_{0}$ is called the center of graph $G$;
- (2) $\forall e\in E$, $hop({v}_{0},e)=1$, where ${v}_{0}$ is the center of graph $G$.
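Definition 7 reduces to two simple tests: the center is adjacent to every other vertex, and every edge is incident to the center (which is what $hop({v}_{0},e)=1$ means). The sketch below searches for such a center; the function name `star_center` and the edge-list representation are assumptions for illustration.

```python
def star_center(vertices, edges):
    """Return a center vertex if the graph is a star graph, else None.

    vertices: iterable of vertex ids; edges: iterable of (u, v) pairs.
    """
    vertex_set = set(vertices)
    edge_list = list(edges)
    for v0 in vertex_set:
        # Vertices reachable from v0 in exactly one hop.
        neighbors = {w for e in edge_list if v0 in e for w in e if w != v0}
        all_adjacent = neighbors == vertex_set - {v0}          # condition (1)
        all_incident = all(v0 in e for e in edge_list)         # condition (2)
        if all_adjacent and all_incident:
            return v0
    return None
```

A triangle fails condition (2): whichever vertex is tried as center, the edge between the other two is two hops away.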

**Theorem 1.**

**Proof.**

Algorithm 2: Query Graph Decomposition Algorithm

#### 4.3. Subquery Matching

**Definition 8 (Neighbor label signature).** The neighbor label signature of vertex $v$, denoted $Sig\left(v\right)$, is the tuple $<{P}_{n}\left(v\right),{P}_{e}\left(v\right)>$, where ${P}_{n}\left(v\right)$ is the multiset of labels of all neighbor vertices of $v$ and ${P}_{e}\left(v\right)$ is the multiset of labels of the edges between $v$ and its neighbors, that is:

- (1) $I\in {P}_{n}\left(v\right)\Rightarrow \exists v'\in N\left(v\right),I={L}_{v}\left(v'\right)$;
- (2) $I\in {P}_{e}\left(v\right)\Rightarrow \exists v'\in N\left(v\right),e(v,v')\in E,I={L}_{e}\left(e\right)$.
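Since ${P}_{n}$ and ${P}_{e}$ are multisets, Python's `collections.Counter` is a natural fit. The sketch below assumes undirected edges without self-loops, stored as `frozenset` keys; the function name `neighbor_label_signature` is hypothetical.

```python
from collections import Counter


def neighbor_label_signature(v, vertex_labels, edge_labels):
    """Return (P_n(v), P_e(v)) as two Counters (multisets of labels).

    vertex_labels: dict vertex -> label
    edge_labels:   dict frozenset({u, w}) -> label
    """
    p_n, p_e = Counter(), Counter()
    for edge, label in edge_labels.items():
        if v in edge:
            (other,) = edge - {v}  # the neighbor on this edge (no self-loops assumed)
            p_n[vertex_labels[other]] += 1
            p_e[label] += 1
    return p_n, p_e
```

Counters keep multiplicities, so two neighbors with the same label are counted twice rather than collapsed into one entry.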

**Theorem 2.**

**Definition 9 (Label code).** Given a label $l$ and $m$ non-negative hash functions, the label code of $l$, denoted $Encode\left(l\right)$, is a binary string $I$ of length $K$, where $I$ is initialized to 0 and each of its bits satisfies the following formula: where $I\left[j\right]$ represents the value of the $j$-th bit in the binary string $I$.

**Definition 10 (Vertex code).** Given a vertex $v$ whose neighbor label signature is $Sig\left(v\right)=<{P}_{n}\left(v\right),{P}_{e}\left(v\right)>$, the vertex code of $v$ is denoted by $Encode\left(v\right)=p\diamond q$, where $p$ is a counting string of all labels encoded in ${P}_{n}\left(v\right)$ and $q$ is a counting string of all labels encoded in ${P}_{e}\left(v\right)$. Here $\diamond$ is the join operation for counting strings, and $|Encode(v)|=2k$, that is:

**Theorem 3.**

Algorithm 3: Subquery Matching

#### 4.4. Intermediate Result Merge

**Definition 11 (Merge plan).** Let the decomposition of the query graph $Q$ be $T=\{{q}_{1},{q}_{2},\cdots ,{q}_{n}\}$ and let its matching results on all cluster nodes be $M=\{{M}_{1},{M}_{2},\cdots ,{M}_{n}\}$. Then $\mathsf{\Omega}={M}_{{s}_{1}}\bowtie {M}_{{s}_{2}}\bowtie \cdots \bowtie {M}_{{s}_{n}}$ represents a merge plan for the subquery matching results, where ${M}_{{s}_{i}}\in M$ and $\bowtie$ denotes the merge operation. For $i>1$, the star graph corresponding to ${M}_{{s}_{i}}$ shares vertices with a subquery graph that precedes ${M}_{{s}_{i}}$ in the merge plan sequence.
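The ordering constraint in Definition 11 (each star must share a vertex with some earlier subquery in the sequence) is easy to check. In the sketch below, `order` is a candidate sequence of subquery names and `star_vertices` maps each name to the vertex set of its star graph; both names and the function `is_valid_merge_plan` are hypothetical.

```python
def is_valid_merge_plan(order, star_vertices):
    """Check that `order` covers every subquery exactly once and that each
    star after the first intersects the vertices already merged."""
    if sorted(order) != sorted(star_vertices):
        return False
    merged_vertices = set()
    for position, name in enumerate(order):
        vertices = star_vertices[name]
        if position > 0 and not (vertices & merged_vertices):
            return False  # would require a Cartesian product
        merged_vertices |= vertices
    return True
```

With stars `q1 = {a, b, c}`, `q2 = {c, d}` and `q3 = {d, e}`, the order `q1, q2, q3` is valid, whereas `q1, q3, q2` is not, because `q3` shares no vertex with `q1`.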

**Definition 12 (Merge cost).** Suppose a graph database $D$ is stored on $m$ machines, the query graph $Q$ is decomposed into $n$ star query graphs, and the matching result of subquery ${q}_{i}$ on node $k$ is $P{M}_{k}^{i}$. Then the merge cost of the merge plan $\mathsf{\Omega}$ is:

**Definition 13 (Optimal merge plan).** Given the matching results of all subqueries on the partitions, a merge plan $\mathsf{\Omega}$ is the optimal merge plan if and only if, for any merge plan $\mathsf{\Omega}'$, $Cost\left(\mathsf{\Omega}\right)\le Cost\left(\mathsf{\Omega}'\right)$.

**Definition 14 (Prediction merge cost).** Suppose a graph database $D$ is stored on $m$ machines, the decomposition of the query graph $Q$ is $T=\{{q}_{1},{q}_{2},\cdots ,{q}_{n}\}$, its matching results on all cluster nodes are $M=\{{M}_{1},{M}_{2},\cdots ,{M}_{n}\}$, and $\mathsf{\Omega}={M}_{{s}_{1}}\bowtie {M}_{{s}_{2}}\bowtie \cdots \bowtie {M}_{{s}_{n}}$ represents a merge plan for the subquery matching results. The prediction merge cost of the merge plan $\mathsf{\Omega}$ is then defined as follows:

- (1) The prediction merge cost of the matching result ${M}_{{s}_{i}}$ and the matching result ${M}_{{s}_{j}}$ is:$$P\text{-}Cost({M}_{{s}_{i}}\bowtie {M}_{{s}_{j}})=\left({\sum}_{k=1}^{m}(|P{M}_{k}^{{s}_{i}}|+1)\right)\times \left({\sum}_{k=1}^{m}(|P{M}_{k}^{{s}_{j}}|+1)\right)$$
- (2) The prediction merge cost of merging operation ${O}_{i}$ and matching result ${M}_{{s}_{i+1}}$ is:$$P\text{-}Cost({O}_{i}\bowtie {M}_{{s}_{i+1}})=P\text{-}Cost({O}_{i})\times \left({\sum}_{k=1}^{m}(|P{M}_{k}^{{s}_{i+1}}|+1)\right)\times {\left(\frac{1}{2}\right)}^{\alpha}$$
- (3) The prediction merge cost of merge plan $\mathsf{\Omega}$ is:$$P\text{-}Cost\left(\mathsf{\Omega}\right)={\sum}_{i=1}^{n-1}P\text{-}Cost({O}_{i}\bowtie {M}_{{s}_{i+1}})$$
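A rough sketch of how the three rules of Definition 14 might be evaluated for one plan is given below. It rests on several assumptions that this section does not state: the first step ${O}_{1}\bowtie {M}_{{s}_{2}}$ is costed with rule (1), every later step reuses the previous step's cost as $P\text{-}Cost({O}_{i})$ in rule (2), and the exponent $\alpha$ is passed in as a single caller-supplied parameter. The function names and the `match_counts` layout (subquery name mapped to a list of per-machine match counts) are hypothetical.

```python
def operand_size(per_machine_counts):
    """Sum over machines of (|PM_k| + 1) for one subquery."""
    return sum(count + 1 for count in per_machine_counts)


def predicted_plan_cost(plan, match_counts, alpha):
    """Predicted cost of merging the subqueries in the order given by `plan`.

    plan:         list of subquery names, e.g. ["q1", "q2", "q3"]
    match_counts: dict name -> list of match counts, one entry per machine
    alpha:        exponent of the 1/2 factor in rule (2) (assumed constant)
    """
    sizes = [operand_size(match_counts[name]) for name in plan]
    if len(sizes) < 2:
        return 0.0
    step_cost = float(sizes[0] * sizes[1])       # rule (1)
    total = step_cost
    for size in sizes[2:]:
        step_cost = step_cost * size * 0.5 ** alpha  # rule (2)
        total += step_cost                            # rule (3)
    return total
```

For two machines with counts `q1 = [2, 1]`, `q2 = [0, 3]`, `q3 = [1, 1]` and `alpha = 1`, the operand sizes are 5, 5 and 4, so the plan `q1, q2, q3` is estimated at 25 + 25 × 4 × 0.5 = 75.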

Algorithm 4: Subquery Matching Result Merge

input: optimal merge plan $\mathsf{\Omega}={M}_{{s}_{1}}\bowtie {M}_{{s}_{2}}\bowtie \cdots \bowtie {M}_{{s}_{n}}$, the matching results of all subqueries on all partitions $PM=\{\cup P{M}_{i}^{{s}_{1}},\cup P{M}_{i}^{{s}_{2}},\cdots ,\cup P{M}_{i}^{{s}_{n}}\}$

output: all the matching subgraphs $MG$ of the original query graph on graph database $D$

1. $MG\leftarrow \varnothing ,M\leftarrow \varnothing ,\mathsf{curDepth}\leftarrow 1,\mathsf{maxDepth}\leftarrow n+1$;
2. Call recursiveJoin (curDepth, maxDepth, M, $MG$);
3. return $MG$;
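The driver in Algorithm 4 can be illustrated with a depth-first join, shown below as a sketch only. The body of the join routine is an assumption: it treats each partial match as a dictionary from query vertices to data vertices, merges two matches when they agree on shared query vertices, and rejects merges that would map two query vertices to the same data vertex. The extra `plan` and `matches` arguments and the helper `merge` are hypothetical additions for self-containment.

```python
def merge(partial, candidate):
    """Combine two partial matches, or return None if they conflict."""
    merged = dict(partial)
    used = set(partial.values())
    for q_vertex, d_vertex in candidate.items():
        if q_vertex in merged:
            if merged[q_vertex] != d_vertex:
                return None      # disagreement on a shared query vertex
        elif d_vertex in used:
            return None          # would break injectivity
        else:
            merged[q_vertex] = d_vertex
            used.add(d_vertex)
    return merged


def recursive_join(cur_depth, max_depth, partial, results, plan, matches):
    """Extend `partial` with one match per subquery, following `plan`."""
    if cur_depth == max_depth:
        results.append(dict(partial))
        return
    for candidate in matches[plan[cur_depth - 1]]:
        merged = merge(partial, candidate)
        if merged is not None:
            recursive_join(cur_depth + 1, max_depth, merged, results, plan, matches)


def merge_all(plan, matches):
    """Steps 1 to 3 of the driver: initialise, join recursively, return MG."""
    results = []
    recursive_join(1, len(plan) + 1, {}, results, plan, matches)
    return results
```

Setting the maximum depth to $n+1$ means the recursion bottoms out exactly after one match from each of the $n$ subqueries has been merged.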

Algorithm 5: Merge Subroutine recursiveJoin

## 5. Experiments

- (1) The subgraph matching experiments on small graph sets use the real AIDS dataset and a synthetic dataset generated by GraphGen;
- (2)

#### 5.1. Subgraph Matching on Small Graphs

#### 5.2. Subgraph Matching on a Single Large Graph

#### 5.2.1. Path Query

#### 5.2.2. Clique Query

#### 5.2.3. Random Query

#### 5.3. Scalability Test of PR-Match Algorithm

#### 5.3.1. Data Size

#### 5.3.2. Average Vertex Degree

#### 5.4. Experiment Summary

## 6. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yuan, L.; Bin, J.; Pan, P.
Optimized Distributed Subgraph Matching Algorithm Based on Partition Replication. *Electronics* **2020**, *9*, 184.
https://doi.org/10.3390/electronics9010184
