Accelerating Subgraph Matching Through Advanced Compression and Label Filtering
Abstract
1. Introduction
- (1) We propose a strict compressed graph node (CGN) technique to compress both the data graph and the query graph. Starting from an initial node, nodes that belong to the same equivalence class are merged into a single node while the original graph structure is preserved. This compresses the graph as far as the equivalence relation allows and reduces the data graph to a smaller-scale representation.
- (2) We introduce a label count-based filtering (LCF) algorithm. Existing filtering methods exclude some nodes that do not meet the query conditions, but they often leave many redundant validations for the subsequent verification phase, resulting in high memory overhead and long validation times. By adding filtering conditions based on the label characteristics of the dataset's nodes during the filtering phase, our method further reduces the search space and thereby improves overall query processing performance.
- (3) For large-scale graphs, we propose an adaptive tuning model that leverages caching to improve efficiency. This model accelerates subgraph matching queries by reusing results from previous queries or by assessing the overall graph structure through a subset of frequently accessed (hot) nodes, forming an adaptive subgraph matching (ASM) framework that adjusts dynamically to dataset characteristics.
- (4) We conduct extensive experiments on multiple real-world datasets to compare ASM with three other subgraph matching algorithms, evaluating execution time, callback counts, and average search time. The results show that ASM outperforms the compared algorithms in query processing time.
2. Background
2.1. Preliminaries
2.2. Related Work
- Direct-enumeration framework [28]. These algorithms do not build auxiliary index structures in a preprocessing step. Instead, they prune candidate vertices with filtering strategies during enumeration and access the original data graph directly to match the vertices and edges of the query graph.
- Indexing-enumeration framework [29]. These algorithms preprocess the data graph and the given query graph and build an auxiliary index structure [30] that maintains the candidate vertices and the edges between them. During enumeration, they match the vertices and edges of the query graph against this index rather than the original data graph, which avoids many invalid accesses [31]. Offline-index methods, represented by GADDI and SPath, usually build indexes over the data without communicating with real-time data sources, which facilitates filtering the candidate sets of query variables when the query is executed. Online-index methods, represented by CFL [32] and CECI [33], instead need to build and update indexes dynamically on real-time or near-real-time data streams.
- Selection of the root query node. The choice of the starting node strongly affects subgraph matching [36]. A well-chosen starting node excludes unmatched nodes early and thus reduces the number of extension validations. A poorly chosen one may cause a large amount of redundant enumeration, so that the final correct result is obtained only after many extension verifications. In this paper, the ranking rule is Rank(u) = freq(g, L(u))/d(u), where freq(g, L) denotes the number of nodes labeled L in the graph g, and d(u) denotes the degree of node u, i.e., the number of edges incident to u.
- Determining the matching (visit) order. The matching order usually gives priority to nodes with a smaller number of candidate nodes, i.e., nodes with smaller degrees. This reduces the search space and the size of the intermediate result set [37]. In addition, if a wrong candidate node is found during traversal, backtracking to explore the next node incurs a lower trial-and-error cost.
- Generating the query tree. A BFS traversal [38] of the query graph starting from the root query node creates the query tree. Edges of the query graph that appear in the query tree are called tree edges (TEs); edges of the query graph that are not in the BFS tree are called non-tree edges (NTEs). BFS is used because some existing studies show that it minimizes the diameter of the search space [39].
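The root selection and query tree steps can be illustrated with a minimal Python sketch. It assumes undirected graphs stored as adjacency dictionaries and assumes that the root is the query node with the smallest Rank(u); the text gives the ranking formula but does not state the selection direction. The function names `select_root` and `build_query_tree` are hypothetical.

```python
from collections import Counter, deque


def select_root(q_adj, q_labels, g_labels):
    """Return the query node with the smallest Rank(u) = freq(g, L(u)) / d(u).

    Assumption: a smaller rank is better (rare label, high degree).
    """
    freq = Counter(g_labels.values())  # freq(g, L) for every label L

    def rank(u):
        degree = len(q_adj[u])
        # Isolated query nodes get an infinite rank so they are never chosen.
        return freq[q_labels[u]] / degree if degree else float("inf")

    return min(sorted(q_adj), key=rank)


def build_query_tree(q_adj, root):
    """BFS from the root; split query edges into tree and non-tree edges."""
    parent = {root: None}
    order = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(q_adj[u]):
            if v not in parent:
                parent[v] = u
                order.append(v)
                queue.append(v)
    tree_edges = {frozenset((v, p)) for v, p in parent.items() if p is not None}
    all_edges = {frozenset((u, v)) for u in q_adj for v in q_adj[u]}
    return order, tree_edges, all_edges - tree_edges


# Toy query graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
q_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
q_labels = {0: "A", 1: "B", 2: "C", 3: "A"}
# Only the labels of the toy data graph are needed for freq(g, L).
g_labels = {10: "A", 11: "A", 12: "A", 13: "B", 14: "B", 15: "C"}

root = select_root(q_adj, q_labels, g_labels)
order, tree_edges, non_tree_edges = build_query_tree(q_adj, root)
print(root, order, non_tree_edges)
```

In this toy example, node 2 has the rarest label and the highest degree, so it becomes the root, and the edge between nodes 0 and 1 is the only non-tree edge.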
3. Adaptive Subgraph Matching Architecture
3.1. Compressed Graph Nodes (CGNs) Algorithm
Algorithm 1 Compressed graph nodes (CGNs).
Require: Data graph g, query graph q
Ensure: Compressed graph g′, q′
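The compression idea can be sketched in Python under a stated assumption: the equivalence relation is not spelled out here, so the sketch treats two nodes as equivalent when they carry the same label and have exactly the same neighbor set. The function name `compress_graph` and the choice of the smallest node id as class representative are illustrative, not the authors' implementation.

```python
from collections import defaultdict


def compress_graph(adj, labels):
    """Merge nodes with equal label and equal neighbor set into one node.

    Assumption: this label-plus-neighborhood equivalence stands in for the
    paper's equivalence relation, which is not defined in this section.
    Returns the compressed adjacency, the compressed labels, and the members
    of every equivalence class keyed by its representative.
    """
    groups = defaultdict(list)
    for v in sorted(adj):
        groups[(labels[v], frozenset(adj[v]))].append(v)

    rep_of = {}   # original node -> representative of its class
    members = {}  # representative -> all original nodes in the class
    for group in groups.values():
        rep = group[0]  # smallest id, because nodes were visited in order
        members[rep] = group
        for v in group:
            rep_of[v] = rep

    # All members share one neighbor set, so the representative's edges
    # describe the whole class.
    c_adj = {rep: {rep_of[w] for w in adj[rep]} for rep in members}
    c_labels = {rep: labels[rep] for rep in members}
    return c_adj, c_labels, members


# Toy star graph: center 0, three leaves labeled "B", one leaf labeled "C".
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
labels = {0: "A", 1: "B", 2: "B", 3: "B", 4: "C"}
c_adj, c_labels, members = compress_graph(adj, labels)
print(c_adj, members)
```

Keeping the `members` map makes it possible to expand a match found on the compressed graph back into matches on the original graph.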
3.2. Efficiency Filtering Mechanisms
Algorithm 2 Extract candidates algorithm.
Require: Data graph g, query graph q
Ensure: Candidate set C
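A label count-based filter can be sketched as follows. The exact LCF conditions are not listed in this section, so the sketch assumes a common formulation: a data vertex v is a candidate for a query vertex u if the labels match, d(v) ≥ d(u), and for every label the number of neighbors of v with that label is at least the number of neighbors of u with that label. The names `neighbor_label_counts` and `extract_candidates` are hypothetical.

```python
from collections import Counter


def neighbor_label_counts(adj, labels, v):
    """Count how many neighbors of v carry each label."""
    return Counter(labels[w] for w in adj[v])


def extract_candidates(q_adj, q_labels, g_adj, g_labels):
    """Build the candidate set C_u for every query vertex u.

    Assumed filter: same label, sufficient degree, and sufficient
    per-label neighbor counts.
    """
    g_counts = {v: neighbor_label_counts(g_adj, g_labels, v) for v in g_adj}
    candidates = {}
    for u in q_adj:
        need = neighbor_label_counts(q_adj, q_labels, u)
        candidates[u] = [
            v
            for v in sorted(g_adj)
            if g_labels[v] == q_labels[u]
            and len(g_adj[v]) >= len(q_adj[u])
            # A Counter returns 0 for labels that v's neighbors lack.
            and all(g_counts[v][lab] >= n for lab, n in need.items())
        ]
    return candidates


# Toy query: an "A" node with two "B" neighbors.
q_adj = {0: {1, 2}, 1: {0}, 2: {0}}
q_labels = {0: "A", 1: "B", 2: "B"}
# Toy data graph: the path 11 - 10 - 12 - 13 - 14.
g_adj = {10: {11, 12}, 11: {10}, 12: {10, 13}, 13: {12, 14}, 14: {13}}
g_labels = {10: "A", 11: "B", 12: "B", 13: "A", 14: "C"}

print(extract_candidates(q_adj, q_labels, g_adj, g_labels))
```

Vertex 13 has the right label and degree for query vertex 0 but only one "B" neighbor, so the label count condition removes it before enumeration starts.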
3.3. Adaptive Subgraph Matching (ASM) Algorithm
Algorithm 3 Adaptive subgraph matching (ASM).
Require: Data graph g, query graph q
Ensure: Subgraph isomorphism of all query graphs in a data graph
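The reuse of results from previous queries, one part of the adaptive model described in the contributions, can be illustrated with a small cache sketch. Everything here is an assumption made for illustration: the class `QueryCache`, the key function `query_key`, the least-recently-used eviction policy, and the capacity are not taken from the paper. The key only recognizes queries that repeat with identical node ids; it is not a canonical form under node renaming.

```python
from collections import OrderedDict


def query_key(q_adj, q_labels):
    """Build a hashable key from the labeled nodes and the undirected edges."""
    nodes = tuple(sorted((u, q_labels[u]) for u in q_adj))
    edges = tuple(sorted({tuple(sorted((u, v))) for u in q_adj for v in q_adj[u]}))
    return nodes, edges


class QueryCache:
    """Least-recently-used cache of query results (illustrative only)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, q_adj, q_labels, compute):
        key = query_key(q_adj, q_labels)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        result = compute()
        self._store[key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used
        return result


cache = QueryCache(capacity=2)
q_adj = {0: {1}, 1: {0}}
q_labels = {0: "A", 1: "B"}
first = cache.get_or_compute(q_adj, q_labels, lambda: ["placeholder result"])
second = cache.get_or_compute(q_adj, q_labels, lambda: ["never computed"])
print(first is second, cache.hits, cache.misses)
```

The `compute` callable stands in for a full matching run, so a repeated query skips filtering and enumeration entirely.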
4. Experiments
4.1. Experimental Setting
4.1.1. Experimental Environment
4.1.2. Datasets
- DBLP Dataset: A dataset of scholarly literature in computer science, containing metadata on a large number of journal and conference papers in computer science, information technology, and related fields. The dataset contains 317,080 nodes and 1,049,866 edges.
- YouTube Dataset: A dataset containing user behavior and video information. A relationship graph is built from the videos that users upload and recommend on YouTube; it is used to find content uploaded or recommended by other users with similar preferences, which is then pushed to the user. This dataset contains 1,134,890 nodes and 2,987,624 edges.
- Human Dataset: A graph dataset describing human protein interactions, where each node represents a protein entity and edges represent interactions between proteins. The dataset contains 4674 nodes and 86,292 edges.
4.1.3. Evaluation Metrics
- ASM: The adaptive subgraph matching algorithm proposed in this paper, which combines the compressed graph node (CGN) algorithm and the label filtering mechanism.
- CFL [32]: One of the most advanced algorithms. It reduces the search space through two phases, join filtering and locally sensitive filtering, and uses a tree-structured index and a path-based ordering strategy.
- GraphQL [13]: Employs a neighborhood signature filter and a left-deep join ordering strategy; the method models the query as a left-deep join tree whose leaf nodes are the sets of candidate vertices.
- CECI [33]: One of the state-of-the-art algorithms. It builds a compact embedding cluster index and orders the tree-edge and non-tree-edge candidate nodes using forward BFS traversal filtering and a reverse BFS refinement process.

Compared with the other three state-of-the-art algorithms, the proposed method demonstrates clear improvements in execution time, the number of algorithm callbacks, and the average query time during the matching process. For each dataset, results are reported for query graphs with 3 to 7 nodes.

- Execution time: the average time to process a query graph in the query set. It mainly comprises node filtering, indexing, and enumeration time, and it excludes the time for loading data from disk.
- Algorithm efficiency: measured by the average number of callbacks and the average search time. During execution, the main time overhead lies in the recursive callbacks of the algorithm, such as depth-first search and node order matching. The fewer the callbacks or the shorter the search time, the more efficient the algorithm.
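How the callback count and search time can be collected is sketched below with a plain backtracking enumerator. This is not the ASM algorithm: it is a generic matcher, written under the assumption that one "callback" corresponds to one invocation of the recursive extension function. The function name `count_embeddings` and the `stats` dictionary are hypothetical.

```python
import time


def count_embeddings(q_adj, q_labels, g_adj, g_labels, order):
    """Enumerate embeddings by backtracking; record callbacks and time.

    Assumption: each call to `extend` counts as one callback.
    """
    stats = {"callbacks": 0, "matches": 0}
    mapping = {}  # query vertex -> data vertex
    used = set()  # data vertices already in the partial match

    def extend(depth):
        stats["callbacks"] += 1
        if depth == len(order):
            stats["matches"] += 1
            return
        u = order[depth]
        for v in sorted(g_adj):
            if v in used or g_labels[v] != q_labels[u]:
                continue
            # Every already-matched neighbor of u must map to a neighbor of v.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                used.add(v)
                extend(depth + 1)
                used.discard(v)
                del mapping[u]

    start = time.perf_counter()
    extend(0)
    stats["seconds"] = time.perf_counter() - start
    return stats


# Toy query: an "A" node with two "B" neighbors.
q_adj = {0: {1, 2}, 1: {0}, 2: {0}}
q_labels = {0: "A", 1: "B", 2: "B"}
# Toy data graph: the path 11 - 10 - 12 - 13 - 14.
g_adj = {10: {11, 12}, 11: {10}, 12: {10, 13}, 13: {12, 14}, 14: {13}}
g_labels = {10: "A", 11: "B", 12: "B", 13: "A", 14: "C"}

stats = count_embeddings(q_adj, q_labels, g_adj, g_labels, order=[0, 1, 2])
print(stats["matches"], stats["callbacks"])
```

Averaging `callbacks` and `seconds` over a query set would give the two efficiency metrics; filtering out unpromising candidates before enumeration lowers the callback count because fewer dead-end branches are entered.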
4.2. Evaluations on Execution Time
4.3. Relative Performance Comparison Evaluation
4.3.1. Evaluations on the Counts of Callbacks
4.3.2. Evaluations on the Average Callbacks Time
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Guo, M.; Chi, C.H.; Zheng, H.; He, J.; Zhang, X. A subgraph isomorphism-based attack towards social networks. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Melbourne, Australia, 14–17 December 2021; pp. 520–528.
- Sahoo, T.R.; Patra, S.; Vipsita, S. Decision tree classifier based on topological characteristics of subgraph for the mining of protein complexes from large scale PPI networks. Comput. Biol. Chem. 2023, 106, 107935.
- Xu, Q.; Wang, X.; Li, J.; Gan, Y.; Chai, L.; Wang, J. StarMR: An efficient star-decomposition based query processor for SPARQL basic graph patterns using MapReduce. In Proceedings of the Web and Big Data: Second International Joint Conference, APWeb-WAIM 2018, Macau, China, 23–25 July 2018; Proceedings, Part I 2. Springer: Berlin/Heidelberg, Germany, 2018; pp. 415–430.
- Kim, H.; Choi, Y.; Park, K.; Lin, X.; Hong, S.H.; Han, W.S. Fast subgraph query processing and subgraph matching via static and dynamic equivalences. VLDB J. 2023, 32, 343–368.
- Hartmanis, J. Computers and intractability: A guide to the theory of NP-completeness (Michael R. Garey and David S. Johnson). SIAM Rev. 1982, 24, 90.
- Sun, Y.; Li, G.; Du, J.; Ning, B.; Chen, H. A subgraph matching algorithm based on subgraph index for knowledge graph. Front. Comput. Sci. 2022, 16, 163606.
- Ba, L.; Liang, P.; Gu, J. Subgraph Matching Algorithm Based on Preprocessing-enumeration. Comput. Technol. Dev. 2023, 33, 85–91.
- Zeng, L.; Jiang, Y.; Lu, W.; Zou, L. Deep analysis on subgraph isomorphism. arXiv 2020, arXiv:2012.06802.
- Choi, Y.; Park, K.; Kim, H. BICE: Exploring Compact Search Space by Using Bipartite Matching and Cell-Wide Verification. Proc. VLDB Endow. 2023, 16, 2186–2198.
- Ullmann, J.R. An algorithm for subgraph isomorphism. J. ACM (JACM) 1976, 23, 31–42.
- Cordella, L.P.; Foggia, P.; Sansone, C.; Vento, M. A (sub) graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2004, 26, 1367–1372.
- Shang, H.; Zhang, Y.; Lin, X.; Yu, J.X. Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow. 2008, 1, 364–375.
- He, H.; Singh, A.K. Graphs-at-a-time: Query language and access methods for graph databases. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 9–12 June 2008; pp. 405–418.
- Zhao, P.; Han, J. On graph query optimization in large networks. Proc. VLDB Endow. 2010, 3, 340–351.
- Han, W.S.; Lee, J.; Lee, J.H. Turboiso: Towards ultrafast and robust subgraph isomorphism search in large graph databases. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 337–348.
- Sun, S.; Sun, X.; Che, Y.; Luo, Q.; He, B. Rapidmatch: A holistic approach to subgraph query processing. Proc. VLDB Endow. 2020, 14, 176–188.
- Chang, J.S.; Luo, Y.F.; Su, K.Y. GPSM: A generalized probabilistic semantic model for ambiguity resolution. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, Newark, DE, USA, 28 June–2 July 1992; pp. 177–184.
- Sun, S.; Luo, Q. In-memory subgraph matching: An in-depth study. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 1083–1098.
- Dann, J.; Götz, T.; Ritter, D.; Giceva, J.; Fröning, H. GraphMatch: Subgraph Query Processing on FPGAs. arXiv 2024, arXiv:2402.17559.
- Kamada, T.; Kawai, S. An algorithm for drawing general undirected graphs. Inf. Process. Lett. 1989, 31, 7–15.
- Yang, D.; Ge, Y.; Nguyen, T.; Molitor, D.; Moorman, J.D.; Bertozzi, A.L. Structural equivalence in subgraph matching. IEEE Trans. Netw. Sci. Eng. 2023, 10, 1846–1862.
- Sun, X.; Sun, S.; Luo, Q.; He, B. An in-depth study of continuous subgraph matching. Proc. VLDB Endow. 2022, 15, 1403–1416.
- Yiu, M.L.; Papadias, D.; Mamoulis, N.; Tao, Y. Reverse nearest neighbors in large graphs. IEEE Trans. Knowl. Data Eng. 2006, 18, 540–553.
- Wang, X.; Chen, W.; Yang, Y.; Zhang, X.; Feng, Z. Research on Knowledge Graph Partitioning Algorithms: A Survey. Chin. J. Comput. 2021, 44, 235–260.
- Liu, G.; Inae, E.; Zhao, T.; Xu, J.; Luo, T.; Jiang, M. Data-centric learning from unlabeled graphs with diffusion model. Adv. Neural Inf. Process. Syst. 2024, 36, 21039–21057.
- Lan, Z.; Yu, L.; Yuan, L.; Wu, Z.; Niu, Q.; Ma, F. Sub-gmn: The neural subgraph matching network model. In Proceedings of the 2023 16th IEEE International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Taizhou, China, 28–30 October 2023; pp. 1–7.
- Jian, X.; Li, Z.; Chen, L. Suff: Accelerating subgraph matching with historical data. Proc. VLDB Endow. 2023, 16, 1699–1711.
- Borgwardt, S.; Viss, C. A polyhedral model for enumeration and optimization over the set of circuits. Discret. Appl. Math. 2022, 308, 68–83.
- He, J.; Chen, Y.; Liu, Z.; Li, D. Optimizing subgraph retrieval and matching with an efficient indexing scheme. Knowl. Inf. Syst. 2024, 66, 6815–6843.
- Sun, Z.; Zhou, X.; Li, G. Learned index: A comprehensive experimental evaluation. Proc. VLDB Endow. 2023, 16, 1992–2004.
- Yu, M.M.; Chen, L.H. Productivity change of airlines: A global total factor productivity index with network structure. J. Air Transp. Manag. 2023, 109, 102403.
- Gaihre, A.; Wu, Z.; Yao, F.; Liu, H. XBFS: eXploring runtime optimizations for breadth-first search on GPUs. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, Phoenix, AZ, USA, 22–26 June 2019; pp. 121–131.
- Bhattarai, B.; Liu, H.; Huang, H.H. Ceci: Compact embedding cluster index for scalable subgraph matching. In Proceedings of the 2019 International Conference on Management of Data, Amsterdam, The Netherlands, 30 June–5 July 2019; pp. 1447–1462.
- Chen, K.; Liu, S.; Zhu, T.; Qiao, J.; Su, Y.; Tian, Y.; Zheng, T.; Zhang, H.; Feng, Z.; Ye, J.; et al. Improving expressivity of gnns with subgraph-specific factor embedded normalization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 237–249.
- Liu, T.; Li, D. Endgraph: An efficient distributed graph preprocessing system. In Proceedings of the 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), Bologna, Italy, 10–13 July 2022; pp. 111–121.
- Turner, M.; Berthold, T.; Besançon, M.; Koch, T. Cutting plane selection with analytic centers and multiregression. In Proceedings of the International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, Nice, France, 29 May–1 June 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 52–68.
- Hu, M.; Zhou, Y. Dynamic type matching. Manuf. Serv. Oper. Manag. 2022, 24, 125–142.
- Bi, F.; Chang, L.; Lin, X.; Qin, L.; Zhang, W. Efficient subgraph matching by postponing cartesian products. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1199–1214.
- Levinas, I.; Scherz, R.; Louzoun, Y. BFS-based distributed algorithm for parallel local-directed subgraph enumeration. J. Complex Netw. 2022, 10, cnac051.
- Ren, X.; Wang, J. Exploiting vertex relationships in speeding up subgraph isomorphism over large graphs. Proc. VLDB Endow. 2015, 8, 617–628.
- Qin, Y.; Wang, X.; Hao, W.; Liu, P.; Song, Y.; Zhang, Q. OntoCA: Ontology-Aware Caching for Distributed Subgraph Matching. In Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Nanjing, China, 25–27 August 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 527–535.
- Leskovec, J.; Krevl, A. SNAP Datasets: Stanford Large Network Dataset Collection. 2014. Available online: http://snap.stanford.edu/data (accessed on 10 April 2025).
Community | Model | Category | Methodology | Algorithms/System |
---|---|---|---|---|
Dataset | Exploration | Backtracking search | Direct enumeration | Ullmann [10], VF2 [11], GraphQL [13] |
Dataset | Exploration | Backtracking search | Offline-index | GADDI, SPath [14] |
Dataset | Exploration | Backtracking search | Online-index enumeration | CFL, CECI, DPISO |
Dataset | Join | Multi-way join | Pair-wise join | PostgreSQL, Neo4j, GPSM [17] |
Dataset | Join | Multi-way join | Worst-case optimal join | LogicBlox, GraphFlow, EmptyHeaded |

Notations | Descriptions |
---|---|
q,G | Query graph and data graph |
V(g),E(g),Σ | Vertex set, edge set, and label set of a graph g |
d(u),L(u),N(u) | Degree, label, and neighbor vertex |
e(u,v) | The edge between u and v |
E(qt),E(qnt) | Tree edge and non-tree edge |
Cu | Set of candidate vertices of u in C |
φ and φ′ | Match order and index order |
N_+^φ(u), N_-^φ(u) | Neighbors of u before (after) u in φ |

Experimental Environment | Setting |
---|---|
CPU | Intel(R) Core(TM) i3-6100 |
Main frequency | 3.70 GHz |
Random access memory (RAM) | 7.7 GB |
Disk capacity | 1 TB |
System type | 64 bit |
Operating system | Ubuntu 22.04.2 LTS |
Programming environment | Microsoft VS Code 1.84.1 |
Programming language | C++ |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chai, Y.; Li, J.; Zhang, Q. Accelerating Subgraph Matching Through Advanced Compression and Label Filtering. Algorithms 2025, 18, 541. https://doi.org/10.3390/a18090541